Unsupervised Learning in Python
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy.stats 
# Import the course datasets 
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- You work for an agricultural research center. Your manager wants you to group seed varieties based on the different measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
- In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements (see the standardization sketch after this list). You can then compare your cluster labels with the actual fish species (first column).
- The wine DataFrame contains three class_labels. Transform the features to get the most accurate clustering.
- In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram (see the dendrogram sketch after this list).
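As a starting point for the fish and wine items above, here is a minimal sketch of standardizing features before clustering. It assumes the first column of fish holds the species and the remaining columns are numeric measurements; the number of clusters is also an assumption, not given by the data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
# Assumption: column 0 is the species label, the remaining columns are measurements
species = fish.iloc[:, 0]
measurements = fish.iloc[:, 1:]
# Standardize the features, then cluster (4 clusters is a guess, not a given)
scaler_kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=4, random_state=42))
scaler_kmeans.fit(measurements)
fish_labels = scaler_kmeans.predict(measurements)
# Compare cluster labels with the actual species
print(pd.crosstab(fish_labels, species))
The same StandardScaler-then-KMeans pipeline is a reasonable first transformation to try on the wine features.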
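For the eurovision item, a sketch of complete-linkage hierarchical clustering is below. The pivot step assumes the CSV is in long format with 'From country', 'To country', and 'Televote Points' columns; adjust those names to match the actual file.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# Assumption: reshape the long-format votes into one row per voting country
votes = eurovision.pivot_table(index='From country', columns='To country',
                               values='Televote Points', fill_value=0)
# Complete-linkage hierarchical clustering of the voting countries
mergings = linkage(votes.values, method='complete')
# Plot the dendrogram, labelling leaves with country names
dendrogram(mergings, labels=votes.index.to_list(), leaf_rotation=90, leaf_font_size=6)
plt.show()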
Clustering grains
# Print the number of unique values in each column of grains
print(grains.nunique())
grains.head(3)
K-Means finds clusters of samples; the number of clusters must be specified in advance.
import pandas as pd
from sklearn.cluster import KMeans
samples = grains.iloc[:, :7]
model = KMeans(n_clusters=3, random_state=42)
model.fit(samples)
# Predict which cluster each sample belongs to
labels = model.predict(samples)
print(labels)
KMeans can predict the cluster of new, unseen samples by remembering the mean of each cluster (the centroids): each new sample is assigned to its nearest centroid.
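A minimal sketch of this, using made-up measurements for two hypothetical new grain samples (same 7 features as the training data):
import numpy as np
# Hypothetical new grain measurements, invented for illustration only
new_samples = np.array([
    [14.1, 14.9, 0.88, 5.5, 3.3, 2.7, 5.2],
    [18.7, 16.3, 0.89, 6.3, 3.8, 3.7, 6.1],
])
# Each new sample is assigned to the cluster of its nearest centroid
new_labels = model.predict(new_samples)
print(new_labels)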
Quality of Clusters
import matplotlib.pyplot as plt
# Assign the first two columns of samples: xs and ys
xs = samples.iloc[:, 0]
ys = samples.iloc[:, 1]
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)
# Get the cluster centroids and overlay them on the scatter plot
centroids = model.cluster_centers_
x = centroids[:, 0]
y = centroids[:, 1]
plt.scatter(x, y, marker='D', s=50)
plt.title("KMeans clustering")
plt.show()
# Compare cluster labels with the known grain varieties
df = pd.DataFrame({'labels': labels, 'variety': grains['variety']})
print(df)
# Cross-tabulate cluster labels against varieties
ct = pd.crosstab(df['labels'], df['variety'])
print(ct)
Inertia measures how far the samples are from their centroids; the lower the inertia, the tighter (better) the clustering.
print(model.inertia_)
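Inertia can also guide the choice of the number of clusters: fit KMeans for a range of k values, plot the inertia, and look for the "elbow" where it stops dropping sharply. A minimal sketch, reusing the samples defined above:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
ks = range(1, 7)
inertias = []
for k in ks:
    # Fit a model with k clusters and record its inertia
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(samples)
    inertias.append(km.inertia_)
# Inertia always decreases as k grows; look for the elbow
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(list(ks))
plt.show()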