Unsupervised Learning in Python
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy.stats
# Import the course datasets
grains = pd.read_csv('datasets/grains.csv')
fish = pd.read_csv('datasets/fish.csv', header=None)
wine = pd.read_csv('datasets/wine.csv')
eurovision = pd.read_csv('datasets/eurovision-2016.csv')
stocks = pd.read_csv('datasets/company-stock-movements-2010-2015-incl.csv', index_col=0)
digits = pd.read_csv('datasets/lcd-digits.csv', header=None)
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- You work for an agricultural research center. Your manager wants you to group seed varieties based on the measurements contained in the grains DataFrame. They also want to know how your clustering solution compares to the seed types listed in the dataset (the variety_number and variety columns). Try to use all of the relevant techniques you learned in Unsupervised Learning in Python!
- In the fish DataFrame, each row represents an individual fish. Standardize the features and cluster the fish by their measurements. You can then compare your cluster labels with the actual fish species (first column).
- The wine DataFrame has three class_labels. Transform the features to get the most accurate clustering.
- In the eurovision DataFrame, perform hierarchical clustering of the voting countries using complete linkage and plot the resulting dendrogram.
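The hierarchical clustering task in the last item can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the eurovision DataFrame: the `samples` array and the `names` list are made-up stand-ins, and the choice of two flat clusters is an assumption for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for a samples-by-features array (NOT the eurovision data):
# two well-separated groups of five points each
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(5, 4)),
    rng.normal(loc=5.0, scale=0.5, size=(5, 4)),
])
names = [f"item_{i}" for i in range(len(samples))]  # hypothetical row labels

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members
mergings = linkage(samples, method="complete")

# Build the dendrogram structure; remove no_plot=True to draw it with matplotlib
tree = dendrogram(mergings, labels=names, leaf_rotation=90,
                  leaf_font_size=6, no_plot=True)

# Cut the tree into (at most) two flat clusters
flat_labels = fcluster(mergings, t=2, criterion="maxclust")
print(flat_labels)
```

With real data, `samples` would be the numeric values of the DataFrame and `names` the row identifiers shown on the dendrogram's leaves.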
grains.variety.value_counts()
# print(grains.iloc[:,0].values)
# print(grains.iloc[:,2].values)
grains_56 = grains[['5','6','variety']]
grains_56.sample(6)
sns.scatterplot(data=grains_56, x='5', y='6', hue="variety")
sns.pairplot(grains.drop('variety_number', axis=1))
# pd.plotting.scatter_matrix(grain_vals_df, figsize=(10,10))
# x = grains.iloc[:,0].values
# y = grains.iloc[:,6].values
# plt.scatter(x, y,
# # c=grains.variety,
# alpha=0.6)
# plt.show()
from sklearn.cluster import KMeans
model = KMeans(n_clusters=7)
model.fit(grains.iloc[:,:7])
centroids = model.cluster_centers_
print(centroids)
ks = range(1, 11)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(grains.iloc[:,:7])
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
model = KMeans(n_clusters=3)
model.fit(grains.iloc[:,:7])
centroids = model.cluster_centers_
print(centroids)
# Use fit_predict to fit the model and obtain cluster labels: labels
# Note: this refits the model. fit_predict() also works on a pipeline whose final step is a clusterer.
labels = model.fit_predict(grains.iloc[:,:7])
varieties = grains.variety
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
print(ct)
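One way to condense a crosstab like this into a single number is to count, for each cluster, the samples belonging to its most common variety. The sketch below uses made-up labels purely for illustration (not results from the grains data), and the `majority` and `purity` names are choices made for this example.

```python
import pandas as pd

# Hypothetical labels, for illustration only (not results from the grains data)
df = pd.DataFrame({
    "labels": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    "varieties": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "A"],
})
ct = pd.crosstab(df["labels"], df["varieties"])

# For each cluster (row), the variety that appears most often
majority = ct.idxmax(axis=1)

# Fraction of all samples that fall in their cluster's majority variety
# (sometimes called cluster purity)
purity = ct.max(axis=1).sum() / ct.values.sum()

print(majority)
print(purity)
```

A value close to 1 means each cluster is dominated by a single variety. It says nothing about whether one variety is split across several clusters, so the full crosstab is still worth reading.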
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Create scaler: scaler
scaler = StandardScaler()
# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=3)
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
# Fit the pipeline to samples
pipeline.fit(grains.iloc[:,:7])
# Calculate the cluster labels: labels
labels = pipeline.predict(grains.iloc[:,:7])
# Create a DataFrame with labels and variety as columns: df
df = pd.DataFrame({'labels':labels, 'variety':grains.variety})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['variety'])
# Display ct
print(ct)
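To see why standardizing before KMeans can matter, here is a small sketch on synthetic data (not the grains or fish measurements). It assumes one informative feature on a small scale and one uninformative feature on a large scale. The `agreement` helper is a hypothetical convenience for this two-cluster example, and `random_state` and `n_init` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first feature separates the two groups,
# the second is pure noise on a much larger scale
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 100)
signal = truth + rng.normal(scale=0.05, size=200)
noise = rng.normal(scale=1000.0, size=200)
X = np.column_stack([signal, noise])

def agreement(labels, truth):
    """Share of samples matching the true groups, allowing for swapped cluster numbers."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

# KMeans on the raw features: distances are dominated by the noisy feature
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Same model behind a StandardScaler: both features get unit variance first
scaled = make_pipeline(StandardScaler(),
                       KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled.fit_predict(X)

raw_agreement = agreement(raw_labels, truth)
scaled_agreement = agreement(scaled_labels, truth)
print(raw_agreement, scaled_agreement)
```

On real measurements the gap is usually less extreme, so comparing the crosstabs with and without the scaler is the more direct check.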