Project: Clustering Antarctic Penguin Species
Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means clustering
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns.
- culmen_length_mm: culmen length (mm)
- culmen_depth_mm: culmen depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- sex: penguin sex
Unfortunately, the researchers have not been able to record the species of each penguin. They do know that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")
# Examine data and clean columns as necessary
penguins_df.info()
display(penguins_df.isna().sum())
# The sex column has 9 missing values, and 2 rows are entirely blank.
penguins_df[penguins_df.isna().any(axis=1)]
# This is a small share of the data (less than 3%), so we drop the rows with missing values
penguins_df.dropna(subset=['sex'],inplace=True)
penguins_df.info()
penguins_df['sex'] = penguins_df['sex'].astype('category')
# What about outliers?
penguins_df.boxplot()
plt.show()
# 2 outliers are present in flipper_length_mm. In the interest of robust, reproducible code, we remove them with the 1.5 * IQR rule instead of hard-coding values
flipperdata = penguins_df['flipper_length_mm']
q1, q3 = np.quantile(flipperdata, [.25, .75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Filter data; .copy() avoids a SettingWithCopyWarning when a column is added later
penguins_clean = penguins_df[(penguins_df['flipper_length_mm'] > lower_bound) & (penguins_df['flipper_length_mm'] < upper_bound)].copy()
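The 1.5 * IQR rule can be checked in isolation on a handful of values. The sketch below is only an illustration: the helper name `iqr_bounds` and the toy flipper lengths are made up for this example and are not part of the penguin dataset.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical flipper lengths (mm) with two implausible entries
toy = pd.Series([181, 186, 195, 193, 190, 5000, 210, 215, -132, 198])
lower, upper = iqr_bounds(toy)

# Keep only the values strictly inside the fences
kept = toy[(toy > lower) & (toy < upper)]
print(lower, upper, len(kept))
```

Both fences are measured from the quartiles, not from the median, so the interval widens or narrows with the spread of the middle half of the data.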
penguins_clean.info()
# Pre-process data: 1) One Hot Encoding 2) StandardScaler 3) save as penguins_preprocessed
# 1. Creating dummy variables
df_dummies = pd.get_dummies(penguins_clean).drop("sex_.",axis=1)
# 2. Scaling
# Instantiate
scaler = StandardScaler()
# Fit & transform
X = scaler.fit_transform(df_dummies)
# 3. Save as penguins_preprocessed
penguins_preprocessed = pd.DataFrame(data=X,columns=df_dummies.columns)
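As a sanity check on the preprocessing steps, the same encode-then-scale sequence can be run on a tiny frame. This is a minimal sketch with invented values; the names `toy`, `toy_dummies`, and `toy_scaled` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented rows: one numeric column and one categorical column
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, 5200.0, 4700.0],
    "sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})

# One-hot encode the categorical column (numeric columns pass through unchanged)
toy_dummies = pd.get_dummies(toy, dtype=float)

# Standardize every column to zero mean and unit (population) variance
toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_dummies),
    columns=toy_dummies.columns,
)

print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

StandardScaler divides by the population standard deviation, which is why the check uses `ddof=0` and not the pandas default of `ddof=1`.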
# PCA
pca = PCA()
pca.fit(penguins_preprocessed)
# Plot the share of variance explained by each principal component
components = range(pca.n_components_)
plt.bar(components, pca.explained_variance_ratio_)
plt.xticks(components)
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()
# Keep only the components that each explain more than 10% of the variance
n_components = sum(pca.explained_variance_ratio_ > 0.1)
pca = PCA(n_components=n_components)
penguins_PCA = pca.fit_transform(penguins_preprocessed)
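The 10% threshold used to choose the number of components can be tried on synthetic data where the answer is known in advance. The sketch below assumes made-up data with two strong directions and three near-constant columns; `X_toy`, `n_keep`, and `X_reduced` are hypothetical names for this illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two high-variance columns plus three low-variance noise columns
signal = rng.normal(size=(200, 2)) * [5.0, 3.0]
noise = rng.normal(size=(200, 3)) * 0.1
X_toy = np.hstack([signal, noise])

# Fit a full PCA, then count components above the 10% threshold
pca_full = PCA().fit(X_toy)
n_keep = int((pca_full.explained_variance_ratio_ > 0.1).sum())

# Refit with only the retained components
X_reduced = PCA(n_components=n_keep).fit_transform(X_toy)
print(pca_full.explained_variance_ratio_.round(3), n_keep, X_reduced.shape)
```

The 0.1 cutoff is a judgment call. A cumulative-variance target is a common alternative, and it can select a different number of components.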
# Determine optimal number of clusters
km_inertia = []
clusters_range = range(1,11)
# run multiple KMeans models to determine optimal clusters
for n_clusters in clusters_range:
    km_penguins = KMeans(n_clusters=n_clusters,random_state=42).fit(penguins_PCA)
    km_inertia.append(km_penguins.inertia_)
# plot inertia (y) vs clusters (x)
plt.plot(clusters_range,km_inertia)
plt.ylabel('Model inertia')
plt.xlabel('Number of clusters')
plt.title('n_cluster = 4 is optimal based on Elbow Method')
plt.show()
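To see what an elbow looks like when the true number of groups is known, the same inertia loop can be run on synthetic blobs. This is a sketch under stated assumptions: the three blob centers, the spread, and `n_init=10` are choices made for this illustration and say nothing about the penguin data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Inertia (within-cluster sum of squares) for k = 1..6
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    inertias.append(km.inertia_)

# Drop in inertia gained by each additional cluster
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print([round(v, 1) for v in inertias])
print([round(v, 1) for v in drops])
```

On data like this the drops are large up to k = 3 and small afterwards. That flattening is the bend the elbow method looks for.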
# Based on elbow analysis of plot
n_clusters = 4
# Run KMeans with 4 clusters
kmeans = KMeans(n_clusters=n_clusters,random_state=42).fit(penguins_PCA)
# Visualize the clusters in x-y plane
plt.scatter(penguins_PCA[:, 0], penguins_PCA[:,1], c=kmeans.labels_)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title(f'K-means clustering (K={n_clusters})')
plt.show()
# Create stats data frame
penguins_clean['label'] = kmeans.labels_
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm','label']
stat_penguins = penguins_clean[numeric_columns].groupby('label').mean()
stat_penguins