Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means clustering

Artwork: @allison_horst (https://github.com/allisonhorst/penguins)

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!

Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

  • culmen_length_mm: culmen length (mm)
  • culmen_depth_mm: culmen depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: penguin sex

Unfortunately, they have not been able to record the species of each penguin, but they know that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")

# Examine data and clean columns as necessary
penguins_df.info()
display(penguins_df.isna().sum())
# There are 9 rows with missing values in the sex column, and 2 rows that are entirely blank
penguins_df[penguins_df.isna().any(axis=1)]
# This is a small fraction of the data (less than 3%), so we drop the rows with missing values
penguins_df.dropna(subset=['sex'], inplace=True)
penguins_df.info()
penguins_df['sex'] = penguins_df['sex'].astype('category')
# What about outliers?
penguins_df.boxplot()
plt.show()
# 2 outliers are present in flipper_length_mm. In the interest of producing robust,
# reproducible code, we will remove them using the IQR rule rather than by hand
flipperdata = penguins_df['flipper_length_mm']
q1, q3 = np.quantile(flipperdata, [0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Filter data (copy to avoid chained-assignment warnings later)
penguins_clean = penguins_df[(penguins_df['flipper_length_mm'] > lower_bound) & (penguins_df['flipper_length_mm'] < upper_bound)].copy()
penguins_clean.info()
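# Quick sanity check (a minimal sketch reusing the frames defined above): report
# how many rows the IQR filter removed
print(f"Rows removed as flipper-length outliers: {len(penguins_df) - len(penguins_clean)}")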
# Pre-process data: 1) one-hot encoding 2) StandardScaler 3) save as penguins_preprocessed

# 1. Create dummy variables; drop the spurious 'sex_.' column produced by an invalid '.' entry in sex
df_dummies = pd.get_dummies(penguins_clean).drop("sex_.", axis=1)

# 2. Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(df_dummies)

# 3. Save the scaled data as penguins_preprocessed
penguins_preprocessed = pd.DataFrame(data=X, columns=df_dummies.columns)
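# Optional check (a small sketch using penguins_preprocessed from above): after
# StandardScaler each column should have mean ~0 and standard deviation ~1
print(penguins_preprocessed.describe().loc[['mean', 'std']].round(2))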
# PCA

pca = PCA()

pca.fit(penguins_preprocessed)

components = range(pca.n_components_)

plt.bar(components, pca.explained_variance_ratio_)
plt.xticks(components)
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()

# Keep only the components that each explain more than 10% of the variance
n_components = sum(pca.explained_variance_ratio_ > 0.1)

pca = PCA(n_components=n_components)
penguins_PCA = pca.fit_transform(penguins_preprocessed)
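# Sketch of a quick check on the refitted PCA above: how much total variance do
# the retained components explain?
print(f"Retained {n_components} components explaining "
      f"{pca.explained_variance_ratio_.sum():.1%} of the total variance")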
# Determine optimal number of clusters

km_inertia = []
clusters_range = range(1,11)
# run multiple KMeans models to determine optimal clusters
for n_clusters in clusters_range:
    km_penguins = KMeans(n_clusters=n_clusters,random_state=42).fit(penguins_PCA)
    km_inertia.append(km_penguins.inertia_)
# plot inertia (y) vs clusters (x)
plt.plot(clusters_range,km_inertia)
plt.ylabel('Model inertia')
plt.xlabel('Number of clusters')
plt.title('Elbow method: n_clusters = 4 is optimal')
plt.show()
# Based on the elbow in the plot above
n_clusters = 4
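# Complementary check (a sketch, not part of the elbow analysis above): the
# silhouette score offers a second opinion on the chosen number of clusters
from sklearn.metrics import silhouette_score
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(penguins_PCA)
    print(f"k={k}: silhouette={silhouette_score(penguins_PCA, labels):.3f}")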
# Run KMeans with 4 clusters
kmeans = KMeans(n_clusters=n_clusters,random_state=42).fit(penguins_PCA)
# Visualize the clusters in x-y plane
plt.scatter(penguins_PCA[:, 0], penguins_PCA[:,1], c=kmeans.labels_)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title(f'K-means clustering (K={n_clusters})')
plt.show()

# Create stats data frame
penguins_clean['label'] = kmeans.labels_
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm','label']
stat_penguins = penguins_clean[numeric_columns].groupby('label').mean()
stat_penguins
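# Interpretation aid (a sketch using the labelled penguins_clean frame above):
# cluster sizes and how each cluster splits by sex
print(penguins_clean['label'].value_counts().sort_index())
print(pd.crosstab(penguins_clean['label'], penguins_clean['sex']))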