Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means clustering

Alt text source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!

Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

  • culmen_length_mm: culmen length (mm)
  • culmen_depth_mm: culmen depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: penguin sex

Unfortunately, they have not been able to record the species of each penguin. They do know that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")

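# Work on a copy and drop rows with any missing value
penguins_clean = penguins_df.copy()
penguins_clean = penguins_clean.dropna(how='any')

# Remove implausible flipper lengths (non-positive, or 4500 mm and above)
penguins_clean = penguins_clean.loc[penguins_clean['flipper_length_mm'] < 4500]
penguins_clean = penguins_clean.loc[penguins_clean['flipper_length_mm'] > 0]
penguins_clean = penguins_clean.reset_index(drop=True)
print(penguins_clean.columns)

# One-hot encode the categorical sex column
df = pd.get_dummies(penguins_clean)
print(df.columns)
# Drop the dummy column created from rows whose sex is recorded as '.'
df = df.drop(columns='sex_.')
print(df.columns)

# Standardize all features to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(df)
penguins_preprocessed = pd.DataFrame(data=X, columns=df.columns)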
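
To make the preprocessing step easier to follow, here is a minimal sketch on a small made-up frame. The `toy` rows and their values are hypothetical and only stand in for the real file; the sketch shows what one-hot encoding and standardization each produce.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; the real data comes from data/penguins.csv
toy = pd.DataFrame({
    "culmen_length_mm": [39.1, 46.5, 50.0, 38.8],
    "body_mass_g": [3750.0, 5000.0, 5700.0, 3400.0],
    "sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, dtype=float)

# Standardize every column to zero mean and unit variance
toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_encoded),
    columns=toy_encoded.columns,
)
print(toy_scaled.round(2))
```

After scaling, the millimetre and gram columns live on the same scale, so no single measurement dominates the distance calculations that K-means relies on.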

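# Fit a full PCA to inspect how much variance each component explains
pca = PCA()
pca.fit(penguins_preprocessed)
# Keep the components that each explain more than 10% of the variance
n_components = sum(pca.explained_variance_ratio_ > 0.1)

# Refit with the selected number of components and project the data
pca = PCA(n_components=n_components)
penguins_PCA = pca.fit_transform(penguins_preprocessed)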
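
The component-selection rule used here (keep every component whose explained variance ratio is above 0.1) can be tried on synthetic data. This is only a sketch: `X_demo` is a made-up matrix, not the penguin data, built so that two of its five features carry almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features; the first two features have
# much larger spread than the remaining three (hypothetical values)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.1, 0.1, 0.1])

# Fit with all components to read off the explained variance ratios
ratios = PCA().fit(X_demo).explained_variance_ratio_

# Count the components that each explain more than 10% of the variance
n_keep = int((ratios > 0.1).sum())

# Project onto only those components
X_reduced = PCA(n_components=n_keep).fit_transform(X_demo)
print(ratios.round(3), n_keep, X_reduced.shape)
```

The 0.1 threshold is a choice rather than a rule; a cumulative-variance target is another common way to pick the number of components.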
                      
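# Elbow method: fit K-means for k = 1..10 and record the inertia of each fit
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(penguins_PCA)
    inertia.append(kmeans.inertia_)

# Plot inertia against k and look for the point where the curve flattens
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()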
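
Reading an elbow off a plot is a judgment call, so it can help to look at the numbers as well. The sketch below uses synthetic blobs from `make_blobs` (an assumption, not the penguin data) with three well-separated groups and prints how much the inertia drops each time one more cluster is added.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (hypothetical centers)
centers = [[0, 0], [10, 10], [-10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.5, random_state=0)

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)

# drops[i] is the inertia lost when going from k = i + 1 to k = i + 2
drops = -np.diff(inertias)
for k, drop in zip(list(ks)[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

On data like this the drop shrinks sharply once k passes the true number of groups; on real data the change is usually more gradual, which is why the plot and the numbers are best read together.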

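# Number of clusters used for the final model
n_clusters = 4

kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_PCA)

# Plot the clusters on the first two principal components
plt.scatter(penguins_PCA[:, 0], penguins_PCA[:, 1], c=kmeans.labels_)
plt.xlabel('First PCA Component')
plt.ylabel('Second PCA Component')
plt.title(f'K-means Clustering (k={n_clusters})')
plt.show()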

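# Attach the cluster label to each penguin
penguins_clean['label'] = kmeans.labels_

# Mean of the numeric measurements per cluster
# (select several columns with a list; tuple-style selection fails in pandas 2.x)
stat_penguins = penguins_clean.groupby('label')[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']].mean()
print(stat_penguins)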
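
The per-cluster summary can be checked by hand on a tiny example. The frame below is made up (hypothetical measurements and labels) and only illustrates how `groupby` with a list of columns returns one row of means per cluster label.

```python
import pandas as pd

# Hypothetical measurements with an already assigned cluster label
toy = pd.DataFrame({
    "culmen_length_mm": [39.0, 41.0, 47.0, 49.0],
    "flipper_length_mm": [181.0, 185.0, 215.0, 221.0],
    "label": [0, 0, 1, 1],
})

# Select the columns with a list, then average within each label
cluster_means = toy.groupby("label")[["culmen_length_mm", "flipper_length_mm"]].mean()
print(cluster_means)
```

Comparing these means across labels is one way to judge whether the clusters differ in ways that could plausibly correspond to different species.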