Project: Clustering Antarctic Penguin Species
Antarctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means Clustering
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns:
- culmen_length_mm: culmen length (mm)
- culmen_depth_mm: culmen depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- sex: penguin sex
Unfortunately, they have not been able to record the species of each penguin. They do know that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")
# Work on a copy and drop rows with any missing value
penguins_clean = penguins_df.copy()
penguins_clean = penguins_clean.dropna(how='any')
# Remove outliers: keep only flipper lengths strictly between 0 and 4500 mm
penguins_clean = penguins_clean.loc[penguins_clean['flipper_length_mm'] < 4500]
penguins_clean = penguins_clean.loc[penguins_clean['flipper_length_mm'] > 0]
penguins_clean = penguins_clean.reset_index(drop=True)
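The cleaning step above can be tried on a small hand-made table. This is only a sketch: the `toy` frame and its values are invented for illustration and are not taken from the penguin data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and two implausible flipper lengths
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 5000.0, -132.0, np.nan, 210.0],
    "body_mass_g": [3750.0, 3800.0, 3900.0, 4000.0, 4500.0],
})

# Drop rows with any missing value
toy_clean = toy.dropna(how="any")

# Keep only rows whose flipper length lies strictly between 0 and 4500
in_range = (toy_clean["flipper_length_mm"] > 0) & (toy_clean["flipper_length_mm"] < 4500)
toy_clean = toy_clean.loc[in_range].reset_index(drop=True)

print(toy_clean)
```

Combining both bounds into one boolean mask gives the same result as the two successive `.loc` filters used above.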
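print(penguins_clean.columns)
# One-hot encode the categorical 'sex' column
df = pd.get_dummies(penguins_clean)
print(df.columns)
# Drop the dummy column created by the placeholder sex value '.'
df = df.drop(columns='sex_.')
print(df.columns)
# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(df)
penguins_preprocessed = pd.DataFrame(data=X, columns=df.columns)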
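To see what the encoding and scaling steps produce, here is a minimal sketch on an invented four-row table. The `toy` frame is hypothetical, and passing `dtype=float` to `pd.get_dummies` is a choice made for this sketch (not part of the code above) so that the dummy columns are numeric before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, 5000.0, 5400.0],
    "sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})

# One-hot encode 'sex'; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, dtype=float)

# Standardize each column to zero mean and unit (population) variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
)

print(encoded.columns.tolist())
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

`StandardScaler` divides by the population standard deviation, which is why the check uses `ddof=0`.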
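# Fit a full PCA to inspect how much variance each component explains
pca = PCA()
df_pca = pca.fit(penguins_preprocessed)
# Keep only the components that each explain more than 10% of the variance
n_components = sum(df_pca.explained_variance_ratio_ > 0.1)
pca = PCA(n_components=n_components)
penguins_PCA = pca.fit_transform(penguins_preprocessed)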
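The component-selection rule used above (keep every component whose explained variance ratio exceeds 0.1) can be sketched on synthetic data. The random features and their spreads below are assumptions chosen so that two components clearly pass the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: three independent features with very different spreads
X = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 0.1])

# Fit a full PCA first to inspect the variance explained by each component
full_pca = PCA().fit(X)
ratios = full_pca.explained_variance_ratio_

# Count the components that each explain more than 10% of the variance
n_keep = int((ratios > 0.1).sum())

# Refit with only that many components and project the data
X_reduced = PCA(n_components=n_keep).fit_transform(X)

print(ratios.round(3), n_keep, X_reduced.shape)
```

The 0.1 threshold is a heuristic. A common alternative is to keep enough components to reach a target cumulative explained variance.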
# Elbow analysis: record the inertia for k = 1..10
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans = kmeans.fit(penguins_PCA)
    inertia.append(kmeans.inertia_)
plt.plot(range(1,11),inertia,marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
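As a rough illustration of how the elbow shows up in the numbers rather than in the plot, the sketch below runs the same loop on synthetic blobs. The blob centers, `cluster_std`, and `n_init=10` are assumptions made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs in two dimensions
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Record the inertia (within-cluster sum of squares) for k = 1..6
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Reduction in inertia gained by each additional cluster
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, value in enumerate(inertias, start=1):
    print(k, round(value, 1))
```

On data like this, the drop from k=2 to k=3 is large and the drops after k=3 are small, which is the "elbow" the plot is meant to reveal. Real data rarely shows such a sharp bend.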
# Number of clusters chosen after inspecting the elbow plot
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_PCA)
plt.scatter(penguins_PCA[:,0],penguins_PCA[:,1],c=kmeans.labels_)
plt.xlabel('First PCA Component')
plt.ylabel('Second PCA Component')
plt.title('K-means Clustering (k=4)')
plt.show()
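# Attach the cluster labels and compute the mean of each measurement per cluster
penguins_clean['label'] = kmeans.labels_
# Select several columns with a list; tuple-style selection fails in pandas 2.x
stat_penguins = penguins_clean.groupby('label')[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']].mean()
print(stat_penguins)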
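The per-cluster summary relies on selecting several columns from a grouped frame, which needs a list of column names. A minimal sketch on an invented table (the `toy` values and labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a precomputed cluster label
toy = pd.DataFrame({
    "culmen_length_mm": [39.0, 41.0, 47.0, 49.0],
    "flipper_length_mm": [180.0, 190.0, 215.0, 225.0],
    "label": [0, 0, 1, 1],
})

# Columns to summarize, passed as a list to the grouped frame
numeric_columns = ["culmen_length_mm", "flipper_length_mm"]

# Mean of each selected column within each cluster
cluster_means = toy.groupby("label")[numeric_columns].mean()

print(cluster_means)
```

Each row of `cluster_means` is one cluster, so the table can be read as a profile of the average penguin in each group.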