Clustering Antarctic Penguin Species

Image source: @allison_horst, https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.

Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

Column	Description
culmen_length_mm	culmen length (mm)
culmen_depth_mm	culmen depth (mm)
flipper_length_mm	flipper length (mm)
body_mass_g	body mass (g)
sex	penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.manifold import TSNE

# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
df = penguins_df.copy()
penguins_df.head()

# Checking for missing data
df.isna().sum()

# Encode the categorical sex column as integers so it can be scaled and clustered
encoder = LabelEncoder()
df['sex'] = encoder.fit_transform(df['sex'])
df.head()

# Checking for outliers using box plots and saving the chart
plt.figure(figsize=(15, 10))

# List of columns to check for outliers
columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']

for i, column in enumerate(columns, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x=df[column])
    plt.title(f'Box plot of {column}')

plt.tight_layout()
plt.savefig('outliers_boxplot.png')
plt.show()
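
Box plots conventionally mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of the same rule applied numerically, so that suspect rows can be listed rather than read off the chart. The helper name `iqr_outlier_mask` and the sample values are hypothetical and are not taken from penguins.csv.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. NaN entries are never flagged."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical flipper lengths in mm; 5000.0 stands in for a data-entry error
sample = [181.0, 186.0, 195.0, 193.0, 190.0, 5000.0]
print(iqr_outlier_mask(sample))
```

In the notebook, something like `df[iqr_outlier_mask(df['flipper_length_mm'])]` would list the flagged rows for one column.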

# Scaling the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
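
`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. Without this step, body mass in grams would dominate the Euclidean distances that k-means relies on. The NumPy sketch below reproduces that arithmetic and its inverse on a small hypothetical array; the function names and the values are illustrative assumptions only.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: (x - mean) / std, with population std (ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against division by zero for constant columns
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std

def unstandardize(Z, mean, std):
    """Map z-scores back to the original units."""
    return np.asarray(Z, dtype=float) * std + mean

# Hypothetical rows: culmen length (mm), body mass (g)
X = np.array([[39.1, 3750.0], [46.5, 4800.0], [50.0, 5550.0]])
Z, mean, std = standardize(X)
print(Z.mean(axis=0).round(6))  # each column is centred on zero
print(Z.std(axis=0).round(6))   # each column has unit standard deviation
```

The inverse mapping is the same operation that `scaler.inverse_transform` performs when centroids are converted back to millimetres and grams.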

# Determining the best k: record the inertia for k = 1..10
scores = []
for k in np.arange(1, 11):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(df_scaled)
    scores.append(model.inertia_)

# Plotting the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(np.arange(1, 11), scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.savefig('elbow_plot.png')
plt.show()
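
The inertia plotted on the elbow chart is the sum of squared distances from each point to its assigned cluster centre. A small sketch of that quantity on hypothetical points; the function name `inertia` and the data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def inertia(X, centers, labels):
    """Sum of squared Euclidean distances from each row of X
    to the centre of the cluster it is assigned to."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    diffs = X - centers[labels]  # pair every point with its own centre
    return float((diffs ** 2).sum())

# Hypothetical points: two near (0, 1), one exactly on its centre
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
centers = [[0.0, 1.0], [10.0, 0.0]]
labels = [0, 0, 1]
print(inertia(X, centers, labels))  # 2.0
```

Because inertia generally keeps shrinking as k grows, the elbow plot is read for the bend where the improvement levels off rather than for a minimum.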

# Determine the best k using the silhouette score
silhouette_scores = []

for k in np.arange(2, 11):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(df_scaled)
    silhouette_avg = silhouette_score(df_scaled, labels)
    silhouette_scores.append(silhouette_avg)

# Plotting the silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(np.arange(2, 11), silhouette_scores, marker='o')  # x-axis matches k = 2..10
plt.title('Silhouette Score for Different k Values')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.xticks(np.arange(2, 11))
plt.grid(True)
plt.savefig('silhouette_scores_plot.png')
plt.show()
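
One common way to read the silhouette plot is to take the k with the highest average score. The sketch below applies that rule. The scores listed are hypothetical placeholders, not the values produced by this notebook, and the helper `best_k_by_silhouette` is an assumed name; the notebook itself does not state which rule led to its final choice of k.

```python
def best_k_by_silhouette(ks, scores):
    """Return the k whose average silhouette score is highest.
    Ties resolve to the smallest such k (first occurrence)."""
    if not ks or len(ks) != len(scores):
        raise ValueError("ks and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return ks[best_index]

# Hypothetical scores for k = 2..10, for illustration only
ks = list(range(2, 11))
example_scores = [0.35, 0.40, 0.55, 0.50, 0.45, 0.42, 0.40, 0.38, 0.36]
print(best_k_by_silhouette(ks, example_scores))  # 4
```

In the notebook, `best_k_by_silhouette(list(range(2, 11)), silhouette_scores)` would apply the same rule to the computed scores.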

# Final model with 5 clusters
model = KMeans(n_clusters=5, random_state=42)
labels = model.fit_predict(df_scaled)

# Visualizing the final clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df_scaled[:, 0], y=df_scaled[:, 1], hue=labels, palette='viridis', legend='full')
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.title('Final KMeans Clustering with k=5')
plt.xlabel('culmen_length_mm (standardized)')
plt.ylabel('culmen_depth_mm (standardized)')
plt.legend()
plt.savefig('final_kmeans_clustering.png')
plt.show()

# Calculating Silhouette Score
silhouette_avg = silhouette_score(df_scaled, labels)
print(f"Silhouette Score: {silhouette_avg:.3f}")

# Mean of each measurement per cluster, in the original units
stat_penguins = df.drop('sex', axis=1)
stat_penguins["Cluster"] = labels
stat_penguins = stat_penguins.groupby("Cluster").mean().round(2)

# Display the final DataFrame
stat_penguins.head()

# Applying t-SNE to project the data to two dimensions for plotting
tsne = TSNE(n_components=2, random_state=42)
df_tsne = tsne.fit_transform(df_scaled)

# Visualizing the clusters with t-SNE
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df_tsne[:, 0], y=df_tsne[:, 1], hue=labels, palette='viridis', legend='full')
plt.title('t-SNE enhanced Clusters Visualization')
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.legend()
plt.savefig('t-SNE_clusters_visualization.png')
plt.show()

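# Centroid coordinates in standardized units
centroids_scaled = model.cluster_centers_

# Map the centroids back to the original measurement units
centroids_original = scaler.inverse_transform(centroids_scaled)

# Convert to DataFrames, keeping the four measurement columns
# and dropping the encoded sex column
measurement_cols = penguins_df.drop('sex', axis=1).columns
centroid_scaled_df = pd.DataFrame(centroids_scaled[:, :4], columns=measurement_cols)
centroid_original_df = pd.DataFrame(centroids_original[:, :4], columns=measurement_cols)

print(centroid_scaled_df)
print(centroid_original_df)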

Cluster Analysis Report

Results

After performing a cluster analysis on the penguins dataset, the final model was fitted with k = 5 clusters and achieved a silhouette score of 0.520, indicating moderately well-separated clusters.

Summary Statistics of Clusters:

Cluster	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)
0	40.32	19.01	192.24	4034.64
1	45.56	14.24	212.71	4679.74
2	39.74	17.59	188.86	3410.68
3	50.96	19.20	199.08	3920.62
4	49.47	15.72	221.54	5484.84

Interpretation of Clusters:

Cluster 0: Penguins with short but deep culmens (40.32 mm long, 19.01 mm deep), relatively short flippers, and mid-range body mass.
Cluster 1: Penguins with the shallowest culmens (14.24 mm), the second-longest flippers, and the second-highest body mass.
Cluster 2: Penguins with the shortest culmens, the shortest flippers, and the lowest body mass, suggesting a smaller size category.
Cluster 3: Penguins with the longest and deepest culmens (50.96 mm long, 19.20 mm deep) and a comparatively low body mass.
Cluster 4: Penguins with the highest body mass and the longest flippers, potentially the largest species in the dataset.
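
Descriptions such as "longest" and "lowest" can be cross-checked by ranking the cluster means for each measurement. The sketch below does this with the values copied from the summary table; the dictionary layout and the helper `extremes` are illustrative assumptions, not part of the original notebook.

```python
# Cluster means copied from the summary statistics table
cluster_means = {
    "culmen_length_mm":  {0: 40.32, 1: 45.56, 2: 39.74, 3: 50.96, 4: 49.47},
    "culmen_depth_mm":   {0: 19.01, 1: 14.24, 2: 17.59, 3: 19.20, 4: 15.72},
    "flipper_length_mm": {0: 192.24, 1: 212.71, 2: 188.86, 3: 199.08, 4: 221.54},
    "body_mass_g":       {0: 4034.64, 1: 4679.74, 2: 3410.68, 3: 3920.62, 4: 5484.84},
}

def extremes(means_by_cluster):
    """Return (cluster with the lowest mean, cluster with the highest mean)."""
    lowest = min(means_by_cluster, key=means_by_cluster.get)
    highest = max(means_by_cluster, key=means_by_cluster.get)
    return lowest, highest

for feature, means in cluster_means.items():
    lowest, highest = extremes(means)
    print(f"{feature}: lowest in cluster {lowest}, highest in cluster {highest}")
```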

Evaluation:

The silhouette score of 0.520 indicates moderate cluster separation.
Cluster centroids were analysed to understand the characteristics of each group.
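
For reference, the silhouette value of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; values range from -1 to 1, and the reported score is the average over all points. A minimal NumPy sketch of that calculation on hypothetical points follows. The function name `mean_silhouette` and the data are assumptions for illustration, and the sketch needs at least two clusters.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient using Euclidean distance.
    Points in singleton clusters are given a score of 0."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix (fine for small data; quadratic in memory)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue
        # Mean distance to the other members of the same cluster
        # (the distance from the point to itself is 0, so divide by n - 1)
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Hypothetical, well-separated points: two near x = 0, two near x = 10
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
print(round(mean_silhouette(X, labels), 3))  # 0.9
```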

Conclusion

The k-means clustering model partitioned the penguins into five groups, which the summary statistics compare on culmen length, culmen depth, flipper length, and body mass. The encoded sex column was also among the clustered features, so some groups may separate by sex as well as by species; this is one possible reason for finding five clusters where three species are known. These clusters provide insight into different morphological patterns among the penguin population, which could be useful for further biological studies or conservation efforts. Further fine-tuning, such as incorporating additional features or testing hierarchical clustering, could improve segmentation clarity.