Project: Clustering techniques for ecological research
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica. The data is available in CSV format as penguins.csv.
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns.
| Column | Description |
|---|---|
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |
Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import required packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()
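Before preprocessing, it is worth confirming the column types and checking for missing values, since StandardScaler and KMeans cannot handle NaNs. A minimal sketch (the commented dropna line is only needed if the CSV actually contains incomplete rows):
# Inspect column types and count missing values per column
penguins_df.info()
print(penguins_df.isna().sum())
# If any rows are incomplete, one simple option is to drop them:
# penguins_df = penguins_df.dropna().reset_index(drop=True)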
## Preprocess dataset
# Use get_dummies method on the categorical feature
penguins_dummies = pd.get_dummies(penguins_df['sex'], drop_first=True)
penguins = pd.concat([penguins_df, penguins_dummies], axis=1)
penguins = penguins.drop('sex', axis=1)
# Standardise and scale before clustering
scaler = StandardScaler()
penguins_scaled = scaler.fit_transform(penguins)
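K-means relies on Euclidean distances, so features measured on large scales (such as body mass in grams) would otherwise dominate the clustering. A quick sketch to confirm that each scaled column now has mean roughly 0 and standard deviation roughly 1 (numpy is assumed to be available alongside the imports above):
import numpy as np
# Each column of the scaled array should have mean ~0 and standard deviation ~1
print(np.round(penguins_scaled.mean(axis=0), 3))
print(np.round(penguins_scaled.std(axis=0), 3))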
## Detect the optimal number of clusters for k-means clustering
# Create an empty list to store inertia values
inertia_list = []
# Loop through cluster numbers from 1 to 9
for k in range(1, 10):
    # Apply KMeans clustering for the current k
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(penguins_scaled)
    # Store the inertia for each k
    inertia_list.append(kmeans.inertia_)
# Visualize the inertia values (Elbow plot)
plt.figure(figsize=(8, 5))
plt.plot(range(1, 10), inertia_list, marker='o', linestyle='-', color='b')
plt.title('Elbow method for optimal number of clusters', fontweight='bold')
plt.xlabel('Number of clusters', fontweight='bold')
plt.ylabel('Inertia', fontweight='bold')
plt.grid(True)
plt.show()
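The elbow plot can be ambiguous on its own; average silhouette scores give a complementary view of cluster separation. A minimal sketch using scikit-learn's silhouette_score (the silhouette is only defined for k >= 2):
from sklearn.metrics import silhouette_score
# Higher average silhouette indicates better-separated clusters
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(penguins_scaled)
    print(f"k={k}: silhouette = {silhouette_score(penguins_scaled, labels):.3f}")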
## Run the k-means clustering algorithm
# Apply KMeans with 6 clusters
kmeans = KMeans(n_clusters=6, random_state=42)
kmeans.fit(penguins_scaled)
# Get cluster labels for each data point
cluster_labels = kmeans.labels_
# Add cluster labels to the original DataFrame
penguins['cluster'] = cluster_labels
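To interpret the clusters in the original measurement units, the fitted cluster centres can be mapped back through the scaler. A minimal sketch using the scaler and model fitted above:
# Cluster centres expressed in the original (unscaled) feature units
centres = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                       columns=penguins.drop('cluster', axis=1).columns)
print(centres.round(1))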
## Visualize the clusters
# Calculate variance to guide feature selection
variance = penguins[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']].var()
# Display variance
print(variance.sort_values(ascending=False))
# Plot 2 features with the highest variance: body mass vs flipper length
plt.figure(figsize=(10, 6))
plt.scatter(penguins['body_mass_g'], penguins['flipper_length_mm'], c=cluster_labels,
edgecolor='k', s=80) # Edge and size of markers
plt.title('K-Means clustering (k=6): body mass vs flipper length', fontweight='bold')
plt.xlabel('Body mass (g)', fontweight='bold')
plt.ylabel('Flipper length (mm)', fontweight='bold')
plt.grid(True)
plt.colorbar(label='Cluster') # Color bar to indicate cluster labels
plt.show()
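The scatter plot above uses only two of the measured features; projecting the scaled data onto its first two principal components gives a single view that combines all features. A minimal sketch using scikit-learn's PCA (the component axes are unitless):
from sklearn.decomposition import PCA
# Project the scaled features onto two principal components for plotting
pca = PCA(n_components=2)
penguins_pca = pca.fit_transform(penguins_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(penguins_pca[:, 0], penguins_pca[:, 1], c=cluster_labels, edgecolor='k', s=80)
plt.title('K-Means clusters in PCA space', fontweight='bold')
plt.xlabel('Principal component 1', fontweight='bold')
plt.ylabel('Principal component 2', fontweight='bold')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()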
## Create a final characteristic table for each cluster
# Identify numeric (non-binary) columns
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
# Add cluster labels to the dataset
penguins_df['label'] = kmeans.labels_ # Assign cluster labels from KMeans
# Aggregate data using groupby and calculate the mean for each cluster
stat_penguins = penguins_df.groupby('label')[numeric_columns].mean()
# Display the final characteristic DataFrame
print(stat_penguins)
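Listing the number of penguins in each cluster alongside the mean profile makes the table easier to read and highlights any very small clusters. A minimal sketch:
# Add the number of penguins per cluster to the summary table
stat_penguins['count'] = penguins_df['label'].value_counts().sort_index()
print(stat_penguins.round(1))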