
Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means Clustering

Artwork source: @allison_horst (https://github.com/allisonhorst/penguins)

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!

Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

  • culmen_length_mm: culmen length (mm)
  • culmen_depth_mm: culmen depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: penguin sex

Unfortunately, the researchers have not been able to record the species of each penguin. They do know, however, that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!



How to approach the project

  1. Load and examine the dataset

  2. Deal with null values and outliers

  3. Preprocess the dataset to create dummy variables

  4. Preprocess the dataset: scaling

  5. Perform PCA

  6. Detect the optimal number of clusters for k-means clustering

  7. Run the k-means clustering algorithm

  8. Create a final statistical DataFrame for each cluster

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")
# Investigate the dataset
display(penguins_df.head())
display(penguins_df.describe())
penguins_df.info()  # info() prints directly and returns None, so no display() needed
display(penguins_df.isna().sum())
# Visualising the dataset
sns.boxplot(penguins_df.iloc[:, :4])
plt.show()
print("There are outliers in the 'flipper_length_mm' column")
# Dealing with na & outliers
# Remove na values
penguins_clean = penguins_df.dropna()
print(penguins_clean.isna().sum())

# Remove outliers
# Find the max & min values in the 'flipper_length_mm' column
max_flipper_length = penguins_clean['flipper_length_mm'].max()
min_flipper_length = penguins_clean['flipper_length_mm'].min()

# Find the index label(s) corresponding to the max & min value
max_index = penguins_clean[penguins_clean['flipper_length_mm'] == max_flipper_length].index
min_index = penguins_clean[penguins_clean['flipper_length_mm'] == min_flipper_length].index

# Concatenate the index arrays into a single index
indexes_to_drop = max_index.append(min_index)

# Drop the row(s) with the extreme flipper lengths
penguins_clean = penguins_clean.drop(indexes_to_drop)

# Re-visualise the clean dataset
sns.boxplot(penguins_clean.iloc[:, :4])
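The cell above drops only the single largest and smallest flipper-length rows. A more general approach, sketched below on synthetic values (the numbers are illustrative, not from the real dataset), is the 1.5 × IQR rule, which flags any value outside the interquartile fences:

```python
import pandas as pd

# Synthetic flipper lengths with two obvious outliers (illustrative values only)
flippers = pd.Series([181, 186, 190, 195, 210, 215, 220, 5000, -132])

# Compute the interquartile range and the 1.5 * IQR fences
q1, q3 = flippers.quantile(0.25), flippers.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences
kept = flippers[(flippers >= lower) & (flippers <= upper)]
print(kept.tolist())
```

Unlike dropping the max and min, this removes every extreme value in one pass, and removes nothing when the data contain no outliers.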
# Perform preprocessing steps on the dataset to create dummy variables
# Create the dummy variables and drop the 'sex_.' dummy column produced by a
# stray '.' placeholder in the 'sex' column
encoded = pd.get_dummies(penguins_clean, columns=['sex']).drop(columns=['sex_.'])

# Scale the data using the standard scaling method
scaler = StandardScaler()
X = scaler.fit_transform(encoded)

# Save the updated data as a new DataFrame called penguins_preprocessed
penguins_preprocessed = pd.DataFrame(X, columns=encoded.columns)
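The `drop(columns=['sex_.'])` step above suggests the raw data contain a stray `'.'` placeholder in the `sex` column. A tiny illustrative frame (hypothetical values, not the real dataset) shows why that produces a `sex_.` dummy column:

```python
import pandas as pd

# Hypothetical 'sex' column containing a '.' placeholder value
demo = pd.DataFrame({'sex': ['MALE', 'FEMALE', '.', 'MALE']})

# get_dummies creates one indicator column per distinct value,
# so the '.' placeholder becomes its own 'sex_.' column
dummies = pd.get_dummies(demo, columns=['sex'])
print(dummies.columns.tolist())
```

Dropping `sex_.` after encoding discards the meaningless placeholder category while keeping the valid `sex_FEMALE` and `sex_MALE` indicators.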
# Perform Principal Component Analysis (PCA)
model = PCA()
model.fit(penguins_preprocessed)
var_ratio = model.explained_variance_ratio_

# Keep only the components that each explain more than 10% of the variance,
# then refit with that number of components
n_components = sum(var_ratio > 0.1)
model = PCA(n_components=n_components)
penguins_PCA = model.fit_transform(penguins_preprocessed)
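It can be worth checking how much total variance the retained components capture. A minimal sketch on synthetic data (the array, seed, and 10% threshold are illustrative assumptions, not project values):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five features, two of them strongly correlated
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Fit a full PCA, then count components explaining > 10% of variance each
pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.sum(pca.explained_variance_ratio_ > 0.1))
print(f"components kept: {n_keep}, variance retained: {cumulative[n_keep - 1]:.2%}")
```

The cumulative sum makes explicit what the per-component threshold implies: how much of the original variance survives the dimensionality reduction.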
# Detect the optimal number of clusters for k-means clustering
inertia = []

# Elbow analysis
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_PCA)
    inertia.append(kmeans.inertia_)
# Visualise the inertia
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()
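Reading an elbow by eye can be ambiguous, so a complementary check (not part of the project's required steps) is the silhouette score, which tends to peak near a good cluster count. Sketched here on synthetic blobs rather than the penguin data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette score for each candidate k (it is undefined for k=1)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Agreement between the elbow plot and the silhouette peak gives more confidence in the chosen number of clusters than either signal alone.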

# Decide the number of clusters
n_clusters = 4
# Run the k-means clustering algorithm
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_PCA)

# Visualise the clusters on the first two principal components
plt.scatter(penguins_PCA[:, 0], penguins_PCA[:, 1], c=kmeans.labels_)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
# Create a final statistical DataFrame for each cluster
penguins_clean['label'] = kmeans.labels_
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']
stat_penguins = penguins_clean.groupby('label')[numeric_columns].mean()
print(stat_penguins)
# Inspect the final outputs
print(penguins_PCA.shape)
display(penguins_clean)