Alt text source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.

Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

| Column            | Description         |
|-------------------|---------------------|
| culmen_length_mm  | culmen length (mm)  |
| culmen_depth_mm   | culmen depth (mm)   |
| flipper_length_mm | flipper length (mm) |
| body_mass_g       | body mass (g)       |
| sex               | penguin sex         |

Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()

This project will be divided into four steps:

  1. preprocess the data;
  2. detect the optimal number of clusters for k-means;
  3. run the k-means clustering;
  4. create the final DataFrame stat_penguins, where each row represents a cluster and the columns are the original variables.

Preprocessing the data

Firstly, the sex column is categorical, so I need to convert it to dummy variables for male and female and then drop the original column. Secondly, let's check for any null values. Finally, we'll standardize the numeric values while keeping the dummy column as a binary column; this step will be handled through the model's pipeline, which will be introduced in the next section.

# get dummies from the sex column
sex_dummy = pd.get_dummies(penguins_df['sex'], drop_first=True) # Male is set to 1, female to 0
penguins_df = pd.concat([penguins_df, sex_dummy], axis=1)
penguins_df.drop('sex', axis=1, inplace=True)

print(penguins_df.head())
# Checking for null values
print(penguins_df.info())
print('\n')
for column in penguins_df.columns:
    try:
        # .any() flags a column with at least one missing value
        # (.all() would only catch columns that are entirely null)
        assert not penguins_df[column].isna().any()
        print(f'There are no null values in the {column} column.')
    except AssertionError:
        print(f'There are null values in the {column} column.')
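As an aside, a single isna().sum() call yields the same information more compactly: a per-column count of missing values. A minimal sketch on a hypothetical toy frame (the column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_df
df = pd.DataFrame({
    'culmen_length_mm': [39.1, np.nan, 40.3],
    'body_mass_g': [3750, 3800, 3250],
})

# Count missing values per column in one call
null_counts = df.isna().sum()
print(null_counts)

# Columns that contain at least one missing value
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

This avoids the per-column loop entirely and scales to any number of columns.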

Detecting the optimal number of clusters for k-means

As the background text stated, there are at least three species of penguins at the study site, so we expect tight clusters and at least three of them. I'll plot the inertia (the sum of squared distances from the samples to their cluster means, also known as centroids) for k from 1 to 10 and apply the "elbow" method: choose the point where the inertia's decrease starts to slow.

The following pipeline will be used both in this section and in the final k-means clustering. As you'll see, the process repeats across two chunks of code, but that repetition is a necessary step to correctly identify the number of clusters.

# Making the pipeline

# Selecting the numerical columns
num_col = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Preprocess with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_col)
    ],
    remainder='passthrough') # to leave the dummy column unchanged
ks = range(1, 11)
inertias = []

for k in ks:
    pipeline = Pipeline([
        ('preprocess', preprocessor),
        # explicit n_init keeps behavior stable across scikit-learn versions
        ('kmean', KMeans(n_clusters=k, n_init=10, random_state=80))
    ])
    
    # Fitting the pipeline
    pipeline.fit(penguins_df)
    
    # Append the inertia
    inertias.append(pipeline.named_steps['kmean'].inertia_)

# Plotting ks vs inertia
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

The chart shows that the inertia drops rapidly up to 5 clusters; beyond 5, the decrease levels off. So, for this analysis, I'll use 5 clusters.
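The elbow can be ambiguous to read by eye, so a complementary check is the silhouette score, which measures how well each sample fits its own cluster versus the nearest other cluster (higher is better). A minimal sketch on synthetic data, standing in for the scaled penguin features (the blob parameters are arbitrary, not from the project):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=80)

# Silhouette is only defined for k >= 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=80).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette is a candidate cluster count
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Agreement between the elbow and the silhouette peak gives more confidence in the chosen k.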

Running the final k-means clustering with 5 clusters

# Keeping the preprocessor, I'll make a new pipeline with KMeans(n_clusters=5)

pipeline = Pipeline([
    ('preprocess', preprocessor),
    # explicit n_init keeps behavior stable across scikit-learn versions
    ('kmean', KMeans(n_clusters=5, n_init=10, random_state=80))
])

labels = pipeline.fit_predict(penguins_df)

penguins_df['label'] = labels
# Analyzing the labels

print(penguins_df.head())

# Plotting the numerical columns with the labels to visualize the patterns

plt.scatter(penguins_df['culmen_length_mm'], penguins_df['flipper_length_mm'], c=penguins_df['label'])
plt.title('Culmen length vs Flipper length')
plt.show()

plt.scatter(penguins_df['culmen_length_mm'], penguins_df['body_mass_g'], c=penguins_df['label'])
plt.title('Culmen length vs body mass')
plt.show()
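Besides the scatter plots, two quick tabulations help characterize the clusters: the size of each cluster and how the sex dummy distributes across them. A minimal sketch on a hypothetical labelled frame (the column name MALE and the values are made up; the real dummy column name depends on the values in the sex column):

```python
import pandas as pd

# Hypothetical stand-in for penguins_df after clustering
df = pd.DataFrame({
    'label': [0, 0, 1, 1, 2, 2, 2],
    'MALE':  [1, 0, 1, 1, 0, 0, 1],
})

# Number of penguins assigned to each cluster
sizes = df['label'].value_counts().sort_index()
print(sizes)

# Cross-tabulate cluster label against the sex dummy
ct = pd.crosstab(df['label'], df['MALE'])
print(ct)
```

If some clusters turn out to be almost entirely one sex, that suggests the dummy column is driving part of the split, which is worth keeping in mind when interpreting the five clusters.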

Creating the final statistical DataFrame for each cluster

Each row is a cluster label and each column is one of the original numerical variables, as the project asked.

# creating the stat_penguins

stat_penguins = penguins_df.groupby('label')[num_col].mean()

print(stat_penguins)