Project: Clustering Antarctic Penguin Species
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns.
| Column | Description |
|---|---|
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |
Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Step 1 - Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
print(penguins_df.head())
print(penguins_df.info())
# Step 2 - Perform preprocessing steps on the dataset to create dummy variables
# Convert categorical variables into dummy/indicator variables
penguins_df = pd.get_dummies(penguins_df, columns=['sex'], dtype='int') # Convert only the 'sex' column
# Step 3 - Perform preprocessing steps on the dataset - standardizing/scaling
# Scaling variables (also called standardizing) is recommended before performing a clustering algorithm
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins_df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
penguins_preprocessed = pd.DataFrame(data=X_scaled, columns=numeric_features)
# Step 4 - Detect the optimal number of clusters for k-means clustering
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
n_clusters = 3 # Adjusted based on the elbow plot
# Step 5 - Run the k-means clustering algorithm with the optimal number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_
# Visualize the clusters (example using two features)
plt.scatter(penguins_df['culmen_length_mm'], penguins_df['flipper_length_mm'], c=penguins_df['label'], cmap='viridis')
plt.xlabel('Culmen Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title(f'K-means Clustering (K={n_clusters})')
plt.colorbar(label='Cluster')
plt.show()
# Step 6 - Create final `stat_penguins` DataFrame
stat_penguins = penguins_df[numeric_features + ['label']].groupby('label').mean()
print(stat_penguins)
A brief explanation of each part of the code:
1. Import Required Packages

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

- pandas: used for data manipulation and analysis.
- matplotlib.pyplot: used for creating plots and visualizations.
- KMeans: a clustering algorithm from sklearn for unsupervised learning.
- StandardScaler: a tool from sklearn for standardizing features.
2. Loading and Examining the Dataset
penguins_df = pd.read_csv("penguins.csv")
print(penguins_df.head())
print(penguins_df.info())

- pd.read_csv("penguins.csv"): reads the CSV file into a DataFrame.
- print(penguins_df.head()): displays the first few rows of the dataset.
- print(penguins_df.info()): provides a summary of the DataFrame, including column names and data types.
3. Preprocessing Steps
penguins_df = pd.get_dummies(penguins_df, columns=['sex'], dtype='int')
- pd.get_dummies(penguins_df, columns=['sex'], dtype='int'): converts the categorical sex column into dummy/indicator variables (0s and 1s), making it suitable for clustering, as the algorithm requires numerical input.
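A quick sanity check of what get_dummies does, using a hypothetical two-row frame (the values below are invented for illustration, not taken from the penguin data):

```python
import pandas as pd

# Toy stand-in for the penguin data; values are made up
toy = pd.DataFrame({"flipper_length_mm": [181.0, 210.0],
                    "sex": ["MALE", "FEMALE"]})

# The categorical 'sex' column becomes one 0/1 indicator column per category
toy_dummies = pd.get_dummies(toy, columns=["sex"], dtype="int")
print(toy_dummies.columns.tolist())
# ['flipper_length_mm', 'sex_FEMALE', 'sex_MALE']
```

Note that the non-categorical columns pass through unchanged; only the listed columns are expanded.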
4. Standardizing/Scaling Data
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins_df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
penguins_preprocessed = pd.DataFrame(data=X_scaled, columns=numeric_features)

- numeric_features: list of columns to use for clustering.
- X = penguins_df[numeric_features]: selects only the numeric columns for scaling.
- StandardScaler(): creates an instance of the scaler.
- scaler.fit_transform(X): scales the numeric features to have a mean of 0 and a standard deviation of 1.
- pd.DataFrame(data=X_scaled, columns=numeric_features): creates a new DataFrame with the scaled data.
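To see what the scaler actually does, here is a minimal sketch on a single invented column: StandardScaler subtracts the column mean and divides by the (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One invented feature column with three values
X = np.array([[170.0], [190.0], [210.0]])

# After scaling, the column has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.ravel())
```

This matters for k-means because the algorithm uses Euclidean distance: without scaling, body_mass_g (thousands of grams) would dominate culmen_depth_mm (tens of millimetres).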
5. Detecting the Optimal Number of Clusters
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
n_clusters = 3

- inertia = []: initializes a list to store inertia values.
- for k in range(1, 10): iterates over a range of cluster numbers (from 1 to 9).
- KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed): fits k-means with k clusters.
- inertia.append(kmeans.inertia_): appends the inertia value (the sum of squared distances to the closest cluster center) for each k.
- plt.plot(range(1, 10), inertia, marker='o'): plots the inertia values to visualize the elbow.
- plt.xlabel / plt.ylabel / plt.title: label the axes and set the plot title.
- plt.show(): displays the plot.
- n_clusters = 3: selects the number of clusters based on the elbow plot.
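Inertia itself is easy to compute by hand: it is the sum of squared distances from each point to its assigned cluster center. A minimal one-dimensional sketch, with invented points and fixed centers rather than centers found by k-means:

```python
import numpy as np

# Four invented 1-D points and two fixed cluster centers
points = np.array([1.0, 2.0, 10.0, 11.0])
centers = np.array([1.5, 10.5])

# Assign each point to its nearest center, then sum the squared distances
labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
inertia = np.sum((points - centers[labels]) ** 2)
print(inertia)  # each point is 0.5 from its center: 4 * 0.25 = 1.0
```

As k grows, inertia can only decrease, which is why the plot is read for an "elbow" rather than a minimum.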
6. Running the K-means Clustering Algorithm
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_

- KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed): fits k-means with the chosen number of clusters.
- penguins_df['label'] = kmeans.labels_: adds the cluster labels to the original DataFrame.
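labels_ is simply an integer array with one entry per row. A small sketch on an invented dataset of two well-separated pairs of points: with k=2, k-means recovers the pairs (the particular label numbers 0 and 1 are arbitrary).

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented, well-separated 1-D points
X = np.array([[1.0], [2.0], [10.0], [11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = km.labels_
print(labels)  # the low pair shares one label, the high pair the other
```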
7. Visualizing the Clusters
plt.scatter(penguins_df['culmen_length_mm'], penguins_df['flipper_length_mm'], c=penguins_df['label'], cmap='viridis')
plt.xlabel('Culmen Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title(f'K-means Clustering (K={n_clusters})')
plt.colorbar(label='Cluster')
plt.show()

- plt.scatter(..., c=penguins_df['label'], cmap='viridis'): creates a scatter plot of culmen_length_mm vs. flipper_length_mm, colored by cluster label.
- plt.xlabel / plt.ylabel / plt.title: label the axes and set the plot title.
- plt.colorbar(label='Cluster'): adds a color bar indicating the cluster labels.
- plt.show(): displays the plot.
8. Creating the stat_penguins DataFrame
stat_penguins = penguins_df[numeric_features + ['label']].groupby('label').mean()
print(stat_penguins)

- penguins_df[numeric_features + ['label']]: selects the numeric columns and the cluster labels.
- groupby('label').mean(): groups by cluster label and calculates the mean of each numeric feature within each cluster.
- print(stat_penguins): displays the average feature values for each cluster.
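The groupby-mean pattern can be checked on a hypothetical labelled frame (invented numbers, not the penguin measurements):

```python
import pandas as pd

# Invented body masses with cluster labels already attached
df = pd.DataFrame({"body_mass_g": [3500, 3700, 5000, 5200],
                   "label": [0, 0, 1, 1]})

# One row per cluster, holding the per-cluster mean of each numeric column
means = df.groupby("label").mean()
print(means)  # label 0 -> 3600.0, label 1 -> 5100.0
```

These per-cluster averages are what lets the researchers compare the groups against known species profiles (e.g. Gentoo penguins are markedly heavier than Adelie and Chinstrap).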