Project: Clustering Antarctic Penguin Species
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns.
| Column | Description |
|---|---|
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |
Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Step 1 - Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
print(penguins_df.head())
print(penguins_df.info())
# Step 2 - Perform preprocessing steps on the dataset to create dummy variables
# Convert categorical variables into dummy/indicator variables
penguins_df = pd.get_dummies(penguins_df, columns=['sex'], dtype='int') # Convert only the 'sex' column
# Step 3 - Perform preprocessing steps on the dataset - standardizing/scaling
# Scaling variables (also called standardizing) is recommended before performing a clustering algorithm
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins_df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
penguins_preprocessed = pd.DataFrame(data=X_scaled, columns=numeric_features)
# Step 4 - Detect the optimal number of clusters for k-means clustering
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
n_clusters = 3 # Adjusted based on the elbow plot
# Step 5 - Run the k-means clustering algorithm with the optimal number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_
# Visualize the clusters (example using two features)
plt.scatter(penguins_df['culmen_length_mm'], penguins_df['flipper_length_mm'], c=penguins_df['label'], cmap='viridis')
plt.xlabel('Culmen Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title(f'K-means Clustering (K={n_clusters})')
plt.colorbar(label='Cluster')
plt.show()
# Step 6 - Create final `stat_penguins` DataFrame
stat_penguins = penguins_df[numeric_features + ['label']].groupby('label').mean()
print(stat_penguins)
A brief explanation of each part of the code:
1. Import Required Packages

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

- pandas: used for data manipulation and analysis.
- matplotlib.pyplot: used for creating plots and visualizations.
- KMeans: a clustering algorithm from sklearn for unsupervised learning.
- StandardScaler: a tool from sklearn for standardizing features.
2. Loading and Examining the Dataset
penguins_df = pd.read_csv("penguins.csv")
print(penguins_df.head())
print(penguins_df.info())

- pd.read_csv("penguins.csv"): reads the CSV file into a DataFrame.
- print(penguins_df.head()): displays the first few rows of the dataset.
- print(penguins_df.info()): provides a summary of the DataFrame, including column names and data types.
3. Preprocessing Steps
penguins_df = pd.get_dummies(penguins_df, columns=['sex'], dtype='int')
- pd.get_dummies(penguins_df, columns=['sex'], dtype='int'): converts the categorical sex column into dummy/indicator variables (0s and 1s), making it suitable for clustering, as the algorithm requires numerical input.
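A quick sanity check of what get_dummies does, using a hypothetical two-row frame (the values below are invented for illustration, not taken from the penguin data):

```python
import pandas as pd

# Toy stand-in for the penguin data; values are made up
toy = pd.DataFrame({"flipper_length_mm": [181.0, 210.0],
                    "sex": ["MALE", "FEMALE"]})

# The categorical 'sex' column becomes one 0/1 indicator column per category
toy_dummies = pd.get_dummies(toy, columns=["sex"], dtype="int")
print(toy_dummies.columns.tolist())
# ['flipper_length_mm', 'sex_FEMALE', 'sex_MALE']
```

Note that the non-categorical columns pass through unchanged; only the listed columns are expanded.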
4. Standardizing/Scaling Data
numeric_features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins_df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
penguins_preprocessed = pd.DataFrame(data=X_scaled, columns=numeric_features)

- numeric_features: list of columns to use for clustering.
- X = penguins_df[numeric_features]: selects only the numeric columns for scaling.
- StandardScaler(): creates an instance of the scaler.
- scaler.fit_transform(X): scales the numeric features to have a mean of 0 and a standard deviation of 1.
- pd.DataFrame(data=X_scaled, columns=numeric_features): creates a new DataFrame with the scaled data.
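To see what the scaler actually does, here is a minimal sketch on a single invented column: StandardScaler subtracts the column mean and divides by the (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One invented feature column with three values
X = np.array([[170.0], [190.0], [210.0]])

# After scaling, the column has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.ravel())
```

This matters for k-means because the algorithm uses Euclidean distance: without scaling, body_mass_g (thousands of grams) would dominate culmen_depth_mm (tens of millimetres).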
5. Detecting the Optimal Number of Clusters
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
n_clusters = 3

- inertia = []: initializes a list to store inertia values.
- for k in range(1, 10): iterates over a range of cluster numbers (from 1 to 9).
- KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed): fits k-means with k clusters.
- inertia.append(kmeans.inertia_): appends the inertia value (the sum of squared distances to the closest cluster center) for each k.
- plt.plot(range(1, 10), inertia, marker='o'): plots the inertia values to visualize the elbow.
- plt.xlabel / plt.ylabel / plt.title: label the axes and set the plot title.
- plt.show(): displays the plot.
- n_clusters = 3: selects the number of clusters based on the elbow plot.
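Inertia itself is easy to compute by hand: it is the sum of squared distances from each point to its assigned cluster center. A minimal one-dimensional sketch, with invented points and fixed centers rather than centers found by k-means:

```python
import numpy as np

# Four invented 1-D points and two fixed cluster centers
points = np.array([1.0, 2.0, 10.0, 11.0])
centers = np.array([1.5, 10.5])

# Assign each point to its nearest center, then sum the squared distances
labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
inertia = np.sum((points - centers[labels]) ** 2)
print(inertia)  # each point is 0.5 from its center: 4 * 0.25 = 1.0
```

As k grows, inertia can only decrease, which is why the plot is read for an "elbow" rather than a minimum.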
6. Running the K-means Clustering Algorithm
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_

- KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed): fits k-means with the chosen number of clusters.
- penguins_df['label'] = kmeans.labels_: adds the cluster labels to the original DataFrame.
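labels_ is simply an integer array with one entry per row. A small sketch on an invented dataset of two well-separated pairs of points: with k=2, k-means recovers the pairs (the particular label numbers 0 and 1 are arbitrary).

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented, well-separated 1-D points
X = np.array([[1.0], [2.0], [10.0], [11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = km.labels_
print(labels)  # the low pair shares one label, the high pair the other
```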
7. Visualizing the Clusters
plt.scatter(penguins_df['culmen_length_mm'], penguins_df['flipper_length_mm'], c=penguins_df['label'], cmap='viridis')
plt.xlabel('Culmen Length (mm)')
plt.ylabel('Flipper Length (mm)')
plt.title(f'K-means Clustering (K={n_clusters})')
plt.colorbar(label='Cluster')
plt.show()

- plt.scatter(..., c=penguins_df['label'], cmap='viridis'): creates a scatter plot of culmen_length_mm vs. flipper_length_mm, colored by cluster label.
- plt.xlabel / plt.ylabel / plt.title: label the axes and set the plot title.
- plt.colorbar(label='Cluster'): adds a color bar indicating the cluster labels.
- plt.show(): displays the plot.
8. Creating the stat_penguins DataFrame
stat_penguins = penguins_df[numeric_features + ['label']].groupby('label').mean()
print(stat_penguins)

- penguins_df[numeric_features + ['label']]: selects the numeric columns and the cluster labels.
- groupby('label').mean(): groups by cluster label and calculates the mean of each numeric feature within each cluster.
- print(stat_penguins): displays the average feature values for each cluster.
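The groupby-mean pattern can be checked on a hypothetical labelled frame (invented numbers, not the penguin measurements):

```python
import pandas as pd

# Invented body masses with cluster labels already attached
df = pd.DataFrame({"body_mass_g": [3500, 3700, 5000, 5200],
                   "label": [0, 0, 1, 1]})

# One row per cluster, holding the per-cluster mean of each numeric column
means = df.groupby("label").mean()
print(means)  # label 0 -> 3600.0, label 1 -> 5100.0
```

These per-cluster averages are what lets the researchers compare the groups against known species profiles (e.g. Gentoo penguins are markedly heavier than Adelie and Chinstrap).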