Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means clustering
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!
Origin of this data: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns:
- culmen_length_mm: culmen length (mm)
- culmen_depth_mm: culmen depth (mm)
- flipper_length_mm: flipper length (mm)
- body_mass_g: body mass (g)
- sex: penguin sex
Unfortunately, they have not been able to record the species of penguin, but they know that there are three species that are native to the region: Adelie, Chinstrap, and Gentoo, so your task is to apply your data science skills to help them identify groups in the dataset!
How to approach the project
- Load and examine the dataset
- Deal with null values and outliers
- Perform preprocessing steps on the dataset to create dummy variables
- Perform preprocessing steps on the dataset - scaling
- Perform PCA
- Detect the optimal number of clusters for k-means clustering
- Run the k-means clustering algorithm
- Create a final statistical DataFrame for each cluster
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")

# Investigate the dataset
display(penguins_df.head())
display(penguins_df.describe())
penguins_df.info()
display(penguins_df.isna().sum())

# Visualising the dataset
sns.boxplot(penguins_df.iloc[:, :4])
plt.show()
print("There are outliers in the 'flipper_length_mm' column")

# Dealing with NA values & outliers
# Remove na values
penguins_clean = penguins_df.dropna()
print(penguins_clean.isna().sum())
# Remove outliers
# Find the max & min values in the 'flipper_length_mm' column
max_flipper_length = penguins_clean['flipper_length_mm'].max()
min_flipper_length = penguins_clean['flipper_length_mm'].min()
# Find the index label(s) corresponding to the max & min value
max_index = penguins_clean[penguins_clean['flipper_length_mm'] == max_flipper_length].index
min_index = penguins_clean[penguins_clean['flipper_length_mm'] == min_flipper_length].index
# Concatenate the index arrays into a single array
indexes_to_drop = max_index.append(min_index)
# Drop the row(s) with the maximum and minimum flipper lengths
penguins_clean.drop(indexes_to_drop, inplace=True)
# Re-visualise the clean dataset
sns.boxplot(penguins_clean.iloc[:, :4])
plt.show()
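Dropping only the rows with the single largest and smallest flipper lengths works for this particular boxplot; a more general alternative (a sketch, not part of the required steps) is the conventional 1.5 × IQR fence:

# Sketch: IQR-based outlier filtering (alternative approach; assumes
# penguins_clean from above and the common 1.5 * IQR convention)
numeric = penguins_clean.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
inlier_mask = ((numeric >= q1 - 1.5 * iqr) & (numeric <= q3 + 1.5 * iqr)).all(axis=1)
penguins_iqr = penguins_clean[inlier_mask]  # hypothetical alternative frame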
# Perform preprocessing steps on the dataset to create dummy variables
# Create the dummy variables and remove the original categorical feature;
# the 'sex_.' dummy comes from an invalid '.' entry in the raw data, so drop it
encoded = pd.get_dummies(penguins_clean, columns=['sex']).drop(columns=['sex_.'])
# Scale the data using the standard scaling method
scaler = StandardScaler()
X = scaler.fit_transform(encoded)
# Save the updated data as a new DataFrame called penguins_preprocessed
penguins_preprocessed = pd.DataFrame(X, columns=encoded.columns)
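As an aside, the scale, PCA, and cluster steps used in this project can also be chained with scikit-learn's Pipeline; a minimal sketch, assuming 2 retained components and 4 clusters as found below:

# Sketch: the same preprocessing chained in a Pipeline (for reference only)
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),  # assumption: 2 components, matching n_components below
    ("kmeans", KMeans(n_clusters=4, random_state=42, n_init=10)),
])
pipe_labels = pipe.fit_predict(encoded)  # one call instead of the manual steps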
# Perform Principal Component Analysis (PCA)
model = PCA()
model.fit(penguins_preprocessed)
var_ratio = model.explained_variance_ratio_
n_components = sum(var_ratio > 0.1)
model = PCA(n_components=n_components)
penguins_PCA = model.fit_transform(penguins_preprocessed)
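A quick optional check on how much variance the retained components keep:

# Sketch: cumulative explained variance of the retained components (optional check)
import numpy as np
print(np.cumsum(model.explained_variance_ratio_))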
# Detect the optimal number of clusters for k-means clustering
inertia = []
# Elbow analysis
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10).fit(penguins_PCA)
    inertia.append(kmeans.inertia_)
# Visualise the inertia
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel('n_clusters')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()
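The elbow can be ambiguous, so as a complementary check (a sketch, not part of the original analysis) the silhouette score from sklearn.metrics can be compared across candidate values of k:

# Sketch: silhouette scores as a second opinion on the elbow
from sklearn.metrics import silhouette_score

for k in range(2, 10):  # the silhouette is defined only for k >= 2
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(penguins_PCA)
    print(f"k={k}: silhouette={silhouette_score(penguins_PCA, labels):.3f}")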
# Decide the number of clusters
n_clusters = 4

# Run the k-means clustering algorithm
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit(penguins_PCA)
# Visualise the clusters
plt.scatter(penguins_PCA[:, 0], penguins_PCA[:, 1], c=kmeans.labels_)
plt.show()
# Create a final statistical DataFrame for each cluster.
penguins_clean['label'] = kmeans.labels_
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']
stat_penguins = penguins_clean.groupby('label')[numeric_columns].mean()
print(stat_penguins)

# Print final checks
print(penguins_PCA.shape)
display(penguins_clean)
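Three species are native to the region, yet the elbow suggested four clusters; an optional sanity check is to look at the cluster sizes (with the sex dummies included, k-means may be separating some species by sex, which is one plausible reading of k = 4):

# Sketch: cluster sizes as a sanity check (optional, not part of the brief)
print(penguins_clean['label'].value_counts().sort_index())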