Clustering Antarctic Penguin Species
Antarctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means Clustering
source: @allison_horst https://github.com/allisonhorst/penguins
Project Description
Delve into the penguin data by applying unsupervised learning techniques to a thoughtfully curated dataset. Uncover concealed patterns, clusters, and relationships that exist within the dataset.
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.
Origin of this data : Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of 5 columns.
| Column | Description |
|---|---|
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |
Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
Utilize your unsupervised learning skills to identify clusters in the penguins dataset!
- Import, investigate, and pre-process the "penguins.csv" dataset.
- Perform a cluster analysis based on a reasonable number of clusters and collect the average values for the clusters. The output should be a DataFrame named `stat_penguins` with one row per cluster that shows the mean of the original variables (or columns in "penguins.csv") by cluster. `stat_penguins` should not include any non-numeric columns.
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()
display(penguins_df.info())
display(penguins_df.isna().sum())
display(penguins_df['sex'].value_counts())
1. Perform preprocessing steps on the dataset to create dummy variables
Create dummy variables for the available categorical feature in the dataset, then drop the original column.
# Perform preprocessing steps on the dataset to create dummy variables
# Drop rows with missing values first, since StandardScaler and KMeans cannot handle NaNs
penguins_df = penguins_df.dropna()
# Use the get_dummies method to convert the categorical 'sex' feature
# into dummy/indicator variables, dropping the original column
pre_penguin = pd.get_dummies(penguins_df, drop_first=True)
pre_penguin.head()
# Perform preprocessing steps on the dataset - standardizing/scaling
# Scaling (standardizing) variables before clustering is recommended, since it can greatly
# improve performance (see https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)
scaler = StandardScaler()
X = scaler.fit_transform(pre_penguin)
penguins_preprocessed = pd.DataFrame(data=X,columns=pre_penguin.columns)
penguins_preprocessed.head(10)
2. Detect the optimal number of clusters for k-means clustering
Perform Elbow analysis to determine the optimal number of clusters for this dataset.
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters and fit it to penguins_preprocessed
    model = KMeans(n_clusters=k, random_state=42).fit(penguins_preprocessed)
    # Append the model's inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.title('Elbow Method')
plt.xticks(ks)
plt.show()
3. Run the k-means clustering algorithm
Using the optimal number of clusters obtained from the previous step, run the k-means clustering algorithm once more on the preprocessed data.
# Number of clusters chosen from the elbow plot above
n_clusters = 4
# Run the k-means clustering algorithm
# with the optimal number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(penguins_preprocessed)
penguins_df['label'] = kmeans.labels_
penguins_df['label'].value_counts()
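The final deliverable described in the task, `stat_penguins`, is not produced by the cells above. A minimal sketch of that last step follows: group the labelled DataFrame by cluster and take the mean of the numeric columns only, so no non-numeric columns (such as `sex`) appear in the result. The helper name `cluster_means` and the tiny demo DataFrame are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

def cluster_means(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """One row per cluster: mean of each numeric column; non-numeric columns are dropped."""
    # Keep only numeric columns, excluding the label column itself from the averaged variables
    numeric_cols = df.select_dtypes(include="number").columns.drop(label_col)
    return df.groupby(label_col)[list(numeric_cols)].mean()

# Tiny synthetic stand-in for the labelled penguins_df (hypothetical values)
demo = pd.DataFrame({
    "culmen_length_mm": [39.1, 39.5, 46.5, 47.0],
    "body_mass_g": [3750, 3800, 4500, 4550],
    "sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "label": [0, 0, 1, 1],
})
stat_penguins = cluster_means(demo)
print(stat_penguins)
```

In the notebook this would be called as `stat_penguins = cluster_means(penguins_df)` after the labels have been attached in the cell above.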