Project: Clustering Antarctic Penguin Species
source: @allison_horst https://github.com/allisonhorst/penguins
You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.
Origin of this data: collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset consists of five columns:
| Column | Description |
|---|---|
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |
Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!
# Import Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()

# Transform the data into a NumPy array, dropping the non-numeric "sex" column
penguins_array = penguins_df.drop("sex", axis=1).values
print(penguins_array)
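One caveat worth a quick check (an assumption on my part, since the CSV shipped with the project may already be clean): k-means cannot handle missing values, and the original Palmer penguins data does contain some.
# Precautionary check: if any NaNs show up, drop or impute them before clustering
print(penguins_df.isna().sum())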
# Build a pipeline and compute the inertia for k = 1..20
inertia = []
num_clusters = range(1, 21)
for i in num_clusters:
    scaler = StandardScaler()
    kmeans = KMeans(n_clusters=i, random_state=42)
    pipeline = make_pipeline(scaler, kmeans)
    pipeline.fit(penguins_array)
    inertia.append(kmeans.inertia_)

plt.plot(num_clusters, inertia, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# From the inertia plot it seems reasonable to work with 5 clusters
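As an optional cross-check on the elbow heuristic (an added sketch, not part of the original analysis), silhouette scores give a second view on cluster quality; higher is better:
from sklearn.metrics import silhouette_score

# Score each candidate k on the standardized features used for clustering
X_scaled = StandardScaler().fit_transform(penguins_array)
for k in range(2, 11):
    cluster_labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_scaled, cluster_labels):.3f}")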
# Fit the final model with 5 clusters
scaler = StandardScaler()
kmeans = KMeans(n_clusters=5, random_state=42)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(penguins_array)
labels = pipeline.predict(penguins_array)

# The cluster centers live in the standardized feature space
centers_transformed = kmeans.cluster_centers_
print(centers_transformed)

# Per-feature mean and standard deviation learned by the scaler
# (the std is also available directly as scaler.scale_)
data_mean = scaler.mean_
data_std = np.sqrt(scaler.var_)
print(data_mean)
print(data_std)
# Undo the standardization: x_original = x_scaled * std + mean (row-wise broadcast).
# Note that a per-feature list comprehension such as
#   np.array([centers_transformed[:, i] * data_std[i] + data_mean[i] for i in range(len(data_mean))])
# would produce the transposed array (features x clusters).
centers = centers_transformed * data_std + data_mean
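The same back-transformation is available directly from the fitted scaler via StandardScaler.inverse_transform; a quick equivalence check (added here, not in the original):
# inverse_transform applies the same affine map: x * scale_ + mean_
centers_check = scaler.inverse_transform(centers_transformed)
print(np.allclose(centers, centers_check))  # expected: True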
# Back-transformed cluster centers as a labelled DataFrame
stat_penguins = pd.DataFrame(centers, columns=penguins_df.drop("sex", axis=1).columns)
print(stat_penguins)
#print(centers)

# Exploratory pairwise scatter plots of the raw features, coloured by cluster label:
#for i in range(4):
#    for j in range(i + 1, 4):
#        plt.scatter(penguins_array[:, i], penguins_array[:, j], c=labels)
#        plt.show()

# A different way of computing the final DataFrame
penguins_df_modified = penguins_df.drop("sex", axis=1)
penguins_df_modified["cluster_number"] = labels
stat_penguins1 = penguins_df_modified.groupby("cluster_number").mean()
stat_penguins1.index.name = None
# The difference should be ~0: standardization is an affine map, so the
# back-transformed centroids coincide with the per-cluster feature means
# in the original units.
print(stat_penguins1 - stat_penguins)
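A compact check of that claim (an added sanity check, not in the original notebook):
# groupby means and back-transformed centroids should agree up to float error
print(np.allclose(stat_penguins1.values, stat_penguins.values))  # expected: True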
# A different way, without the pipeline:
#scaler1 = StandardScaler()
#scaler1.fit(penguins_array)
#transformed_penguins_array = scaler1.transform(penguins_array)
#kmeans1 = KMeans(n_clusters=5)
#kmeans1.fit(transformed_penguins_array)
#labels1 = kmeans1.predict(transformed_penguins_array)
#for i in range(4):
#    for j in range(i + 1, 4):
#        print(i, j)
#        plt.scatter(transformed_penguins_array[:, i], transformed_penguins_array[:, j], c=labels1)
#        plt.show()