Alt text source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as penguins.csv.

Origin of this data: the data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

| Column | Description |
| --- | --- |
| culmen_length_mm | culmen length (mm) |
| culmen_depth_mm | culmen depth (mm) |
| flipper_length_mm | flipper length (mm) |
| body_mass_g | body mass (g) |
| sex | penguin sex |

Unfortunately, they have not been able to record the species of penguin, but they know that there are at least three species that are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()
print(penguins_df.isna().sum())
print(penguins_df.shape)

# Scatter plot of the first two columns
x = penguins_df.iloc[:, 0]
y = penguins_df.iloc[:, 1]
plt.scatter(x, y)
plt.xlabel(penguins_df.columns[0])
plt.ylabel(penguins_df.columns[1])
plt.show()

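# Elbow analysis: fit KMeans for several cluster counts and record the inertia.
# Note: the features are not scaled here, so body_mass_g (in grams) dominates the distances.
n = [2, 3, 4, 5, 6]
X_train = penguins_df.drop('sex', axis=1).values
inrt = []
for i in n:
    # Fixed random_state and explicit n_init make the runs reproducible
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X_train)
    inrt.append(kmeans.inertia_)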
'''
Inertia is the sum of squared distances from each point to the centroid of its cluster. In the plot below, inertia starts to drop more slowly from 3 clusters onward, which marks the elbow, so 3 is taken as the number of clusters.
'''
plt.plot(n,inrt,'o-')
plt.title('Inertia vs No. of Clusters')
plt.xlabel('No. of Clusters')
plt.ylabel('Inertia')
plt.show()
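
To make the definition of inertia concrete, here is a minimal sketch that recomputes it by hand and compares the result with the `inertia_` attribute that KMeans reports. The data is a synthetic stand-in (three 2-D blobs), not the penguin measurements, and names such as `manual_inertia` are introduced only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(((X - assigned_centroids) ** 2).sum())

print("manual :", manual_inertia)
print("sklearn:", kmeans.inertia_)
```

The two printed values should agree up to floating-point error. Because inertia sums squared distances in the raw feature units, a column with a large numeric range contributes far more than the others unless the features are scaled first.
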
# Initialize the pipeline: standardize the features, then cluster with KMeans
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=3, n_init=10, random_state=42))
])

pipeline.fit(X_train)
y_pred = pipeline.predict(X_train)

penguins_df['cluster'] = y_pred

species = ['Adelie', 'Chinstrap',  'Gentoo']
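# Mean of each numeric column per cluster; non-numeric columns such as 'sex' are skipped
stat_penguins = penguins_df.groupby('cluster').mean(numeric_only=True).reset_index()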

print(stat_penguins)

# Attach a species name to each cluster. KMeans label numbers are arbitrary,
# so this pairing is a placeholder rather than a verified identification.
penguins_df['species'] = penguins_df['cluster'].map(dict(enumerate(species)))
# The species column is derived from the cluster column, so the crosstab below is
# diagonal by construction: it shows cluster sizes, not agreement with true species.
penguins_cross = pd.crosstab(penguins_df['cluster'], penguins_df['species'])
print(penguins_cross)
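scatter = plt.scatter(penguins_df.iloc[:, 0], penguins_df.iloc[:, 1], c=penguins_df['cluster'])
plt.xlabel(penguins_df.columns[0])
plt.ylabel(penguins_df.columns[1])
# Build legend entries from the scatter colors; a bare plt.legend() finds no labeled artists
plt.legend(*scatter.legend_elements(), title='cluster')
plt.show()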
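
The cross-tabulation above pairs each cluster with a name derived from that same cluster, so it cannot show how well the clusters match real species. If recorded labels were available (they are not in this dataset), a crosstab against them would be more informative. The sketch below shows that idea on synthetic data with hypothetical labels `A`, `B`, `C`. It also computes the adjusted Rand index, which does not depend on how the cluster numbers are assigned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: three blobs with known group labels.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
true_labels = np.repeat(['A', 'B', 'C'], 40)

# Same recipe as the pipeline above: standardize, then cluster.
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Rows are cluster numbers, columns are the known labels.
table = pd.crosstab(clusters, true_labels, rownames=['cluster'], colnames=['true_label'])
ari = adjusted_rand_score(true_labels, clusters)

print(table)
print("adjusted Rand index:", ari)
```

A table with one dominant cell per row and an index close to 1 indicates that the clusters line up with the known groups, whichever cluster number each group happened to receive.
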