Arctic Penguin Exploration: Unraveling Clusters in the Icy Domain with K-means clustering

Image source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica!

Origin of this data: the data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The dataset consists of 5 columns.

  • culmen_length_mm: culmen length (mm)
  • culmen_depth_mm: culmen depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: penguin sex

Unfortunately, they have not been able to record the species of each penguin, but they know that three species are native to the region: Adelie, Chinstrap, and Gentoo. Your task is to apply your data science skills to help them identify groups in the dataset!

# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import numpy as np

# Loading and examining the dataset
penguins_df = pd.read_csv("data/penguins.csv")
display(penguins_df)
# Summarize column dtypes and non-null counts
penguins_df.info()

Cleaning Data

Checking and removing missing values

# Check the number of rows and what 5% of that number is
print('Number of rows: {}'.format(len(penguins_df)))
print('5% of the rows: {}'.format(int(0.05*len(penguins_df))))

# Check missing values
penguins_df.isna().sum()

From the output above, the number of missing values in every column is below 5% of the dataset size, so we decide to drop the rows that contain them.
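As a rough illustration of the 5% rule of thumb, the comparison can also be made directly on fractions instead of reading counts by eye. The sketch below uses a small made-up frame (`toy_df`, hypothetical) in place of `penguins_df`; the name `threshold` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for penguins_df (values are made up)
toy_df = pd.DataFrame({
    "culmen_length_mm": [39.1, 39.5, np.nan, 36.7] + [40.0] * 36,
    "sex": ["MALE", "FEMALE", None, "FEMALE"] + ["MALE"] * 36,
})

threshold = 0.05  # the 5% rule of thumb used in the text

# Fraction of missing values in each column (mean of a boolean mask)
missing_fraction = toy_df.isna().mean()

# True for each column whose missing share is below the threshold
below_threshold = missing_fraction < threshold

print(missing_fraction)
print("All columns below threshold:", below_threshold.all())
```

If every column falls below the threshold, dropping the affected rows loses little data; otherwise imputation may be worth considering instead.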

# Removing missing values
penguins_df = penguins_df.dropna()

# Confirm that no missing values remain
penguins_df.isna().sum()
# Number of rows after removing missing values
print('Number of rows after removing missing values: {}'.format(len(penguins_df)))

The data is now free of missing values, so we can move on to removing outliers.

Checking and removing outliers

# Check summary statistics for each numeric column
penguins_df.describe()

From the summary statistics, the flipper_length_mm column appears to contain outliers, so we must remove them to fix the data distribution. To validate that hypothesis, let us visualize the distribution of every column using boxplots.
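A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be sketched numerically, which may help confirm what the plot suggests. The series below is made-up toy data (hypothetical, not taken from the dataset), and the 1.5 multiplier is the usual boxplot default rather than something the source specifies.

```python
import pandas as pd

# Hypothetical toy values standing in for penguins_df["flipper_length_mm"];
# 5000.0 and -132.0 are made-up extreme readings for illustration
flipper = pd.Series(
    [181.0, 186.0, 195.0, 193.0, 190.0, 5000.0, 210.0, -132.0, 198.0, 185.0],
    name="flipper_length_mm",
)

# Quartiles and interquartile range
q1 = flipper.quantile(0.25)
q3 = flipper.quantile(0.75)
iqr = q3 - q1

# Boxplot-style fences: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask of values outside the fences
is_outlier = (flipper < lower) | (flipper > upper)

print("Fences:", lower, upper)
print(flipper[is_outlier])
```

Rows flagged by such a mask could then be filtered out with `flipper[~is_outlier]`, though whether to drop or correct them depends on why the values are extreme.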