Principal Component Analysis

When you have a very wide dataset, you can reduce its dimensionality using principal component analysis (PCA). PCA relies on the concepts of linear transformations and singular value decomposition (SVD) from linear algebra.
Luckily, scikit-learn provides an implementation we can use to perform PCA without having to worry about the linear algebra ourselves. You can find the documentation here.
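
For intuition, here is roughly what happens under the hood, as a purely illustrative NumPy sketch (X_demo is a made-up array, not the iris data loaded below):

import numpy as np

# Toy data: 100 samples, 4 features (X_demo is a stand-in, not the iris data)
X_demo = np.random.default_rng(0).normal(size=(100, 4))
X_centered = X_demo - X_demo.mean(axis=0)    # center each feature at zero
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                          # top-2 principal directions
projected = X_centered @ components.T        # data expressed in the 2-D PC space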

# Load packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
# Load data from the CSV file
df = pd.read_csv("iris.csv", index_col=0)
df.head()
# Define the features for which you want the principal components
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = df.loc[:, features].values
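
One caveat worth knowing: PCA is sensitive to the scale of the features. The iris measurements are all in the same units (centimeters), so we skip this step here, but if your features were on very different scales you could standardize them first, for example (X_scaled is a hypothetical name used only for illustration):

from sklearn.preprocessing import StandardScaler

# Optional: rescale each feature to zero mean and unit variance before PCA
# (you would then fit PCA on X_scaled instead of X)
X_scaled = StandardScaler().fit_transform(X)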

# Get the target labels as a 1-D array
y = df.loc[:, 'target'].values
# The number of principal components we want
# This will be the dimensionality of the transformed dataset
N_COMPONENTS = 2

# Import the PCA package from scikit-learn
from sklearn.decomposition import PCA

# Set up and execute PCA
pca = PCA(n_components=N_COMPONENTS)
principal_comps = pca.fit_transform(X)
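
It is worth checking how much of the original variance the two components retain; the fitted PCA object exposes this directly:

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
# For the iris data, the first two components retain well over 90% of the variance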

# Convert into a new dataframe
principal_df = pd.DataFrame(data=principal_comps, columns=['pc1', 'pc2'])
principal_df['target'] = y
principal_df.head()
# Change the plot style (recent matplotlib versions renamed the seaborn styles)
plt.style.use('seaborn-v0_8-darkgrid')

# Plot the dataset in the space of the two principal components
plt.scatter(
    x=principal_df['pc1'],
    y=principal_df['pc2'],
    c=principal_df['target']
)
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

As you can see, the dimensionality of the dataset has been reduced from four features to two, while the different classes are still clearly separable.
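
If you want to see how much each original feature contributes to the two components, the fitted PCA object also exposes the loadings, with one row per component and columns in the same order as features:

# Loadings: rows are pc1 and pc2, columns are the original four features
loadings = pd.DataFrame(pca.components_, columns=features, index=['pc1', 'pc2'])
print(loadings)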

Acknowledgement

The dataset used in this template can be found here.