Principal Component Analysis (PCA) in Python
Learn about PCA and how it can be leveraged to extract information from data without any supervision, using two popular datasets: Breast Cancer and CIFAR-10.
Introduction
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts of the data that have more variation and remove the non-essential parts with less variation.
Dimensions are nothing but features that represent the data. For example, a 28 x 28 image has 784 picture elements (pixels); these are the dimensions or features that together represent that image.
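As a quick illustration, assuming NumPy is available, flattening such an image turns the 28 x 28 pixel grid into a single 784-dimensional feature vector:

import numpy as np

# A hypothetical 28 x 28 grayscale image with random pixel values
image = np.random.rand(28, 28)

# Flattening the pixel grid yields one 784-dimensional feature vector
features = image.reshape(-1)
print(features.shape)  # (784,)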
One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: you can cluster similar data points based on the correlation between their features without any supervision (or labels), and you will learn how to achieve this practically using Python in later sections of this tutorial!
According to Wikipedia, PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
Note: Features, dimensions, and variables all refer to the same thing. You will find them used interchangeably.
But where can you apply PCA?
- Data Visualization: When working on any data-related problem, the challenge in today's world is the sheer volume of data and the variables/features that define it. To solve a problem where data is key, you need extensive data exploration, such as finding out how the variables are correlated or understanding the distribution of a few variables. Given the large number of variables or dimensions along which the data is distributed, visualization can be challenging and almost impossible to do by hand. PCA can do that for you, since it projects the data into a lower dimension, thereby allowing you to visualize it in a 2D or 3D space with the naked eye.
- Speeding Up a Machine Learning (ML) Algorithm: Since PCA's main idea is dimensionality reduction, you can leverage it to speed up your machine learning algorithm's training and testing time when your data has a lot of features and the ML algorithm's learning is too slow. At an abstract level, you take a dataset with many features and simplify it by selecting a few principal components from the original features, as the sketch below illustrates.
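The sketch below is a minimal illustration of both ideas, assuming scikit-learn and synthetic data from make_classification (not the datasets used later in this tutorial): the data is projected onto two principal components, which can then be plotted or fed to a faster-training model.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 1000 samples, 50 features
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (1000, 2) -- now easy to visualize or train on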
What is a Principal Component?
Principal components are the key to PCA; they represent what's underneath the hood of your data. In layman's terms, when the data is projected into a lower dimension (assume three dimensions) from a higher space, the three dimensions are nothing but the three principal components that capture (or hold) most of the variance (information) of your data.
Principal components have both direction and magnitude. The direction represents the principal axis across which the data is most spread out or has the most variance, and the magnitude signifies the amount of variance the principal component captures when the data is projected onto that axis. Each principal component is a straight line; the first principal component holds the most variance in the data, and each subsequent principal component is orthogonal to the previous ones and has less variance. In this way, given a set of x correlated variables over y samples, you obtain a set of u uncorrelated principal components over the same y samples.
The reason you obtain uncorrelated principal components from the original features is that correlated features contribute to the same principal component; the original data features are thereby reduced to uncorrelated principal components, each representing a different set of correlated features with a different amount of variation.
Each principal component represents a percentage of total variation captured from the data.
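As a minimal sketch of both claims, assuming scikit-learn and NumPy and using hypothetical random data, you can verify that the transformed components are uncorrelated and read off the fraction of variation each one captures from explained_variance_ratio_:

import numpy as np
from sklearn.decomposition import PCA

# Two deliberately correlated features built from random data
rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)  # strongly correlated with x1
X = np.column_stack([x1, x2])

pca = PCA()
components = pca.fit_transform(X)

# Off-diagonal correlations between components are ~0 (uncorrelated)
print(np.corrcoef(components.T))

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)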
In today's tutorial, you will apply PCA to two main use cases:
- Data Visualization
- Speeding up an ML Algorithm
To accomplish these two tasks, you will use two famous datasets: Breast Cancer (numerical) and CIFAR-10 (image).
Understanding the Data
Before you go ahead and load the data, it's good to understand and look at the data that you will be working with!
Breast Cancer
The Breast Cancer dataset is a real-valued multivariate dataset consisting of two classes, where each class signifies whether a patient has breast cancer or not. The two categories are malignant and benign.
The malignant class has 212 samples, whereas the benign class has 357 samples.
It has 30 features shared across all classes: radius, texture, perimeter, area, smoothness, fractal dimension, etc.
You can download the Breast Cancer dataset from here, or, more easily, load it with the help of the sklearn library.
CIFAR-10
The CIFAR-10 (Canadian Institute For Advanced Research) dataset consists of 60000 32x32 color images in ten classes, with 6000 images per class.
The dataset consists of 50000 training images and 10000 test images.
The classes in the dataset are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
You can download the CIFAR-10 dataset from here, or you can load it on the fly with the help of a deep learning library like Keras.
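For example, here is a minimal sketch of loading it on the fly, assuming Keras (with a TensorFlow backend) is installed:

from keras.datasets import cifar10

# Downloads CIFAR-10 on first use and returns the train/test splits
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3)
print(x_test.shape)   # (10000, 32, 32, 3)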
Data Exploration
Now you will load and analyze the Breast Cancer and CIFAR-10 datasets. By now you have an idea of the dimensionality of both datasets.
So, let's quickly explore both datasets.
Breast Cancer Data Exploration
Let's first explore the Breast Cancer dataset.
You will use sklearn's datasets module and import the Breast Cancer dataset from it.
%%capture
!pip install -r requirements.txt
from sklearn.datasets import load_breast_cancer
load_breast_cancer will give you both the labels and the data. To fetch the data, you call .data, and to fetch the labels, .target.
The data has 569 samples with thirty features, and each sample has a label associated with it. There are two labels in this dataset.
breast = load_breast_cancer()
breast_data = breast.data
Let's check the shape of the data.
breast_data.shape
Even though you do not need the labels for this tutorial, let's still load them and check their shape for a better understanding.
breast_labels = breast.target
breast_labels.shape
Now you will import numpy, since you will be reshaping breast_labels to concatenate it with breast_data so that you can finally create a DataFrame containing both the data and the labels.
import numpy as np
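As a preview, here is a sketch of the steps just described, assuming pandas is available (final_breast_dataset is a hypothetical name for the resulting DataFrame):

import pandas as pd

# Reshape the 1D label array into a column vector
labels = np.reshape(breast_labels, (569, 1))

# Append the label column to the 30 feature columns
final_breast_data = np.concatenate([breast_data, labels], axis=1)

# Build a DataFrame holding both the data and the labels
final_breast_dataset = pd.DataFrame(
    final_breast_data,
    columns=list(breast.feature_names) + ['label']
)
print(final_breast_dataset.shape)  # (569, 31)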