Introduction to tSNE
In this tutorial, you’ll learn about the recently discovered Dimensionality Reduction technique known as tDistributed Stochastic Neighbor Embedding (tSNE).
Dimensionality Reduction
If you have worked with a dataset before with a lot of features, you can fathom how difficult it is to understand or explore the relationships between the features. Not only it makes the EDA process difficult but also affects the machine learning model’s performance since the chances are that you might overfit your model or violate some of the assumptions of the algorithm, like the independence of features in linear regression. This is where dimensionality reduction comes in. In machine learning, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. By reducing the dimension of your feature space, you have fewer relationships between features to consider which can be explored and visualized easily and also you are less likely to overfit your model.
Dimensionality reduction can be achieved in the following ways:

Feature Elimination: You reduce the feature space by eliminating features. This has a disadvantage though, as you gain no information from those features that you have dropped.

Feature Selection: You apply some statistical tests in order to rank them according to their importance and then select a subset of features for your work. This again suffers from information loss and is less stable as different test gives different importance score to features. You can check more on this here.

Feature Extraction: You create new independent features, where each new independent feature is a combination of each of the old independent features. These techniques can further be divided into linear and nonlinear dimensionality reduction techniques.
Principal Component Analysis (PCA)
Principal Component Analysis or PCA is a linear feature extraction technique. It performs a linear mapping of the data to a lowerdimensional space in such a way that the variance of the data in the lowdimensional representation is maximized. It does so by calculating the eigenvectors from the covariance matrix. The eigenvectors that correspond to the largest eigenvalues (the principal components) are used to reconstruct a significant fraction of the variance of the original data.
In simpler terms, PCA combines your input features in a specific way that you can drop the least important feature while still retaining the most valuable parts of all of the features. As an added benefit, each of the new features or components created after PCA are all independent of one another.
tDistributed Stochastic Neighbor Embedding (tSNE)
tDistributed Stochastic Neighbor Embedding (tSNE) is a nonlinear technique for dimensionality reduction that is particularly well suited for the visualization of highdimensional datasets. It is extensively applied in image processing, NLP, genomic data and speech processing. To keep things simple, here’s a brief overview of working of tSNE:
 The algorithms starts by calculating the probability of similarity of points in highdimensional space and calculating the probability of similarity of points in the corresponding lowdimensional space. The similarity of points is calculated as the conditional probability that a point A would choose point B as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian (normal distribution) centered at A.
 It then tries to minimize the difference between these conditional probabilities (or similarities) in higherdimensional and lowerdimensional space for a perfect representation of data points in lowerdimensional space.
 To measure the minimization of the sum of difference of conditional probability tSNE minimizes the sum of KullbackLeibler divergence of overall data points using a gradient descent method.
Note KullbackLeibler divergence or KL divergence is is a measure of how one probability distribution diverges from a second, expected probability distribution.
Those who are interested in knowing the detailed working of an algorithm can refer to this research paper.
In simpler terms, tDistributed stochastic neighbor embedding (tSNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding lowdimensional points in the embedding.
In this way, tSNE maps the multidimensional data to a lower dimensional space and attempts to find patterns in the data by identifying observed clusters based on similarity of data points with multiple features. However, after this process, the input features are no longer identifiable, and you cannot make any inference based only on the output of tSNE. Hence it is mainly a data exploration and visualization technique.
Using tSNE in Python
Now you will apply tSNE on an open source dataset and try to visualize the results. In addition, you will also visualize the output of PCA on the same dataset to compare it with that of tSNE.
The dataset you will be using is FashionMNIST dataset and can be found here (don't forget to check it out!). The FashionMNIST dataset is a 28x28 grayscale image of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images, and the test set has 10,000 images. FashionMNIST is a replacement for the original MNIST dataset for producing better results, the image dimensions, training and test splits are similar to the original MNIST dataset. Similar to MNIST the FashionMNIST also consists of 10 labels, but instead of handwritten digits, you have 10 different labels of fashion accessories like sandals, shirt, trousers, etc.
Each training and test example is assigned to one of the following labels:
 0 Tshirt/top
 1 Trouser
 2 Pullover
 3 Dress
 4 Coat
 5 Sandal
 6 Shirt
 7 Sneaker
 8 Bag
 9 Ankle boot
You can find mnist_reader.py
file which is present in utils
folder of the Github repository. This file has a function load_mnist()
which is specifically written to read .gz
files, or you can define the function as follows:
# Fashion MNIST reader
def load_mnist(path, kind='train'):
import os
import gzip
import numpy as np
"""Load MNIST data from `path`"""
labels_path = os.path.join(path,
'%slabelsidx1ubyte.gz'
% kind)
images_path = os.path.join(path,
'%simagesidx3ubyte.gz'
% kind)
with gzip.open(labels_path, 'rb') as lbpath:
labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
offset=8)
with gzip.open(images_path, 'rb') as imgpath:
images = np.frombuffer(imgpath.read(), dtype=np.uint8,
offset=16).reshape(len(labels), 784)
return images, labels
You will apply the algorithms on the training set data. Use the load_mnist()
function you just defined and pass the first argument as the location of the data files and second argument kind
as 'train'
to read the training set data.
X_train, y_train = load_mnist('Downloads/datasets/mnist/fashion_mnist', kind='train')
You can check the dimensions of the data using shape
attribute. You will notice the training set has 60000 sample points and 784 features.
X_train.shape
(60000, 784)
The variable y_train
contains the labels of every sample marked as integers from 0 to 9.
y_train
array([9, 0, 0, ..., 3, 0, 5], dtype=uint8)
Next, you will import the necessary libraries you will be using throughout the tutorial. To maintain reproducibility, you will define a random state variable RS
and set it to 123.
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
rc={"lines.linewidth": 2.5})
RS = 123
In order to visualize the results of both the algorithms, you will create a fashion_scatter()
function which takes two arguments: 1. x
, which is a 2D numpy array containing the output of the algorithm and 2. colors
, which is 1D numpy array containing the labels of the dataset. The function will render a scatter plot with as many unique colors as the number of classes in the variable colors
.
# Utility function to visualize the outputs of PCA and tSNE
def fashion_scatter(x, colors):
# choose a color palette with seaborn.
num_classes = len(np.unique(colors))
palette = np.array(sns.color_palette("hls", num_classes))
# create a scatter plot.
f = plt.figure(figsize=(8, 8))
ax = plt.subplot(aspect='equal')
sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40, c=palette[colors.astype(np.int)])
plt.xlim(25, 25)
plt.ylim(25, 25)
ax.axis('off')
ax.axis('tight')
# add the labels for each digit corresponding to the label
txts = []
for i in range(num_classes):
# Position of each label at median of data points.
xtext, ytext = np.median(x[colors == i, :], axis=0)
txt = ax.text(xtext, ytext, str(i), fontsize=24)
txt.set_path_effects([
PathEffects.Stroke(linewidth=5, foreground="w"),
PathEffects.Normal()])
txts.append(txt)
return f, ax, sc, txts
To make sure you don’t burden your machine concerning memory and time you will only use the first 20,000 samples to run the algorithms on. Also, don't forget to check whether the first 20000 samples cover samples from all the 10 classes.
# Subset first 20k data points to visualize
x_subset = X_train[0:20000]
y_subset = y_train[0:20000]
print np.unique(y_subset)
[0 1 2 3 4 5 6 7 8 9]
Now you will use PCA on the subset data and visualize its output. You will use sklearn's PCA with n_components
(Number of principal components to keep) equal to 4
. There are a variety of parameters available in sklearn that can be tweaked, but for now, you will use default values. If you wish to know more about these parameters check out sklearn's PCA documentation.
from sklearn.decomposition import PCA
time_start = time.time()
pca = PCA(n_components=4)
pca_result = pca.fit_transform(x_subset)
print 'PCA done! Time elapsed: {} seconds'.format(time.time()time_start)
PCA done! Time elapsed: 1.55137515068 seconds
Notice the time taken by PCA to run on x_subset
with 20000 samples. Quite fast isn't it?
Now you will store all the four principal components in a new DataFrame pca_df
and check the amount of variance of the data explained by these four components.
pca_df = pd.DataFrame(columns = ['pca1','pca2','pca3','pca4'])
pca_df['pca1'] = pca_result[:,0]
pca_df['pca2'] = pca_result[:,1]
pca_df['pca3'] = pca_result[:,2]
pca_df['pca4'] = pca_result[:,3]
print 'Variance explained per principal component: {}'.format(pca.explained_variance_ratio_)
Variance explained per principal component: [0.29021329 0.1778743 0.06015076 0.04975864]
Note that the first and second principal components explain almost 48% of the variance in the data x_subset
. You will use these two components for visualization by passing them in the fashion_scatter()
function.
top_two_comp = pca_df[['pca1','pca2']] # taking first and second principal component
fashion_scatter(top_two_comp.values,y_subset) # Visualizing the PCA output
(<matplotlib.figure.Figure at 0x7f380ee716d0>,
<matplotlib.axes._subplots.AxesSubplot at 0x7f380d56b710>,
<matplotlib.collections.PathCollection at 0x7f380eedec90>,
[Text(774.45,689.695,u'0'),
Text(42.8638,1429.36,u'1'),
Text(1098.28,376.975,u'2'),
Text(329.272,1111.49,u'3'),
Text(1266.97,295.466,u'4'),
Text(1589.56,61.8677,u'5'),
Text(713.473,65.687,u'6'),
Text(1521.41,335.666,u'7'),
Text(63.8177,1003.33,u'8'),
Text(645.691,1168.24,u'9')])
As you can notice, PCA has tried to separate the different points and form clustered groups of similar points. Also, the plot can be used for exploratory analysis. The principal components (pca1, pca2, etc.) can be used as features in classification or clustering algorithms.
Now you will do the same exercise using the tSNE algorithm. Scikitlearn has an implementation of tSNE available, and you can check its documentation here. It provides a wide variety of tuning parameters for tSNE, and the most notable ones are:
 n_components (default: 2): Dimension of the embedded space.
 perplexity (default: 30): The perplexity is related to the number of nearest neighbors that are used in other manifold learning algorithms. Consider selecting a value between 5 and 50.
 early_exaggeration (default: 12.0): Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them.
 learning_rate (default: 200.0): The learning rate for tSNE is usually in the range (10.0, 1000.0).
 n_iter (default: 1000): Maximum number of iterations for the optimization. Should be at least 250.
 method (default: ‘barnes_hut’): BarnesHut approximation runs in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time.
You will run tSNE on x_subset
with default parameters.
from sklearn.manifold import TSNE
import time
time_start = time.time()
fashion_tsne = TSNE(random_state=RS).fit_transform(x_subset)
print 'tSNE done! Time elapsed: {} seconds'.format(time.time()time_start)
tSNE done! Time elapsed: 1172.31058598 seconds
It can be seen that tSNE takes considerably longer time to execute on the same sample size of data than PCA.
Visualizing the output of tSNE using fashion_scatter()
function:
fashion_scatter(fashion_tsne, y_subset)
(<matplotlib.figure.Figure at 0x7f380ee71410>,
<matplotlib.axes._subplots.AxesSubplot at 0x7f380ee92190>,
<matplotlib.collections.PathCollection at 0x7f384c0530d0>,
[Text(41.5947,6.30529,u'0'),
Text(16.6429,61.3673,u'1'),
Text(13.9577,39.045,u'2'),
Text(23.7094,24.126,u'3'),
Text(31.9231,32.8377,u'4'),
Text(36.6758,5.25523,u'5'),
Text(16.933,17.417,u'6'),
Text(59.4418,11.159,u'7'),
Text(22.5197,37.1971,u'8'),
Text(34.8161,39.4658,u'9')])
As you can notice, this is already a significant improvement over the PCA visualization you created earlier. You can see that the digits are very clearly clustered in their own little group. If you would now use a clustering algorithm to pick out the separate clusters, you could probably quite accurately assign new points to a label.
Scikitlearn's documentation of tSNE explicitly states that:
It is highly recommended to use another dimensionality reduction method (e.g., PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g., 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.
You’ll now take this recommendation to heart and actually, reduce the number of dimensions before feeding the data into the tSNE algorithm. For this, you’ll use PCA again. You will first create a new dataset containing the fifty dimensions generated by the PCA reduction algorithm, then use this dataset to perform the tSNE.
time_start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(x_subset)
print 'PCA with 50 components done! Time elapsed: {} seconds'.format(time.time()time_start)
print 'Cumulative variance explained by 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_))
PCA with 50 components done! Time elapsed: 2.34411096573 seconds
Cumulative variance explained by 50 principal components: 0.862512386067
Now you will apply tSNE on the PCA's output pca_result_50
.
import time
time_start = time.time()
fashion_pca_tsne = TSNE(random_state=RS).fit_transform(pca_result_50)
print 'tSNE done! Time elapsed: {} seconds'.format(time.time()time_start)
tSNE done! Time elapsed: 491.321714163 seconds
Notice how the time taken to run tSNE is reduced drastically.
fashion_scatter(fashion_pca_tsne, y_subset)
(<matplotlib.figure.Figure at 0x7f380d56be90>,
<matplotlib.axes._subplots.AxesSubplot at 0x7f380d580e10>,
<matplotlib.collections.PathCollection at 0x7f380ee207d0>,
[Text(6.34246,19.2471,u'0'),
Text(26.9357,58.5841,u'1'),
Text(18.1853,41.8484,u'2'),
Text(50.7503,17.6256,u'3'),
Text(33.4893,35.3055,u'4'),
Text(36.2952,5.40452,u'5'),
Text(13.7695,19.4419,u'6'),
Text(61.3563,12.7764,u'7'),
Text(20.3385,34.2562,u'8'),
Text(38.9556,44.1898,u'9')])
You can notice the plot is more or less similar to the previous one but with one difference. Notice the data points corresponding to label 0
(or images corresponding to Tshirt/top) are clustered more closely now as compared to the earlier plot. Also, the time taken to run tSNE has decreased.
PCA vs. tSNE
Although both PCA and tSNE have their own advantages and disadvantages, some key differences between PCA and tSNE can be noted as follows:
 tSNE is computationally expensive and can take several hours on millionsample datasets where PCA will finish in seconds or minutes.
 PCA it is a mathematical technique, but tSNE is a probabilistic one.
 Linear dimensionality reduction algorithms, like PCA, concentrate on placing dissimilar data points far apart in a lower dimension representation. But in order to represent high dimension data on low dimension, nonlinear manifold, it is essential that similar data points must be represented close together, which is something tSNE does not PCA.
 Sometimes in tSNE different runs with the same hyperparameters may produce different results hence multiple plots must be observed before making any assessment with tSNE, while this is not the case with PCA.
 Since PCA is a linear algorithm, it will not be able to interpret the complex polynomial relationship between features while tSNE is made to capture exactly that.
Conclusion
Congrats!! You have made it to the end of this tutorial. You started with learning about Dimensionality Reduction and explored two of its most popular techniques, Principal Component Analysis and tDistributed Stochastic Neighbor Embedding. You also applied these two algorithms on the same dataset and compared their results by creating some beautiful plots. However, you have merely scratched the surface as there is a lot that can be explored in tSNE. Hope this tutorial gives a head start for your next journey with tSNE.
If you would like to learn more about Python, take DataCamp's Introduction to Data Visualization with Matplotlib course.
Learn more about Python
Intermediate Python
Introduction to Data Science in Python
PrecisionRecall Curve in Python Tutorial
Vidhi Chugh
14 min