
Preprocessing in Data Science (Part 3): Scaling Synthesized Data

You can preprocess the heck out of your data but the proof is in the pudding: how well does your model then perform?
Updated May 2016  · 10 min read

In two previous posts, I explored the role of preprocessing data in the machine learning pipeline. In particular, I checked out the k-Nearest Neighbors (k-NN) and logistic regression algorithms and saw how scaling numerical data strongly influenced the performance of the former but not that of the latter, as measured, for example, by accuracy (see the Glossary below or the previous articles for definitions of scaling, k-NN and other relevant terms). The real take-home message was that preprocessing doesn’t occur in a vacuum: you can preprocess the heck out of your data, but the proof is in the pudding, namely how well your model then performs.

Scaling numerical data (that is, multiplying all instances of a variable by a constant in order to change that variable’s range) has two related purposes: i) if your measurements are in meters and mine are in miles, then, once we both scale our data, they end up on the same footing, and ii) if two variables have vastly different ranges, the one with the larger range may dominate your predictive model, even though it may be less important to your target variable than the variable with the smaller range. What we saw is that the problem identified in ii) occurs with k-NN, which explicitly looks at how close data points are to one another, but not with logistic regression, which, when being trained, will shrink the relevant coefficient to account for the lack of scaling.
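
As a minimal illustration of these two points (my own sketch, not taken from the earlier articles): scikit-learn’s scale function standardizes each column to mean 0 and unit variance, so two variables measured on wildly different scales end up directly comparable.

# Two made-up features with very different spreads become comparable after scaling
import numpy as np
from sklearn.preprocessing import scale

rng = np.random.RandomState(0)
heights_m = rng.normal(1.7, 0.1, size=1000)            # heights in metres
distances_miles = rng.normal(100.0, 25.0, size=1000)   # distances in miles

X_demo = np.column_stack((heights_m, distances_miles))
print(X_demo.std(axis=0))         # spreads differ by a factor of roughly 250
print(scale(X_demo).std(axis=0))  # both roughly 1 after scaling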

As the data we used in the previous articles was real-world data, all we could see was how the models performed before and after scaling. Here, in order to see how noise in the form of nuisance variables (those which do not affect the target variable but may affect your model) changes model performance both pre- and post-scaling, I’ll synthesize a dataset in which I can control the precise nature of the nuisance variable. We’ll see that the noisier the synthesized data, the more important scaling will be for k-NN. All examples herein will be in Python. If you’re not familiar with Python, you can check out our DataCamp courses here. I will make use of the libraries pandas for our DataFrame needs and scikit-learn for our machine learning needs.

In the code chunk below, we use scikit-learn’s make_blobs function to generate 2000 data points that are in 4 clusters (each data point has 2 predictor variables and 1 target variable).

# Generate some clustered data (blobs!)
import numpy as np
from sklearn.datasets import make_blobs
n_samples = 2000
X, y = make_blobs(n_samples, centers=4, n_features=2,
                  random_state=0)

Plotting the synthesized data

We’ll now plot in the plane the data that we’ve synthesized. Each axis is a predictor variable and the colour is a key to the target variable:

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.scatter(X[:,0] , X[:,1],  c = y, alpha = 0.7);
plt.subplot(1, 2, 2);
plt.hist(y)
plt.show()

[Figure: scatter plot of the two predictor variables coloured by target class (left) and histogram of the target classes (right)]

Note: we can see in the second plot that all of the target classes are equally represented. In this case (or even when they are only approximately equally represented), we say that the classes of y are balanced.
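
If you’d rather confirm this numerically than by eye, counting the class labels (a small addition of mine, not in the original notebook) does the trick:

# Count how many data points fall into each of the 4 classes
print(np.bincount(y))   # should print roughly equal counts, e.g. [500 500 500 500]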

I now want to plot histograms of the features (predictor variables):

import pandas as pd
df = pd.DataFrame(X)
pd.DataFrame.hist(df, figsize=(20,5));

[Figure: histograms of the two predictor variables]

Let’s now split the data into training and test sets and plot both sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.title('training set')
plt.scatter(X_train[:,0] , X_train[:,1],  c = y_train, alpha = 0.7);
plt.subplot(1, 2, 2);
plt.scatter(X_test[:,0] , X_test[:,1],  c = y_test, alpha = 0.7);
plt.title('test set')
plt.show()

[Figure: scatter plots of the training set (left) and test set (right)]

Looking good! Now let’s instantiate a k-Nearest Neighbors voting classifier and train it on our training set:

from sklearn import neighbors, linear_model
knn = neighbors.KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)

Now that we have trained our model, we can apply it to our test set and compute the accuracy:

print('k-NN score for test set: %f' % knn_model.score(X_test, y_test))
`k-NN score for test set: 0.935000`

We can also compute its accuracy on the training set. We would expect it to perform better on the training set than on the test set:

print('k-NN score for training set: %f' % knn_model.score(X_train, y_train))
`k-NN score for training set: 0.941875`

It is worth reiterating that the default scoring method for k-NN in scikit-learn is accuracy. To check out a variety of other metrics, we can also use scikit-learn’s classification report:

from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model.predict(X_test)
print(classification_report(y_true, y_pred))
                 precision    recall  f1-score   support
    
              0       0.87      0.90      0.88       106
              1       0.98      0.93      0.95       102
              2       0.90      0.92      0.91       100
              3       1.00      1.00      1.00        92
    
    avg / total       0.94      0.94      0.94       400

Now with scaling

I’ll now scale the predictor variables and then use k-NN again:

from sklearn.preprocessing import scale
Xs = scale(X)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.scatter(Xs_train[:,0] , Xs_train[:,1],  c = y_train, alpha = 0.7);
plt.title('scaled training set')
plt.subplot(1, 2, 2);
plt.scatter(Xs_test[:,0] , Xs_test[:,1],  c = y_test, alpha = 0.7);
plt.title('scaled test set')
plt.show()

[Figure: scatter plots of the scaled training set (left) and scaled test set (right)]

knn_model_s = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_s.score(Xs_test, y_test))
`k-NN score for test set: 0.935000`

It doesn’t perform any better with scaling! This is most likely because both features were already on roughly the same scale (the quick check below confirms this), and scaling really pays off when variables have widely varying ranges. To see this in action, we’re going to add another feature. Moreover, this feature will bear no relevance to the target variable: it will be mere noise.
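
Here is that quick check. It isn’t in the original notebook, just a couple of lines comparing the spreads of the two blob features before any noise column is added:

# The two original features have spreads of the same order of magnitude,
# so standardizing them barely changes the k-NN distances
print(X.min(axis=0), X.max(axis=0))  # similar ranges for both columns
print(X.std(axis=0))                 # similar standard deviations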

Adding noise to the signal:

We now add a third variable consisting of Gaussian noise with mean 0 and standard deviation \(\sigma\). We’ll call \(\sigma\) the strength of the noise, and we’ll see that the stronger the noise, the worse the performance of k-Nearest Neighbors.

# Add noise column to predictor variables
ns = 10**(3) # Strength of noise term
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis = 1)

We’ll now use the mpl_toolkits.mplot3d toolkit to plot the 3D data:

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(15,10));
ax = fig.add_subplot(111, projection='3d' , alpha = 0.5);
ax.scatter(Xn[:,0], Xn[:,1], Xn[:,2], c = y);

[Figure: 3D scatter plot of the two original predictor variables and the noise column, coloured by target class]

Now let’s see how our model performs on the new data:

Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
knn = neighbors.KNeighborsClassifier()
knn_model = knn.fit(Xn_train, y_train)
print('k-NN score for test set: %f' % knn_model.score(Xn_test, y_test))

`k-NN score for test set: 0.400000`

This is a horrible model! How about we scale the features and check out the performance?

Xns = scale(Xn)
# Manual 80/20 split: the first 20% of the rows form the test set, the rest the training set
s = int(.2*n_samples)
Xns_train = Xns[s:]
y_train = y[s:]
Xns_test = Xns[:s]
y_test = y[:s]
knn = neighbors.KNeighborsClassifier()
knn_models = knn.fit(Xns_train, y_train)
print('k-NN score for test set: %f' % knn_models.score(Xns_test, y_test))
`k-NN score for test set: 0.907500`

Great: after scaling the data, the model performs nearly as well as it did before the noise was introduced. Let’s now check out model performance as a function of noise strength.

The stronger the noise, the bigger the problem:

We’re now going to see how the noise strength affects model accuracy. As we’ll need to run the same code a number of times, let’s wrap the main parts in a small function:

def accu(X, y):
    # Split, fit a k-NN classifier, and return the test-set accuracy
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    knn = neighbors.KNeighborsClassifier()
    knn_model = knn.fit(X_train, y_train)
    return knn_model.score(X_test, y_test)

# Noise strengths from 10^-1 to 10^5 (float base so negative exponents work)
noise = [10.0**i for i in np.arange(-1, 6)]
A1 = np.zeros(len(noise))  # accuracies on unscaled data
A2 = np.zeros(len(noise))  # accuracies on scaled data
for count, ns in enumerate(noise):
    newcol = np.transpose([ns*np.random.randn(n_samples)])
    Xn = np.concatenate((X, newcol), axis=1)
    Xns = scale(Xn)
    A1[count] = accu(Xn, y)
    A2[count] = accu(Xns, y)

We now plot accuracy as a function of noise strength (note the logarithmic x-axis):

plt.scatter( noise, A1 )
plt.plot( noise, A1, label = 'unscaled', linewidth = 2)
plt.scatter( noise, A2 , c = 'r')
plt.plot( noise, A2 , label = 'scaled', linewidth = 2)
plt.xscale('log')
plt.xlabel('Noise strength')
plt.ylabel('Accuracy')
plt.legend(loc=3);

[Figure: accuracy as a function of noise strength for unscaled and scaled data]

See in the above figure that the more noise there is in the nuisance variable, the more important it is to scale your data for the k-NN model! Below, you’ll have the opportunity to do the same for logistic regression. To conclude, we have seen the essential place that preprocessing, in its scaling and centering incarnation, occupies in the data science pipeline, and we have approached it in a way that promotes a holistic view of the challenges of machine learning. In future articles, I hope to extend this discussion to other types of preprocessing, such as transformations of numerical data and preprocessing of categorical data, both essential aspects of any data scientist’s toolkit.

Exercise for the avid reader: try fitting a logistic regression model to the synthesized datasets above and check out the model performance. How does accuracy depend on noise strength for scaled and unscaled data, respectively? You can do so in the DataCamp Light widget below, whose starter code is reproduced here. Change the exponent of 10 to alter the amount of noise (first try the range that I tried above for k-NN) and set sc = True if you want to scale your features. You can also check out DataCamp Light on GitHub!

# Below, change the exponent of 10 to alter the amount of noise
ns = 10**(3) # Strength of noise term
# Set sc = True if you want to scale your features
sc = False

# Import packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import neighbors, linear_model
from sklearn.preprocessing import scale
from sklearn.datasets import make_blobs

# Generate some data
n_samples = 2000
X, y = make_blobs(n_samples, centers=4, n_features=2,
                  random_state=0)

# Add noise column to predictor variables
newcol = np.transpose([ns*np.random.randn(n_samples)])
Xn = np.concatenate((X, newcol), axis=1)

# Scale if desired
if sc:
    Xn = scale(Xn)

# Train model and test after splitting
Xn_train, Xn_test, y_train, y_test = train_test_split(Xn, y, test_size=0.2, random_state=42)
lr = linear_model.LogisticRegression()
lr_model = lr.fit(Xn_train, y_train)
print('logistic regression score for test set: %f' % lr_model.score(Xn_test, y_test))
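
If you then want to reproduce the accuracy-versus-noise-strength comparison for logistic regression, one possible sketch (my own, reusing the names defined in the snippet above rather than anything from the original exercise solution) is to mirror the k-NN loop with the classifier swapped out:

# Sketch: logistic regression accuracy vs noise strength, unscaled vs scaled
def accu_lr(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    lr_model = linear_model.LogisticRegression().fit(X_train, y_train)
    return lr_model.score(X_test, y_test)

for ns in [10.0**i for i in np.arange(-1, 6)]:
    newcol = np.transpose([ns*np.random.randn(n_samples)])
    Xn = np.concatenate((X, newcol), axis=1)
    print('noise strength %g: unscaled %.3f, scaled %.3f'
          % (ns, accu_lr(Xn, y), accu_lr(scale(Xn), y)))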

Glossary

Supervised learning: The task of inferring a target variable from predictor variables. For example, inferring the target variable ‘presence of heart disease’ from predictor variables such as ‘age’, ‘sex’, and ‘smoker status’.

Classification task: A supervised learning task is a classification task if the target variable consists of categories (e.g. ‘click’ or ‘not’, ‘malignant’ or ‘benign’ tumour).

Regression task: A supervised learning task is a regression task if the target variable is a continuously varying variable (e.g. price of a house) or an ordered categorical variable such as ‘quality rating of wine’.

k-Nearest Neighbors: An algorithm for classification tasks, in which a data point is assigned the label decided by a majority vote of its k nearest neighbors.

Preprocessing: Any number of operations data scientists will use to get their data into a form more appropriate for what they want to do with it. For example, before performing sentiment analysis of Twitter data, you may want to strip out any HTML tags and extra white space, expand abbreviations, and split the tweets into lists of the words they contain.
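
As a toy illustration of that kind of text preprocessing (my own sketch, not from the article), you might strip tags and tokenize a tweet like this:

import re

tweet = "Loving this tutorial! <br> Check it out&nbsp;#datascience"
no_tags = re.sub(r'<[^>]+>', ' ', tweet)      # remove HTML tags
no_entities = re.sub(r'&\w+;', ' ', no_tags)  # remove HTML entities such as &nbsp;
words = no_entities.lower().split()           # collapse whitespace and split into words
print(words)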

Centering and Scaling: These are both forms of preprocessing numerical data, that is, data consisting of numbers, as opposed to categories or strings, for example; centering a variable is subtracting the mean of the variable from each data point so that the new variable’s mean is 0; scaling a variable is multiplying each data point by a constant in order to alter the range of the data. See the body of the article for the importance of these, along with examples.
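
In code, centering and scaling a single numerical variable might look like this minimal numpy sketch (my own illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_centered = x - x.mean()                   # centering: the new mean is 0
x_rescaled = x / x.std()                    # scaling: multiply (here divide) by a constant to change the range
x_standardized = (x - x.mean()) / x.std()   # both together, which is what sklearn's scale() does
print(x_centered.mean(), x_standardized.std())   # 0.0 and 1.0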

This article was generated from a Jupyter notebook. You can download the notebook here.
