Preprocessing for Machine Learning in Python

In this course you'll learn how to get your cleaned data ready for modeling.

4 Hours, 20 Videos, 62 Exercises, 26,559 Learners
4700 XP

Course Description

This course covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where you get your data ready for modeling. Preprocessing comes into play after you have imported and cleaned your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to make the most of the information in your dataset, and select the best features to improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

  1. Introduction to Data Preprocessing

    In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

    What is data preprocessing? (50 xp)
    Missing data - columns (50 xp)
    Missing data - rows (100 xp)
    Working with data types (50 xp)
    Exploring data types (50 xp)
    Converting a column type (100 xp)
    Class distribution (50 xp)
    Class imbalance (50 xp)
    Stratified sampling (100 xp)
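
The topics in this chapter (dropping rows with missing data, converting a column type, and stratified sampling) can be sketched in a few lines. This is only an illustration using the Python standard library, not the course's own exercises. The records, the column names `hours` and `label`, and the helper `stratified_split` are all hypothetical; in practice a dataframe library would usually do this work.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy data: "hours" arrives as text, None marks a missing value.
raw = (
    [{"hours": str(h), "label": "no"} for h in range(8)]
    + [{"hours": str(h), "label": "yes"} for h in range(8, 12)]
    + [{"hours": None, "label": "no"}, {"hours": "3", "label": None}]
)

# 1. Missing data: drop any row that contains a missing value.
complete = [r for r in raw if all(v is not None for v in r.values())]

# 2. Data types: convert the text column to a numeric type.
for r in complete:
    r["hours"] = float(r["hours"])


def stratified_split(rows, key, test_fraction, seed=0):
    """Split rows so each class keeps roughly its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[key]].append(r)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# 3. Class imbalance: sample each class separately to preserve its proportion.
train, test = stratified_split(complete, "label", test_fraction=0.25)
print(Counter(r["label"] for r in train))
print(Counter(r["label"] for r in test))
```

With this toy data, the 2:1 ratio of "no" to "yes" is preserved in both parts: six to three in the training set and two to one in the test set.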
  2. Standardizing Data

    This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

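
As a rough illustration of what standardizing a feature means, the sketch below rescales one column to mean 0 and standard deviation 1 (z-scores) using only the standard library. The `income` values and the `standardize` helper are hypothetical, and the course itself may use different tools and methods. scikit-learn's `StandardScaler` applies the same formula column by column.

```python
import statistics

# Hypothetical feature on a large scale.
income = [20_000.0, 35_000.0, 50_000.0, 65_000.0, 80_000.0]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


scaled = standardize(income)
print([round(z, 3) for z in scaled])
```

After scaling, the feature no longer dominates models that are sensitive to the magnitude of their inputs, such as distance-based methods.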
  4. Selecting features for modeling

    This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).

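
One of the techniques named above, dropping redundant features, can be sketched as a simple correlation filter. This is a minimal standard-library illustration under stated assumptions, not necessarily the course's method: the feature table, the 0.95 threshold, and the helpers `pearson` and `drop_redundant` are all hypothetical. Text vectors and PCA are not shown here.

```python
import math

# Hypothetical feature table: column name -> values.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # near-duplicate of height_cm
    "age": [34.0, 21.0, 45.0, 29.0, 38.0],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(table, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in table.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


selected = drop_redundant(features)
print(list(selected))  # → ['height_cm', 'age']
```

Here `height_in` is dropped because it carries almost the same information as `height_cm`, while `age` is kept because its correlation with height is weak.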

In the following tracks

Machine Learning Scientist


Nick Solomon, Kara Woo
DataCamp Content Creator

Course Instructor

DataCamp offers interactive R, Python, Spreadsheets, SQL, and shell courses on topics in data science, statistics, and machine learning. Learn from a team of expert teachers in the comfort of your browser with video lessons and fun coding challenges and projects.
