Ben Bolstad has completed

Preprocessing for Machine Learning in Python

4 hr

4,700 XP

Loved by learners at thousands of companies

Course Description

This course covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features that make the most of the information in your dataset, and select the features that most improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.


1
Introduction to Data Preprocessing
Free
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
Introduction to preprocessing
50 xp
Exploring missing data
50 xp
Dropping missing data
100 xp
Working with data types
50 xp
Exploring data types
50 xp
Converting a column type
100 xp
Training and test sets
50 xp
Class imbalance
50 xp
Stratified sampling
100 xp
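
The steps named in this chapter (counting and dropping missing values, converting a column type, and stratified sampling) can be sketched roughly as follows. This is a minimal illustration rather than the course's own exercise code: the toy DataFrame and its column names `length` and `label` are assumptions made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "length": ["1.5", "2.0", None, "3.2", "4.1", "0.8", "2.7", "5.0", "1.1"],
    "label": ["a", "a", "a", "a", "b", "b", "a", "a", "a"],
})

# Count missing values per column, then drop rows that contain any.
missing_counts = df.isna().sum()
clean = df.dropna().copy()

# The numbers were read in as strings, so convert them to floats.
clean["length"] = clean["length"].astype(float)

# The classes are imbalanced (6 "a" vs. 2 "b"), so stratify on the label
# to keep the same class ratio in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    clean[["length"]],
    clean["label"],
    test_size=0.5,
    stratify=clean["label"],
    random_state=0,
)

print(missing_counts)
print(y_train.value_counts())
print(y_test.value_counts())
```

Without `stratify`, a random split of such a small, imbalanced dataset could leave one of the sets with no examples of the rare class at all.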
2
Standardizing Data
This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.
Standardization
50 xp
When to standardize
50 xp
Modeling without normalizing
100 xp
Log normalization
50 xp
Checking the variance
50 xp
Log normalization in Python
100 xp
Scaling data for feature comparison
50 xp
Scaling data - investigating columns
50 xp
Scaling data - standardizing columns
100 xp
Standardized data and modeling
50 xp
KNN on non-scaled data
100 xp
KNN on scaled data
100 xp
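
As a rough sketch of the two techniques this chapter names, the example below applies log normalization to a high-variance column and then scales the columns to zero mean and unit variance. The data and the column names `small_scale` and `high_variance` are hypothetical, and `StandardScaler` is one common choice of scaler, not necessarily the only one the course uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; one column has a much larger variance than the other.
df = pd.DataFrame({
    "small_scale": [1.0, 2.0, 3.0, 4.0],
    "high_variance": [10.0, 100.0, 1000.0, 10000.0],
})

# Check the variance of each column to spot candidates for normalization.
variances = df.var()

# Log normalization: take the natural log to compress the large values.
df["high_variance_log"] = np.log(df["high_variance"])

# Standardization: rescale each column to mean 0 and variance 1.
scaler = StandardScaler()
columns = ["small_scale", "high_variance_log"]
scaled = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)

print(variances)
print(scaled)
```

In a modeling workflow, the scaler would usually be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training. Distance-based models such as k-nearest neighbors are especially sensitive to feature scale.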
3
Feature Engineering
In this chapter you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
Feature engineering
50 xp
Feature engineering knowledge test
50 xp
Identifying areas for feature engineering
50 xp
Encoding categorical variables
50 xp
Encoding categorical variables - binary
100 xp
Encoding categorical variables - one-hot
100 xp
Engineering numerical features
50 xp
Aggregating numerical features
100 xp
Extracting datetime components
100 xp
Engineering text features
50 xp
Extracting string patterns
100 xp
Vectorizing text
100 xp
Text classification using tf/idf vectors
100 xp
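
The kinds of feature engineering listed in this chapter can be illustrated with a short sketch: one-hot encoding a categorical column, aggregating numerical columns, extracting a datetime component, pulling a number out of a string, and vectorizing text with tf-idf. All data and column names below are hypothetical and chosen only for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; every column name here is an assumption.
df = pd.DataFrame({
    "category": ["trail", "road", "trail"],
    "run1": [10.0, 12.0, 9.0],
    "run2": [11.0, 13.0, 8.0],
    "date": ["2011-07-10", "2012-01-03", "2013-11-21"],
    "length": ["0.8 miles", "1.0 mile", "2.5 miles"],
    "title": ["easy forest walk", "city road loop", "steep forest climb"],
})

# One-hot encoding: one indicator column per category value.
one_hot = pd.get_dummies(df["category"])

# Aggregating numerical features: average several measurements row by row.
df["run_mean"] = df[["run1", "run2"]].mean(axis=1)

# Extracting a datetime component: parse the strings, then take the month.
df["month"] = pd.to_datetime(df["date"]).dt.month

# Extracting a string pattern: capture the decimal number with a regex.
df["length_num"] = (
    df["length"].str.extract(r"(\d+\.\d+)", expand=False).astype(float)
)

# Vectorizing text: each title becomes a row of tf-idf weights.
tfidf = TfidfVectorizer()
text_matrix = tfidf.fit_transform(df["title"])

print(one_hot)
print(df[["run_mean", "month", "length_num"]])
print(text_matrix.shape)
```

The tf-idf matrix is sparse, with one column per vocabulary term, and can be passed directly to many scikit-learn classifiers.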
4
Selecting Features for Modeling
This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).
Feature selection
50 xp
When to use feature selection
50 xp
Identifying areas for feature selection
50 xp
Removing redundant features
50 xp
Selecting relevant features
100 xp
Checking for correlated features
100 xp
Selecting features using text vectors
50 xp
Exploring text vectors, part 1
100 xp
Exploring text vectors, part 2
100 xp
Training Naive Bayes with feature selection
100 xp
Dimensionality reduction
50 xp
Using PCA
100 xp
Training a model with PCA
100 xp
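
Two of the techniques from this chapter, dropping a redundant (highly correlated) feature and reducing dimensionality with PCA, might look something like the sketch below. The synthetic data, the column names, and the 0.95 correlation threshold are all assumptions picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical synthetic data: "a_copy" is almost a rescaled copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.01 * rng.normal(size=100),
    "b": b,
})

# Absolute pairwise correlations between the columns.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Drop any column that is highly correlated with an earlier column.
# The 0.95 threshold is an arbitrary choice for this example.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

# PCA: project the original three columns onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(df)

print(to_drop)
print(pca.explained_variance_ratio_)
```

Because `a_copy` carries almost no information beyond `a`, two principal components capture nearly all of the variance in the three original columns.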
5
Putting It All Together
Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.
UFOs and preprocessing
50 xp
Checking column types
100 xp
Dropping missing data
100 xp
Categorical variables and standardization
50 xp
Extracting numbers from strings
100 xp
Identifying features for standardization
100 xp
Engineering new features
50 xp
Encoding categorical variables
100 xp
Features from dates
100 xp
Text vectorization
100 xp
Feature selection and modeling
50 xp
Selecting the ideal dataset
100 xp
Modeling the UFO dataset, part 1
100 xp
Modeling the UFO dataset, part 2
100 xp
Congratulations!
50 xp


Datasets

Hiking data, Wine data, UFO sightings data, Volunteering data

Collaborators

Nick Solomon

Kara Woo

Prerequisites

Cleaning Data in Python, Supervised Learning with scikit-learn

James Chapman

AI Curriculum Manager, DataCamp

James is a Curriculum Manager at DataCamp, where he collaborates with experts from industry and academia to create courses on AI, data science, and analytics. He has led nine DataCamp courses on diverse topics in Python, R, AI developer tooling, and Google Sheets. He has a Master's degree in Physics and Astronomy from Durham University, where he specialized in high-redshift quasar detection. In his spare time, he enjoys restoring retro toys and electronics.


