Course

Feature Engineering for Machine Learning in Python

Skill Level: Intermediate

4.8+

Updated 02/2023

Create new features to improve the performance of your Machine Learning models.

Python, Machine Learning

4 hr

16 videos

53 Exercises

4,350 XP

38,881

Statement of Accomplishment

Course Description

Every day you read about breakthroughs in how the newest applications of machine learning are changing the world. This reporting often glosses over the huge amount of data munging and feature engineering that must be done before any of these models can be used. In this course, you will learn how to do just that. You will work with the Stack Overflow Developer Survey and historic US presidential inauguration addresses to understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. This course gives you hands-on experience in preparing data for your own machine learning models.

Prerequisites

Supervised Learning with scikit-learn

1

Creating Features

In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

Why generate features?

Getting to know your data

Selecting specific data types

Dealing with categorical features

One-hot encoding and dummy variables

Dealing with uncommon categories

Numeric variables

Binarizing columns

Binning values
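
The topics in this chapter can be sketched in a few lines of pandas. The example below is a minimal illustration, not the course's own exercise code: the DataFrame, its column names (`Country`, `Salary`), the rarity threshold, and the bin edges are all made up for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey-style data; column names and values are illustrative only.
df = pd.DataFrame({
    "Country": ["USA", "India", "USA", "France", "India", "Spain"],
    "Salary": [95000, 30000, 120000, 60000, 45000, 52000],
})

# Uncommon categories: countries seen fewer than 2 times are grouped as "Other".
counts = df["Country"].value_counts()
rare = counts[counts < 2].index
df["Country"] = df["Country"].where(~df["Country"].isin(rare), "Other")

# One-hot encoding: one indicator column per remaining category.
encoded = pd.get_dummies(df, columns=["Country"], prefix="C")

# Binarizing: flag salaries above an (assumed) threshold of 50,000.
encoded["High_Salary"] = (encoded["Salary"] > 50000).astype(int)

# Binning: bucket the continuous salary into three labeled ranges.
encoded["Salary_Bin"] = pd.cut(
    encoded["Salary"],
    bins=[0, 50000, 100000, np.inf],
    labels=["low", "medium", "high"],
)

print(encoded)
```

Grouping rare categories before encoding keeps the number of indicator columns small, which matters when a categorical column has a long tail of values.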

2

Dealing with Messy Data

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore several approaches to handling them. You will also use string manipulation techniques to remove unwanted characters from your dataset.

Why do missing values exist?

How sparse is my data?

Finding the missing values

Dealing with missing values (I)

Listwise deletion

Replacing missing values with constants

Dealing with missing values (II)

Filling continuous missing values

Imputing values in predictive models

Dealing with other data issues

Dealing with stray characters (I)

Dealing with stray characters (II)

Method chaining
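
As a rough sketch of the techniques listed in this chapter, the example below finds missing values, applies listwise deletion, fills gaps, and strips stray characters with a method chain. The data and column names (`Gender`, `Years`, `RawSalary`) are hypothetical, and the fill values are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data; column names and values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Years": [3.0, 4.0, np.nan, 5.0],
    "RawSalary": ["$50,000.00", "$62,500.00", "$71,000.00", "£40,000.00"],
})

# Finding the missing values: count nulls per column.
missing_per_column = df.isnull().sum()

# Listwise deletion: keep only rows with no missing values at all.
complete_rows = df.dropna(how="any")

# Replacing missing categorical values with a constant.
df["Gender"] = df["Gender"].fillna("Not Given")

# Filling continuous missing values with the column mean.
df["Years"] = df["Years"].fillna(df["Years"].mean())

# Stray characters: chain string replacements, then convert to float.
df["Salary"] = (
    df["RawSalary"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .str.replace("£", "", regex=False)
    .astype(float)
)

print(missing_per_column)
print(df)
```

In a predictive setting, the fill value (here the mean of `Years`) would normally be computed on the training data only and then reused on the test data.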

3

Conforming to Statistical Assumptions

In this chapter, you will analyze the underlying distribution of your data and determine whether it will affect your machine learning pipeline. You will learn how to deal with skewed data and with situations where outliers may be negatively impacting your analysis.

Data distributions

What does your data look like? (I)

What does your data look like? (II)

When don't you have to transform your data?

Scaling and transformations

Normalization

Standardization

Log transformation

When can you use normalization?

Removing outliers

Percentage based outlier removal

Statistical outlier removal

Scaling and transforming new data

Train and testing transformations (I)

Train and testing transformations (II)
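
The following sketch ties several of these lessons together, assuming scikit-learn's `MinMaxScaler` and `StandardScaler` are the tools used for normalization and standardization. The salary values and the 95th-percentile cutoff are invented for illustration. The key point is that the cutoff and the scalers are derived from the training data only and then applied to the test data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical train/test split of a skewed salary column.
train = pd.DataFrame({"Salary": [30000.0, 45000.0, 52000.0, 60000.0, 95000.0, 400000.0]})
test = pd.DataFrame({"Salary": [48000.0, 150000.0]})

# Percentage-based outlier removal: drop training rows above the 95th percentile.
cutoff = train["Salary"].quantile(0.95)
trimmed = train[train["Salary"] < cutoff]

# Fit the scalers on the (trimmed) training data only.
norm_scaler = MinMaxScaler().fit(trimmed[["Salary"]])
std_scaler = StandardScaler().fit(trimmed[["Salary"]])

# Normalization: rescale to the [0, 1] range seen in training.
train_norm = norm_scaler.transform(trimmed[["Salary"]])
test_norm = norm_scaler.transform(test[["Salary"]])

# Standardization: center on the training mean, scale by the training std.
train_standardized = std_scaler.transform(trimmed[["Salary"]])
test_standardized = std_scaler.transform(test[["Salary"]])

# Log transformation: compress the long right tail of a skewed column.
train_log = np.log(trimmed["Salary"])

print(test_norm.ravel())
```

Because the test set is scaled with training statistics, a test value larger than anything seen in training can land outside the [0, 1] range after normalization. That is expected and avoids leaking test information into the pipeline.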

4

Dealing with Text Data

Finally, in this chapter, you will work with unstructured text data and learn ways to engineer columnar features from a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against creating too many features.

Encoding text

Cleaning up your text

High level text features

Word counts

Counting words (I)

Counting words (II)

Limiting your features

Text to DataFrame

Term frequency-inverse document frequency

Inspecting Tf-idf values

Transforming unseen data

Using longer n-grams

Finding the most common words
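
A minimal sketch of these text lessons follows, assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer` are the vectorizers used. The sentences are invented stand-ins for speech text, and the helper `clean` is a hypothetical name introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sentences standing in for a text corpus.
train_text = pd.Series([
    "Fellow citizens, we begin a new chapter!",
    "We the people will renew our nation.",
    "Our nation faces a new challenge; we will meet it.",
])
unseen_text = pd.Series(["A new nation, a new people."])


def clean(text: pd.Series) -> pd.Series:
    """Replace non-letter characters with spaces and lowercase the text."""
    return text.str.replace("[^a-zA-Z]", " ", regex=True).str.lower()


train_clean = clean(train_text)

# High-level text features: character and word counts per document.
char_count = train_clean.str.len()
word_count = train_clean.str.split().str.len()

# Word counts, limited to the 5 most frequent terms to cap the feature count.
cv = CountVectorizer(max_features=5)
counts = cv.fit_transform(train_clean)

# Tf-idf with unigrams and bigrams; longer n-grams add context but more columns.
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_train = tfidf.fit_transform(train_clean)

# Unseen data: reuse the fitted vocabulary with transform, never refit.
tfidf_unseen = tfidf.transform(clean(unseen_text))

print(counts.shape, tfidf_train.shape, tfidf_unseen.shape)
```

Calling `transform` rather than `fit_transform` on new text guarantees that the unseen data gets the same columns as the training data, so a model trained on one can score the other.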

Don’t just take our word for it

4.8 from 996 reviews

Rating breakdown: 84%, 14%, 2%, 0%, 0%

FAQs

What types of features will I learn to engineer in this course?

You will create features from categorical columns, continuous variables, and unstructured text data, covering the full spectrum of feature types found in real-world machine learning projects.

What datasets are used for hands-on practice?

You will work with the Stack Overflow Developer Survey for structured feature engineering and historic US presidential inauguration addresses for text-based feature creation.

How does this course handle missing data?

Chapter 2 teaches you to locate missing values and explore multiple imputation and removal approaches, along with string manipulation techniques for cleaning messy columns.

Does the course cover statistical assumptions for features?

Yes. Chapter 3 focuses on analyzing data distributions, dealing with skewed data, and handling outliers that could negatively impact your machine learning models.

What text feature engineering techniques are included?

You will learn multiple approaches for extracting columnar features from text corpora, comparing how each method balances context richness against the number of features generated.
