Feature Engineering with PySpark Course

Name: Feature Engineering with PySpark
Rating: 4.811846689895471 (287 reviews)

Skill Level: Advanced

Updated 01/2026

Learn the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering.

Course Description

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, the data must be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

Prerequisites

Supervised Learning with scikit-learn

Introduction to PySpark

Exploratory Data Analysis

Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!

Where to Begin

50 XP

Where to begin?

50 XP

Check Version

100 XP

Load in the data

100 XP

Defining A Problem

50 XP

What are we predicting?

100 XP

Verifying Data Load

100 XP

Verifying DataTypes

100 XP

Visually Inspecting Data / EDA

50 XP

Using Corr()

100 XP

Using Visualizations: distplot

100 XP

Using Visualizations: lmplot

100 XP
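
The statistical inspection this chapter mentions often starts with a correlation check between a candidate feature and the target. Below is a minimal plain-Python sketch of a Pearson correlation, not Spark code; the column names `sqft` and `price` and their values are hypothetical and are not taken from the course dataset. On a real Spark DataFrame the calculation would be delegated to Spark rather than done in local Python.

```python
from math import sqrt


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two columns around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Hypothetical values, for illustration only.
sqft = [850, 1200, 1500, 2100, 2600]
price = [155000, 210000, 250000, 345000, 410000]

corr = pearson_corr(sqft, price)
print(f"corr(sqft, price) = {corr:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship worth inspecting visually; a value near 0 only rules out a linear one.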

Wrangling with Spark Functions

Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.

Dropping data

50 XP

Dropping a list of columns

100 XP

Using text filters to remove records

100 XP

Filtering numeric fields conditionally

100 XP

Adjusting Data

50 XP

Custom Percentage Scaling

100 XP

Scaling your scalers

100 XP

Correcting Right Skew Data

100 XP

Working with Missing Data

50 XP

Visualizing Missing Data

100 XP

Imputing Missing Data

100 XP

Calculate Missing Percents

100 XP

Getting More Data

50 XP

A Dangerous Join

100 XP

Spark SQL Join

100 XP

Checking for Bad Joins

100 XP
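
Several of the wrangling steps listed above (calculating missing percentages, dropping data, imputing missing values) can be sketched in a few lines. This is a plain-Python illustration of the logic on a list of dictionaries, not Spark code; the column names, the 50% threshold, and the choice of mean imputation are assumptions made for the example.

```python
def missing_percent(rows, column):
    """Percentage (0-100) of rows where `column` is None."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


def drop_sparse_columns(rows, threshold_pct):
    """Drop every column whose missing percentage exceeds threshold_pct."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if missing_percent(rows, c) <= threshold_pct]
    return [{c: row[c] for c in keep} for row in rows]


def impute_mean(rows, column):
    """Replace None in `column` with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    mean = sum(observed) / len(observed)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical records, for illustration only.
rows = [
    {"price": 200000, "lot_size": 0.25, "pool_type": None},
    {"price": 310000, "lot_size": None, "pool_type": None},
    {"price": 150000, "lot_size": 0.15, "pool_type": "inground"},
    {"price": 420000, "lot_size": 0.50, "pool_type": None},
]

cleaned = drop_sparse_columns(rows, threshold_pct=50.0)  # removes pool_type
cleaned = impute_mean(cleaned, "lot_size")               # fills the one gap
print(cleaned)
```

Mean imputation is only one option; whether to drop, impute, or flag missing values depends on why the values are missing.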

Feature Engineering

In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.

Feature Generation

50 XP

Differences

100 XP

Ratios

100 XP

Deeper Features

100 XP

Time Features

50 XP

Time Components

100 XP

Joining On Time Components

100 XP

Date Math

100 XP

Extracting Features

50 XP

Extracting Text to New Features

100 XP

Splitting & Exploding

100 XP

Pivot & Join

100 XP

Binarizing, Bucketing & Encoding

50 XP

Binarizing Day of Week

100 XP

Bucketing

100 XP

One Hot Encoding

100 XP
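
The feature types named in this chapter (ratios, time components, binarizing, one hot encoding) can be illustrated on a single record. This is a plain-Python sketch of the ideas, not Spark code; the field names and category list are hypothetical.

```python
from datetime import date


def add_features(record, categories):
    """Return a copy of `record` with ratio, time, binary, and one-hot features."""
    out = dict(record)

    # Ratio: combine two existing fields into one
    out["price_per_sqft"] = record["price"] / record["sqft"]

    # Time components: pull parts out of a date field
    listed = record["list_date"]
    out["list_year"] = listed.year
    out["list_month"] = listed.month
    out["list_dayofweek"] = listed.isoweekday()  # 1 = Monday ... 7 = Sunday

    # Binarizing: collapse day of week into a weekend flag
    out["listed_on_weekend"] = 1 if listed.isoweekday() >= 6 else 0

    # One hot encoding: one 0/1 column per known category
    for cat in categories:
        out[f"type_{cat}"] = 1 if record["type"] == cat else 0

    return out


# Hypothetical record, for illustration only.
record = {
    "price": 300000,
    "sqft": 1500,
    "list_date": date(2017, 7, 15),
    "type": "condo",
}

features = add_features(record, ["condo", "house", "townhome"])
print(features)
```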

Building a Model

In this chapter, we'll learn how to choose which type of model we want. Then we'll learn how to fit the model to our data and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!

Choosing the Algorithm

50 XP

Which MLlib Module?

50 XP

Creating Time Splits

100 XP

Adjusting Time Features

100 XP

Feature Engineering Assumptions for RFR

50 XP

Feature Engineering For Random Forests

50 XP

Dropping Columns with Low Observations

100 XP

Naively Handling Missing and Categorical Values

100 XP

Building a Model

50 XP

Building a Regression Model

100 XP

Evaluating & Comparing Algorithms

100 XP

Understanding Metrics

50 XP

Interpreting, Saving & Loading

50 XP

Interpreting Results

100 XP

Saving & Loading Models

100 XP

Final Thoughts

50 XP
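
Two ideas from this chapter, time-based splits and evaluation metrics, can be sketched without any modeling library. The example below is plain Python, not Spark code. It splits records chronologically so the test set holds the most recent data, then scores a naive mean-price baseline with RMSE. The records, the 20% test fraction, and the baseline are assumptions for illustration; a trained regression model would take the baseline's place.

```python
from datetime import date
from math import sqrt


def time_split(rows, date_key, test_fraction=0.2):
    """Split rows chronologically; the most recent rows form the test set."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical records, for illustration only.
rows = [
    {"sold": date(2017, 3, 1), "price": 200},
    {"sold": date(2017, 1, 1), "price": 100},
    {"sold": date(2017, 9, 1), "price": 350},
    {"sold": date(2017, 5, 1), "price": 300},
    {"sold": date(2017, 7, 1), "price": 400},
]

train, test = time_split(rows, "sold", test_fraction=0.2)

# Naive baseline: always predict the mean training price.
baseline = sum(r["price"] for r in train) / len(train)
score = rmse([r["price"] for r in test], [baseline] * len(test))
print(f"baseline RMSE = {score}")
```

Splitting by time rather than at random keeps future information out of the training set, which matters when the model will be used to predict later observations.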

Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Don’t just take our word for it

4.8 from 287 reviews

84%

15%

Roy

5 days ago

S.E.

3 weeks ago

Andreas

4 weeks ago

Kristóf

2 months ago

Sun

2 months ago

Mateusz

2 months ago

FAQs

What prior experience do I need with PySpark and machine learning?

You should know PySpark basics, pandas, SQL fundamentals, introductory statistics in Python, and supervised learning with scikit-learn before taking this advanced course.

What feature engineering techniques are covered in this course?

You will learn exploratory data analysis, data wrangling with Spark functions, handling missing values, building machine learning pipelines, and creating features for big data models.

Why use PySpark instead of pandas for feature engineering?

PySpark handles datasets too large to fit in memory on a single machine. This course teaches feature engineering at scale for big data problems that pandas cannot handle efficiently.

Does the course cover building end-to-end ML pipelines in PySpark?

Yes. The final chapter focuses on building machine learning pipelines that combine feature transformations with model training, creating reproducible workflows in PySpark.

How many exercises and how much time should I plan for?

The course has 60 exercises across four chapters, matching the syllabus listed above. Most learners complete it in about four to five hours, reflecting the depth of the material covered.

Join over 19 million learners and start Feature Engineering with PySpark today!