
Feature Engineering with PySpark

Learn the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering.

4 Hours | 16 Videos | 60 Exercises | 9,832 Learners | 5000 XP | Big Data with PySpark Track


Course Description

The real world is messy, and your job is to make sense of it. Toy datasets like MTCars and Iris are the result of careful curation and cleaning; even so, their data needs to be transformed before machine learning algorithms can use it to extract meaning, forecast, classify, or cluster. This course covers the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to cut this Big Data problem down to size!

  1. Exploratory Data Analysis


    Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!

    Where to Begin (50 xp)
    Where to begin? (50 xp)
    Check Version (100 xp)
    Load in the data (100 xp)
    Defining A Problem (50 xp)
    What are we predicting? (100 xp)
    Verifying Data Load (100 xp)
    Verifying DataTypes (100 xp)
    Visually Inspecting Data / EDA (50 xp)
    Using Corr() (100 xp)
    Using Visualizations: distplot (100 xp)
    Using Visualizations: lmplot (100 xp)
  3. Feature Engineering

    In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.


In the following tracks

Big Data with PySpark


Nick Solomon, Adrián Soto

John Hogue

Lead Data Scientist, General Mills

I have a strong drive for innovation and giving back. Through my work I enjoy building out a career path and center of excellence for those in data science at General Mills. I have a passion for taking action and challenging the status quo with fact-based analysis to drive results. Outside of work I enjoy running an organization that gives aspiring and practicing data scientists opportunities to showcase their skills in a meaningful way.

What do other learners have to say?

I've used other sites—Coursera, Udacity, things like that—but DataCamp's been the one that I've stuck with.

Devon Edwards Joseph
Lloyds Banking Group

DataCamp is the top resource I recommend for learning data science.

Louis Maiden
Harvard Business School

DataCamp is by far my favorite website to learn from.

Ronald Bowers
Decision Science Analytics, USAA