Cleaning Data with PySpark

Learn how to clean data with Apache Spark in Python.

4 Hours | 16 Videos | 53 Exercises | 15,426 Learners | 4150 XP | Big Data with PySpark Track | Data Engineer Track

Course Description

Working with data is tricky; working with millions or even billions of rows is worse. Have you been handed data processing code that was written on a laptop against fairly pristine data? Chances are you have been put in charge of moving a basic data process from prototype to production. You may already have worked with real-world datasets that have missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course teaches what you need to prepare data processes using Python with Apache Spark. You’ll learn terminology, methods, and some best practices for building a performant, maintainable, and understandable data processing platform.

  1. DataFrame details

    A review of DataFrame fundamentals and the importance of data cleaning.

    Intro to data cleaning with Apache Spark (50 XP)
    Data cleaning review (50 XP)
    Defining a schema (100 XP)
    Immutability and lazy processing (50 XP)
    Immutability review (50 XP)
    Using lazy processing (100 XP)
    Understanding Parquet (50 XP)
    Saving a DataFrame in Parquet format (100 XP)
    SQL and Parquet (100 XP)

In the following tracks

Big Data with PySpark, Data Engineer


Hadrien Lacroix, Hillary Green-Lerman

Mike Metzger

Data Engineer Consultant @ Flexible Creations

Mike is a consultant focusing on data engineering and analysis using SQL, Python, and Apache Spark, among other technologies. He has more than 20 years of experience working with technologies in the data, networking, and security spaces.
