Cleaning Data with PySpark

Rating: 4.1 (19 reviews) | Level: Advanced

Learn how to clean data with Apache Spark in Python.

4 hours | 16 videos | 53 exercises | 27,483 learners | Statement of Accomplishment

Course Description

Working with data is tricky; working with millions or even billions of rows is worse. Perhaps you received data processing code that was written on a laptop against fairly pristine data, and you have now been put in charge of moving that basic process from prototype to production. Or you may already have worked with real-world datasets, with their missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course teaches what you need to prepare data processes using Python with Apache Spark. You'll learn terminology, methods, and some best practices for creating a performant, maintainable, and understandable data processing platform.
In the following Tracks

Big Data with PySpark

  1. DataFrame details (Free)

    A review of DataFrame fundamentals and the importance of data cleaning.

    Intro to data cleaning with Apache Spark (50 xp)
    Data cleaning review (50 xp)
    Defining a schema (100 xp)
    Immutability and lazy processing (50 xp)
    Immutability review (50 xp)
    Using lazy processing (100 xp)
    Understanding Parquet (50 xp)
    Saving a DataFrame in Parquet format (100 xp)
    SQL and Parquet (100 xp)

Datasets

  • Dallas Council Votes
  • Dallas Council Voters
  • Flights - 2014
  • Flights - 2015
  • Flights - 2016
  • Flights - 2017

Collaborators

  • Hadrien Lacroix
  • Hillary Green-Lerman

Mike Metzger

Data Engineer Consultant @ Flexible Creations

Mike is a consultant focusing on data engineering and analysis using SQL, Python, and Apache Spark among other technologies. He has a 20+ year history of working with various technologies in the data, networking, and security space.

Don’t just take our word for it

4.1 from 19 reviews
Rating breakdown: 53%, 21%, 16%, 11%, 0%
  • Flor S.
    24 days

    Best part for me is the interactive part where you get to apply immediately what was taught in the course through virtual coding.

  • Syed O.
    7 months

    I did learn a lot from the course, and it definitely covered many PySpark features not mentioned in other courses. However, more explanation with examples for the tougher and more complicated topics would have been better.

  • André S.
    9 months

    I learned so much from this course. I really liked the labs too.

  • Douglas L.
    over 1 year

    Very Good Content.

  • Jegan D.
    over 1 year

    Very good course with challenging examples. The only problem is that I found it difficult to submit some of my answers or the solution provided. This happened in two different exercises.