Interactive Course

Cleaning Data with Apache Spark in Python

Learn how to clean data with Apache Spark in Python.

  • 4 hours
  • 16 Videos
  • 53 Exercises
  • 1,069 Participants
  • 4,150 XP

Loved by learners at thousands of top companies:

  • REI
  • T-Mobile
  • Siemens
  • Credit Suisse
  • Forrester
  • AXA

Course Description

Working with data is tricky, and working with millions or even billions of rows is worse. Have you received data processing code that was written on a laptop against fairly pristine data? Chances are you’ve been put in charge of moving a basic data process from prototype to production. You may already have worked with real-world datasets that have missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course teaches what you need to prepare data processes using Python with Apache Spark. You’ll learn terminology, methods, and some best practices for creating a performant, maintainable, and understandable data processing platform.

  1. DataFrame details (Free)

    A review of DataFrame fundamentals and the importance of data cleaning.

  2. Manipulating DataFrames in the real world

    A look at various techniques to modify the contents of DataFrames in Spark.

  3. Improving Performance

    Improve data cleaning tasks by increasing performance or reducing resource requirements.

  4. Complex processing and data pipelines

    Learn how to process complex real-world data using Spark, and learn the basics of pipelines.

What do other learners have to say?

“I've used other sites, but DataCamp's been the one that I've stuck with.”

Devon Edwards Joseph

Lloyds Banking Group

“DataCamp is the top resource I recommend for learning data science.”

Louis Maiden

Harvard Business School

“DataCamp is by far my favorite website to learn from.”

Ronald Bowers

Decision Science Analytics @ USAA

Mike Metzger

Data Engineer Consultant @ Flexible Creations

Mike is a consultant focusing on data engineering and analysis using SQL, Python, and Apache Spark among other technologies. He has a 20+ year history of working with various technologies in the data, networking, and security space.
