Live training

Cleaning Data in PySpark

Join us for this live, hands-on training where you will learn how to use Python and Apache Spark to clean data. We'll work through a dataset with many of the common issues you are likely to encounter while preparing data for further processing or analysis. This includes handling malformed and missing data, using transformations, and a bit about validating your datasets. The session runs for three hours, giving you time to gain experience with Spark and data cleaning, and includes short breaks and Q&A throughout.

Wednesday 17 June, 2 PM EDT, 7 PM BST

What will I learn?

You will learn:
  • How to efficiently load data into a PySpark DataFrame.
  • How to remove errant rows and columns of data, including comments, missing data, and combined or misinterpreted rows.
  • How to use user-defined functions (UDFs) to run advanced transformations on data.
  • How the Spark and pandas data processing models differ.

What should I prepare?

Bring your questions about processing large amounts of data, and a machine running a recent version of a web browser.

Who should attend?

This course is open to all DataCamp Premium learners who want to use Spark and Python to chew through and clean huge datasets. We recommend that you take the following courses before attending:

  • Introduction to PySpark
  • Data cleaning with PySpark

Presenter Bio

Mike Metzger

Data Engineer Consultant @ Flexible Creations

Mike is a consultant focusing on data engineering and analysis using SQL, Python, and Apache Spark, among other technologies. He has more than 20 years of experience working with various technologies in the data, networking, and security space. He is also a Jiu Jitsu purple belt and has between 4 and 10 cows at any given time.