Learn how to clean data with Apache Spark in Python.
Working with data is tricky - working with millions or even billions of rows is worse. Did you receive some data processing code written on a laptop with fairly pristine data? Chances are you've been put in charge of moving a basic data process from prototype to production. You may already have worked with real-world datasets that have missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course teaches you what you need to prepare data processes using Python with Apache Spark. You'll learn terminology, methods, and some best practices for building a performant, maintainable, and understandable data processing platform.
A review of DataFrame fundamentals and the importance of data cleaning.
Improve data cleaning tasks by increasing performance or reducing resource requirements.
A look at various techniques to modify the contents of DataFrames in Spark.
Learn how to process complex real-world data using Spark and the basics of pipelines.