New Track: Big Data with PySpark

Joyce Chiu, November 4, 2019
Become proficient in Apache Spark in Python with our first data engineering skill track.

Why learn Apache Spark?

Apache Spark is a fast, easy-to-use, general-purpose engine for processing big data, with built-in modules for streaming, SQL, machine learning (ML), and graph processing. Because it is built for big data, it’s especially useful for data engineers who work with large datasets. Data scientists can also use it for exploratory data analysis, feature extraction, and ML. Start our Big Data with PySpark track now.

What you’ll learn

This track contains the following courses:

Introduction to PySpark: Learn how to use PySpark, the Python API for Spark, for parallel computation with large datasets, and get ready for high-performance machine learning.

Big Data Fundamentals via PySpark: Big data has been a buzzword for many years—discover how PySpark applies to big data analysis.

Cleaning Data with PySpark: Discover what’s needed to build data preparation processes using Python with Apache Spark, including the key terminology, methods, and best practices.

Feature Engineering with PySpark: Working with large datasets involves a lot of time spent on data wrangling and feature engineering—PySpark can help with that!

Machine Learning with PySpark: Learn how to get data into Spark and then delve into three machine learning fundamentals: linear regression, logistic regression, and pipeline creation.

Building Recommendation Engines in PySpark: Learn how to build recommendation engines using Alternating Least Squares in PySpark.

Ready to get started? Start our Big Data with PySpark track now. For additional reading, check out our tutorial on machine learning with PySpark.