Philip Tezaur has completed

Introduction to PySpark

4 hr

2,850 XP

Loved by learners at thousands of companies

Course Description

This course is perfect for data engineers, data scientists, and machine learning practitioners looking to work with large datasets efficiently. Whether you're transitioning from tools like pandas or diving into big data technologies for the first time, this course offers a solid introduction to PySpark and distributed data processing.

Why Spark? Why Now?

Discover the speed and scalability of Apache Spark, the powerful framework designed for handling big data. Through interactive lessons and hands-on exercises, you'll see how Spark's in-memory processing gives it an edge over disk-based frameworks like Hadoop MapReduce. You'll start by setting up Spark sessions, then dive into core components like Resilient Distributed Datasets (RDDs) and DataFrames. Learn to filter, group, and join datasets with ease while working on real-world examples.

Boost Your Python and SQL Skills for Big Data

Learn how to harness PySpark SQL for querying and managing data using familiar SQL syntax. Tackle schemas, complex data types, and user-defined functions (UDFs), all while building skills in caching and optimizing performance for distributed systems.

Build Your Big Data Foundations

By the end of this course, you'll have the confidence to handle, query, and process big data using PySpark. With these foundational skills, you'll be ready to explore advanced topics like machine learning and big data analytics.

For Business

Training 2 or more people?

Get your team access to the full DataCamp platform, including all the features.

1
Introduction to Apache Spark and PySpark
Free
A general introduction to PySpark and distributed computing. This section introduces PySpark, PySpark DataFrames, and RDDs.
Introduction to PySpark
50 xp
Creating a SparkSession
100 xp
Loading census data
100 xp
Introduction to PySpark DataFrames
50 xp
Scalability and performance
50 xp
Reading a CSV and performing aggregations
100 xp
Filtering by company
100 xp
More on Spark DataFrames
50 xp
Infer and filter
100 xp
Schema writeout
100 xp
2
PySpark in Python
A continuation of DataFrames and complex data types. This section expands on what DataFrames offer in PySpark and introduces some Spark SQL concepts.
Data manipulation with DataFrames
50 xp
Handling missing data with fill and drop
100 xp
Column operations - creating and renaming columns
100 xp
Advanced DataFrame operations
50 xp
DataFrame combinations
50 xp
Joining flights with their destination airports
100 xp
U define it? U use it!
50 xp
UDF defined
50 xp
Integers in PySpark UDFs
100 xp
Pandas UDFs
100 xp
3
Introduction to PySpark SQL
Use Spark SQL and PySpark for scalable data processing, combining SQL's simplicity with PySpark's distributed computing power to handle large datasets efficiently.
Resilient distributed datasets in PySpark
50 xp
Creating RDDs
100 xp
Collecting RDDs
100 xp
Intro to Spark SQL
50 xp
Querying on a temp view
100 xp
Running SQL on DataFrames
100 xp
Analytics with SQL on DataFrames
100 xp
PySpark aggregations
50 xp
Aggregating in PySpark
100 xp
Aggregating in RDDs
100 xp
Complex Aggregations
100 xp
PySpark at scale
50 xp
Broadcasting
50 xp
Bringing it all together I
100 xp
Bringing it all together II
100 xp
What have we learned?
50 xp

datasets

Transportation

Salaries

Adults

Course Glossary

collaborators

George Boorman

Arne Warnke

Katerina Zahradova

prerequisites

Introduction to SQL

Data Manipulation with pandas

Ben Schmidt

Data Engineer

Working across several data disciplines, Benjamin is enthusiastic about solving problems with data and then teaching others about the tools involved. He holds a Master's in Data Science and is a passionate advocate for lifelong learning.
