
Course Description
This course is perfect for data engineers, data scientists, and machine learning practitioners looking to work with large datasets efficiently. Whether you're transitioning from tools like Pandas or diving into big data technologies for the first time, this course offers a solid introduction to PySpark and distributed data processing.
Why Spark? Why Now?
Discover the speed and scalability of Apache Spark, the powerful framework designed for handling big data. Through interactive lessons and hands-on exercises, you'll see how Spark's in-memory processing gives it an edge over traditional frameworks like Hadoop. You'll start by setting up Spark sessions and dive into core components like Resilient Distributed Datasets (RDDs) and DataFrames. Learn to filter, group, and join datasets with ease while working on real-world examples.
Boost Your Python and SQL Skills for Big Data
Learn how to harness PySpark SQL for querying and managing data using familiar SQL syntax. Tackle schemas, complex data types, and user-defined functions (UDFs), all while building skills in caching and optimizing performance for distributed systems.
Build Your Big Data Foundations
By the end of this course, you'll have the confidence to handle, query, and process big data using PySpark. With these foundational skills, you'll be ready to explore advanced topics like machine learning and big data analytics.
Training 2 or more people?
Get your team access to the full DataCamp platform, including all the features.
- 1. Introduction to Apache Spark and PySpark (Free)
  A general introduction to PySpark and distributed computing. This section introduces PySpark, PySpark DataFrames, and RDDs.
- 2. PySpark in Python
  A continuation of DataFrames and complex data types. This section expands on what DataFrames offer in PySpark and introduces some Spark SQL concepts.
  - Data manipulation with DataFrames (50 xp)
  - Handling missing data with fill and drop (100 xp)
  - Column operations - creating and renaming columns (100 xp)
  - Advanced DataFrame operations (50 xp)
  - DataFrame combinations (50 xp)
  - Joining flights with their destination airports (100 xp)
  - U define it? U use it! (50 xp)
  - UDF defined (50 xp)
  - Integers in PySpark UDFs (100 xp)
  - Pandas UDFs (100 xp)
- 3. Introduction to PySpark SQL
  Delve into leveraging Spark SQL and PySpark for scalable data processing, combining SQL's simplicity with PySpark's distributed computing power to handle large datasets efficiently.
  - Resilient distributed datasets in PySpark (50 xp)
  - Creating RDDs (100 xp)
  - Collecting RDDs (100 xp)
  - Intro to Spark SQL (50 xp)
  - Querying on a temp view (100 xp)
  - Running SQL on DataFrames (100 xp)
  - Analytics with SQL on DataFrames (100 xp)
  - PySpark aggregations (50 xp)
  - Aggregating in PySpark (100 xp)
  - Aggregating in RDDs (100 xp)
  - Complex Aggregations (100 xp)
  - PySpark at scale (50 xp)
  - Broadcasting (50 xp)
  - Bringing it all together I (100 xp)
  - Bringing it all together II (100 xp)
  - What have we learned? (50 xp)
Collaborators


Ben Schmidt
Data Engineer
A data professional with experience across several disciplines, Benjamin is enthusiastic about solving problems with data and then teaching others about the tools involved. He holds a Master's in Data Science and is a passionate advocate for lifelong learning.
Join over 17 million learners and start Introduction to PySpark today!