
Course

Big Data Fundamentals with PySpark

Skill Level: Advanced
4.7+
204 reviews
Updated 02/2025
Learn the fundamentals of working with big data using PySpark.
Spark, Data Engineering
4 hr
16 videos
55 Exercises
4,600 XP
64,686
Statement of Accomplishment

Course Description

There has been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data through PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and can run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You'll use PySpark, the Python package for Spark programming, along with its higher-level libraries such as SparkSQL (for structured data queries) and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. By the end of this course, you will have a working understanding of PySpark and how to apply it to general Big Data analysis.

Prerequisites

Introduction to Python
1

Introduction to Big Data analysis with Spark

This chapter introduces the exciting world of Big Data, along with the main concepts and the different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for Big Data.
2

Programming in PySpark RDD’s

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how to create them and compute with them using RDD transformations and actions.
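
The split between transformations and actions can be illustrated without a cluster. The sketch below is a plain-Python analogy, not the PySpark API: `TinyRDD`, `square`, and `calls` are hypothetical names invented for this example. It assumes only the general behavior that transformations such as `map` and `filter` describe work lazily, while actions such as `collect` and `count` trigger the computation.

```python
from typing import Callable, Iterable, Iterator, List


class TinyRDD:
    """Hypothetical stand-in for an RDD: it stores a recipe, not results."""

    def __init__(self, source: Callable[[], Iterator]):
        # `source` builds a fresh iterator each time an action runs.
        self._source = source

    @classmethod
    def parallelize(cls, data: Iterable) -> "TinyRDD":
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: return a new TinyRDD and compute nothing yet.
    def map(self, fn: Callable) -> "TinyRDD":
        return TinyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "TinyRDD":
        return TinyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return a plain Python value.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = TinyRDD.parallelize([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has been computed at this point
result = even_squares.collect()   # the action triggers the computation
```

In PySpark itself, a comparable chain would be written against a `SparkContext` (conventionally named `sc`), for example `sc.parallelize([1, 2, 3, 4, 5]).map(...).filter(...).collect()`.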
4

Machine Learning with PySpark MLlib

Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Don’t just take our word for it

4.7 from 204 reviews
5 stars: 77%
4 stars: 19%
3 stars: 2%
2 stars: 1%
1 star: 0%

FAQs

Do I need prior Big Data experience for this course?

No. This is a beginner-level course. You only need basic Python knowledge, and the course will introduce Big Data concepts and Spark from the ground up.

What PySpark libraries does this course cover?

You will use PySpark core for RDD programming, SparkSQL for structured data queries, and MLlib for basic machine learning tasks.

What datasets are used in the exercises?

You will analyze works of William Shakespeare, explore FIFA 2018 data, and perform clustering on genomic datasets.

What jobs use PySpark skills?

Data engineers, big data developers, and machine learning engineers use PySpark to process and analyze large-scale datasets that do not fit in the memory of a single machine.

How is the course structured?

The course has 4 chapters and 55 exercises covering Big Data fundamentals, RDD programming, SparkSQL, and machine learning with MLlib.
