Course

Big Data Fundamentals with PySpark

Skill Level: Advanced

4.7+

Updated 02/2025

Learn the fundamentals of working with big data with PySpark.

Spark, Data Engineering

4 hr

16 videos

55 Exercises

4,600 XP

65,546

Statement of Accomplishment

Course Description

There's been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You'll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. By the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

Prerequisites

Introduction to Python

1

Introduction to Big Data analysis with Spark

This chapter introduces the exciting world of Big Data, along with the main concepts and frameworks for processing it. You will understand why Apache Spark is considered the best framework for Big Data.

What is Big Data?

The 3 V's of Big Data

PySpark: Spark with Python

Understanding SparkContext

Interactive Use of PySpark

Loading data in PySpark shell

Review of functional programming in Python

Use of lambda() with map()

Use of lambda() with filter()
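
The functional-programming review in this chapter can be illustrated with a minimal sketch in plain Python. The list `numbers` is hypothetical sample data, not taken from the course; the point is only that `map()` transforms every element and `filter()` keeps the elements that pass a test, the same pattern that RDD transformations follow later.

```python
# Hypothetical sample data, chosen only for illustration.
numbers = [1, 2, 3, 4, 5, 6]

# map() applies the lambda to every element; list() materializes the result.
squares = list(map(lambda x: x ** 2, numbers))

# filter() keeps only the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```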

2

Programming in PySpark RDDs

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how they can be created and then processed using RDD transformations and actions.

Abstracting Data with RDDs

RDDs from Parallelized collections

RDDs from External Datasets

Partitions in your data

Basic RDD Transformations and Actions

Map and Collect

Filter and Count

Pair RDDs in PySpark

ReduceByKey and Collect

SortByKey and Collect

Advanced RDD Actions

CountingByKeys

Create a base RDD and transform it

Remove stop words and reduce the dataset

Print word frequencies

3

PySpark SQL & DataFrames

In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how Spark SQL lets you work with DataFrames in Python.

Abstracting Data with DataFrames

RDD to DataFrame

Loading CSV into DataFrame

Operating on DataFrames in PySpark

Inspecting data in PySpark DataFrame

PySpark DataFrame subsetting and cleaning

Filtering your DataFrame

Interacting with DataFrames using PySpark SQL

Running SQL Queries Programmatically

SQL queries for filtering Table

Data Visualization in PySpark using DataFrames

PySpark DataFrame visualization

Part 1: Create a DataFrame from CSV file

Part 2: SQL Queries on DataFrame

Part 3: Data visualization

4

Machine Learning with PySpark MLlib

PySpark MLlib is the Python interface to Apache Spark's scalable machine learning library, which consists of common learning algorithms and utilities. In this last chapter, you'll learn important machine learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.

Overview of PySpark MLlib

PySpark ML libraries

PySpark MLlib algorithms

Collaborative filtering

Loading MovieLens dataset into RDDs

Model training and predictions

Model evaluation using MSE

Classification

Loading spam and non-spam data

Feature hashing and LabeledPoint

Logistic Regression model training

Loading and parsing the 5000 points data

K-means training

Visualizing clusters

Congratulations!

4.7 from 221 reviews

Rating distribution (highest to lowest): 78%, 18%, 2%, 1%, 0%

FAQs

Do I need prior Big Data experience for this course?

No. Although the course is listed at an Advanced skill level, you only need basic Python knowledge, and the course introduces Big Data concepts and Spark from the ground up.

What PySpark libraries does this course cover?

You will use PySpark core for RDD programming, Spark SQL for structured data queries, and MLlib for basic machine learning tasks.

What datasets are used in the exercises?

You will analyze the works of William Shakespeare, explore FIFA 2018 data, and perform clustering on genomic datasets.

What jobs use PySpark skills?

Data engineers, big data developers, and machine learning engineers use PySpark to process and analyze large-scale datasets that do not fit in the memory of a single machine.

How is the course structured?

The course has 4 chapters and 55 exercises covering Big Data fundamentals, RDD programming, Spark SQL, and machine learning with MLlib.
