Big Data Fundamentals with PySpark

Learn the fundamentals of working with big data with PySpark.

4 Hours, 16 Videos, 55 Exercises

46,243 Learners, Statement of Accomplishment

Course Description

Big Data has drawn a lot of attention over the past few years and is now mainstream at many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning-fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and is advertised as running programs up to 100x faster than Hadoop in memory, or 10x faster on disk. You'll use PySpark, the Python package for Spark programming, and its higher-level libraries such as Spark SQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. By the end of this course, you will have a solid understanding of PySpark and how to apply it to general Big Data analysis.

1
Introduction to Big Data analysis with Spark
Free
This chapter introduces Big Data, along with the main concepts and frameworks for processing it. You will learn why Apache Spark is considered one of the best frameworks for Big Data.
What is Big Data?
50 xp
The 3 V's of Big Data
50 xp
PySpark: Spark with Python
50 xp
Understanding SparkContext
100 xp
Interactive Use of PySpark
100 xp
Loading data in PySpark shell
100 xp
Review of functional programming in Python
50 xp
Use of lambda() with map()
100 xp
Use of lambda() with filter()
100 xp
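The two lambda lessons listed above review standard Python functional tools before they are applied to Spark. A minimal sketch in plain Python follows; the sample values are made up for illustration and do not come from the course exercises.

```python
# Sample data is illustrative only; it is not taken from the course.
numbers = [1, 2, 3, 4, 5, 6]

# map() applies the lambda to every element; list() materializes the lazy result.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps only the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```

The same lambda style carries over to RDD methods of the same names, which is presumably why the chapter reviews it first.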
2
Programming in PySpark RDD’s
The main abstraction Spark provides is the resilient distributed dataset (RDD), the engine's fundamental data type. This chapter introduces RDDs and shows how to create them and operate on them using RDD transformations and actions.
Abstracting Data with RDDs
50 xp
RDDs from Parallelized collections
100 xp
RDDs from External Datasets
100 xp
Partitions in your data
100 xp
Basic RDD Transformations and Actions
50 xp
Map and Collect
100 xp
Filter and Count
100 xp
Pair RDDs in PySpark
50 xp
ReduceBykey and Collect
100 xp
SortByKey and Collect
100 xp
Advanced RDD Actions
50 xp
CountingBykeys
100 xp
Create a base RDD and transform it
100 xp
Remove stop words and reduce the dataset
100 xp
Print word frequencies
100 xp
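The last three exercises above (create a base RDD and transform it, remove stop words, print word frequencies) describe a word-count pipeline. The sketch below mirrors those steps in plain Python rather than PySpark, so it runs without a Spark installation. The sample lines and the stop-word list are invented for illustration, and the mapping of each step to an RDD operation (flatMap, filter, map, reduceByKey) is an assumption about how the exercise is structured, not something the outline states.

```python
from collections import defaultdict

# Hypothetical input; the course itself works with the Complete Shakespeare dataset.
lines = ["to be or not to be", "that is the question"]
stop_words = {"to", "or", "is", "the"}  # illustrative stop-word list

# Step 1 (the role flatMap would play): split each line into words and flatten.
words = [word for line in lines for word in line.split()]

# Step 2 (the role filter would play): drop stop words.
kept = [w for w in words if w not in stop_words]

# Step 3 (the role map would play): pair each word with a count of 1.
pairs = [(w, 1) for w in kept]

# Step 4 (the role reduceByKey would play): add up the counts that share a key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# Step 5: sort by descending frequency, then alphabetically, and print.
word_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for word, n in word_freq:
    print(word, n)
```

On a real cluster the pairs would be spread across partitions and combined in parallel; the local dictionary here only stands in for that behavior.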
3
PySpark SQL & DataFrames
In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how to use DataFrames in Python through Spark SQL.
Abstracting Data with DataFrames
50 xp
RDD to DataFrame
100 xp
Loading CSV into DataFrame
100 xp
Operating on DataFrames in PySpark
50 xp
Inspecting data in PySpark DataFrame
100 xp
PySpark DataFrame subsetting and cleaning
100 xp
Filtering your DataFrame
100 xp
Interacting with DataFrames using PySpark SQL
50 xp
Running SQL Queries Programmatically
100 xp
SQL queries for filtering Table
100 xp
Data Visualization in PySpark using DataFrames
50 xp
PySpark DataFrame visualization
100 xp
Part 1: Create a DataFrame from CSV file
100 xp
Part 2: SQL Queries on DataFrame
100 xp
Part 3: Data visualization
100 xp
4
Machine Learning with PySpark MLlib
PySpark MLlib is the Python API for Apache Spark's scalable machine learning library, which consists of common learning algorithms and utilities. In this last chapter, you'll learn several important machine learning algorithms: you will build a movie recommendation engine and a spam filter, and use k-means clustering.
Overview of PySpark MLlib
50 xp
PySpark ML libraries
50 xp
PySpark MLlib algorithms
100 xp
Collaborative filtering
50 xp
Loading Movie Lens dataset into RDDs
100 xp
Model training and predictions
100 xp
Model evaluation using MSE
100 xp
Classification
50 xp
Loading spam and non-spam data
100 xp
Feature hashing and LabelPoint
100 xp
Logistic Regression model training
100 xp
Clustering
50 xp
Loading and parsing the 5000 points data
100 xp
K-means training
100 xp
Visualizing clusters
100 xp
Congratulations!
50 xp
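The collaborative filtering lessons in this chapter include model evaluation using MSE (mean squared error). As a rough sketch of what that metric computes, the plain Python below averages the squared differences between actual and predicted ratings. The rating pairs are hypothetical, and the variable names are chosen for illustration rather than taken from the course.

```python
# Hypothetical (actual, predicted) rating pairs; not course data.
ratings_and_preds = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]

# Squared error for each pair.
squared_errors = [(actual - predicted) ** 2
                  for actual, predicted in ratings_and_preds]

# Mean squared error: the average of the squared errors.
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 0.5
```

A lower MSE means the predicted ratings sit closer to the actual ones on average.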

In the following tracks

Big Data with PySpark

Datasets

Complete Shakespeare, Movie ratings, 5000 points, FIFA 2018, People, Spam, Ham

Collaborators

Hadrien Lacroix

Chester Ismay

Prerequisites

Introduction to Python

Upendra Kumar Devisetty

Science Analyst at CyVerse

Upendra Kumar Devisetty is a Science Analyst at CyVerse, where he works with biologists, bioinformaticians, programming teams, and other members of the CyVerse team. He also coordinates development across projects and facilitates integration and cross-communication. His current work focuses mainly on integrative analysis of Big Data using high-throughput methods on advanced computing systems. As scientific computing becomes indispensable for Big Data research, he has started building a community to develop and propagate a set of best practices, including continuous testing, version control, virtualization, sharing code through notebooks, and standard data structures.
