Introduction to PySpark

Learn to implement distributed data management and machine learning in Spark using the PySpark package.

4 Hours, 45 Exercises

133,417 Learners, Statement of Accomplishment

Course Description

In this course, you'll learn how to use Spark from Python! Spark is a tool for doing parallel computation with large datasets and it integrates well with Python. PySpark is the Python package that makes the magic happen. You'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle this data and build a whole machine learning pipeline to predict whether or not flights will be delayed. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!

In the following Tracks

Big Data with PySpark

Machine Learning Scientist with Python

1
Getting to know PySpark
Free
In this chapter, you'll learn how Spark manages data and how you can read and write tables from Python.
What is Spark, anyway?
50 xp
Using Spark in Python
50 xp
Examining The SparkContext
100 xp
Using DataFrames
50 xp
Creating a SparkSession
100 xp
Viewing tables
100 xp
Are you query-ious?
100 xp
Pandafy a Spark DataFrame
100 xp
Put some Spark in your data
100 xp
Dropping the middle man
100 xp
2
Manipulating data
In this chapter, you'll learn about the pyspark.sql module, which provides optimized data queries to your Spark session.
Creating columns
100 xp
SQL in a nutshell
50 xp
SQL in a nutshell (2)
50 xp
Filtering Data
100 xp
Selecting
100 xp
Selecting II
100 xp
Aggregating
100 xp
Aggregating II
100 xp
Grouping and Aggregating I
100 xp
Grouping and Aggregating II
100 xp
Joining
50 xp
Joining II
100 xp
3
Getting started with machine learning pipelines
PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter.
Machine Learning Pipelines
50 xp
Join the DataFrames
100 xp
Data types
50 xp
String to integer
100 xp
Create a new column
100 xp
Making a Boolean
100 xp
Strings and factors
50 xp
Carrier
100 xp
Destination
100 xp
Assemble a vector
100 xp
Create the pipeline
100 xp
Test vs. Train
50 xp
Transform the data
100 xp
Split the data
100 xp
4
Model tuning and selection
In this last chapter, you'll apply what you've learned to create a model that predicts which flights will be delayed.
What is logistic regression?
50 xp
Create the modeler
100 xp
Cross validation
50 xp
Create the evaluator
100 xp
Make a grid
100 xp
Make the validator
100 xp
Fit the model(s)
100 xp
Evaluating binary classifiers
50 xp
Evaluate the model
100 xp

Datasets

Airports, Flights, Planes

Collaborators

Colin Ricardo

Prerequisites

Introduction to Python

Lore Dirick

Director of Data Science Education at Flatiron School

Lore is a data scientist with expertise in applied finance. She obtained her PhD in Business Economics and Statistics at KU Leuven, Belgium. During her PhD, she collaborated with several banks on advanced methods for the analysis of credit risk data. Lore formerly worked as a Data Science Curriculum Lead at DataCamp and is now Director of Data Science Education at Flatiron School, a coding school with branches in 8 cities and online programs.

Nick Solomon

Data Scientist

Nick has a degree in mathematics with a concentration in statistics from Reed College. He's worked on many data science projects in the past, doing everything from mapping crime data to developing new kinds of models for social networks. He's currently a data scientist in the New York City area.

FAQs

Is this course suitable for beginners?

Will I receive a certificate at the end of the course?

Who will benefit from this course?

What topics are covered in this course?

Is knowledge of Python necessary for this course?

What kind of useful applications can I create with the skills I learn in this course?

Join over 13 million learners and start Introduction to PySpark today!
