Machine Learning with PySpark

Learn how to make predictions from data with Apache Spark, using decision trees, logistic regression, linear regression, ensembles, and pipelines.

4 Hours, 16 Videos, 56 Exercises

21,800 Learners, Statement of Accomplishment

Course Description

Learn to Use Apache Spark for Machine Learning

Spark is a powerful, general-purpose tool for working with Big Data. It transparently handles the distribution of compute tasks across a cluster, which makes operations fast and lets you focus on the analysis rather than on technical details. In this course you'll learn how to get data into Spark and then delve into three fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and pipelines.

Build and Test Decision Trees

Building your own decision trees is a great way to start exploring machine learning models. You’ll use an algorithm called ‘Recursive Partitioning’: find the predictor in your data that gives the most informative split into two classes, divide the data at that split, and repeat the process on each of the resulting nodes. You can then use your decision tree to make predictions on new data.
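
The recursive partitioning idea can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: it uses Gini impurity as the measure of an "informative" split (the course text does not name a measure), the helper names `gini`, `best_split`, `build_tree` and `predict` are hypothetical, and the tiny flights-style data set is made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the split with
    the lowest weighted impurity, or None if no split separates the rows."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition the data; leaves hold the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) == 0.0 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    feature, threshold, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the class stored there."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy data: (distance_km, departure_hour) -> 1 if delayed, else 0.
rows = [(300, 8), (450, 9), (1200, 18), (1500, 20), (500, 19), (1400, 7)]
labels = [0, 0, 1, 1, 1, 0]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (800, 21)))
```

On this toy data the departure hour separates the two classes cleanly, so the tree needs only one split. Real data rarely splits this neatly, which is why the recursion and a depth limit matter.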

Master Logistic and Linear Regression in PySpark

Logistic and linear regression are essential machine learning techniques, and PySpark supports both. You’ll learn to build and evaluate logistic regression models before moving on to linear regression models, where you’ll refine your predictors down to only the most relevant ones.

By the end of the course, you’ll feel confident in applying your new-found machine learning knowledge, thanks to hands-on tasks and practice data sets found throughout the course.

In the following Tracks

Big Data with PySpark

Machine Learning Scientist with Python

1
Introduction
Free
Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.
Machine Learning & Spark (50 xp)
Characteristics of Spark (50 xp)
Components in a Spark Cluster (50 xp)
Connecting to Spark (50 xp)
Location of Spark master (50 xp)
Creating a SparkSession (100 xp)
Loading Data (50 xp)
Loading flights data (100 xp)
Loading SMS spam data (100 xp)
2
Classification
Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.
Data Preparation (50 xp)
Removing columns and rows (100 xp)
Column manipulation (100 xp)
Categorical columns (100 xp)
Assembling columns (100 xp)
Decision Tree (50 xp)
Train/test split (100 xp)
Build a Decision Tree (100 xp)
Evaluate the Decision Tree (100 xp)
Logistic Regression (50 xp)
Build a Logistic Regression model (100 xp)
Evaluate the Logistic Regression model (100 xp)
Turning Text into Tables (50 xp)
Punctuation, numbers and tokens (100 xp)
Stop words and hashing (100 xp)
Training a spam classifier (100 xp)
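
The text preparation steps named in this chapter (stripping punctuation and numbers, tokenizing, dropping stop words, hashing) can be sketched in plain Python. This is a rough illustration rather than the Spark implementation: the stop word list is a tiny made-up sample, the function names are hypothetical, and `zlib.crc32` stands in for whatever hash function a real library would use.

```python
import re
import zlib

# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "and"}


def tokenize(text):
    """Lowercase the text, replace anything that is not a letter or
    whitespace with a space, then split on whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()


def remove_stop_words(tokens):
    """Drop very common words that carry little signal."""
    return [token for token in tokens if token not in STOP_WORDS]


def hash_features(tokens, num_features=16):
    """Hashing trick: map each token to one of num_features slots and
    count how many tokens land in each slot."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


sms = "WIN a FREE prize!!! Call 0800-123 now."
tokens = remove_stop_words(tokenize(sms))
print(tokens)
print(hash_features(tokens))
```

The hashing step yields a fixed-length numeric row for every message, whatever its length, which is what turns free text into a table a classifier can use. Distinct tokens may share a slot, a trade-off that shrinks as `num_features` grows.
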
3
Regression
Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors, and learn a robust approach to selecting only the most relevant predictors.
One-Hot Encoding (50 xp)
Encoding flight origin (100 xp)
Encoding shirt sizes (50 xp)
Regression (50 xp)
Flight duration model: Just distance (100 xp)
Interpreting the coefficients (100 xp)
Flight duration model: Adding origin airport (100 xp)
Interpreting coefficients (100 xp)
Bucketing & Engineering (50 xp)
Bucketing departure time (100 xp)
Flight duration model: Adding departure time (100 xp)
Regularization (50 xp)
Flight duration model: More features! (100 xp)
Flight duration model: Regularization! (100 xp)
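
Two of the feature engineering ideas in this chapter, one-hot encoding and bucketing, can be sketched in a few lines of plain Python. The sketch makes assumptions of its own: the function names are hypothetical, the airport codes and three-hour departure buckets are made-up examples, and dropping the last category as a reference level is a design choice here, not a statement about how the course exercises are set up.

```python
from bisect import bisect_right


def one_hot(value, categories, drop_last=True):
    """Encode value as 0/1 indicators, one per category. With drop_last,
    the final category is the reference level and encodes as all zeros."""
    levels = categories[:-1] if drop_last else categories
    return [1 if value == level else 0 for level in levels]


def bucket(value, splits):
    """Index of the half-open bucket [splits[i], splits[i + 1]) holding value."""
    if not splits[0] <= value < splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    return bisect_right(splits, value) - 1


origins = ["JFK", "ORD", "SFO"]           # hypothetical origin airports
print(one_hot("ORD", origins))            # indicator for ORD is set
print(one_hot("SFO", origins))            # reference level: all zeros

hour_splits = [0, 3, 6, 9, 12, 15, 18, 21, 24]  # eight three-hour buckets
print(bucket(9.5, hour_splits))           # 09:30 falls in the 09:00-12:00 bucket
```

Dropping one level keeps the indicator columns from being redundant, since the omitted category is implied when all the others are zero. Bucketing lets a linear model learn a separate effect per time window instead of assuming duration changes linearly with the hour of departure.
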
4
Ensembles & Pipelines
Finally, you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. To close, you'll dabble in two types of ensemble model.
Pipeline (50 xp)
Flight duration model: Pipeline stages (100 xp)
Flight duration model: Pipeline model (100 xp)
SMS spam pipeline (100 xp)
Cross-Validation (50 xp)
Cross validating simple flight duration model (100 xp)
Cross validating flight duration model pipeline (100 xp)
Grid Search (50 xp)
Optimizing flights linear regression (100 xp)
Dissecting the best flight duration model (100 xp)
SMS spam optimised (100 xp)
How many models for grid search? (50 xp)
Ensemble (50 xp)
Delayed flights with Gradient-Boosted Trees (100 xp)
Delayed flights with a Random Forest (100 xp)
Evaluating Random Forest (100 xp)
Closing thoughts (50 xp)
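
The question of how many models a grid search trains comes down to simple arithmetic: every combination of parameter values is fitted once per cross-validation fold. The plain Python sketch below shows the counting, assuming an illustrative grid over two regularization parameters and 5 folds. The grid values and the helper names `expand_grid` and `count_fits` are made up for this example.

```python
from itertools import product


def expand_grid(grid):
    """List every combination of parameter values in the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def count_fits(grid, num_folds):
    """Each combination is trained once per cross-validation fold."""
    return len(expand_grid(grid)) * num_folds


# Hypothetical grid: 4 regularization strengths x 3 elastic net mixes.
grid = {
    "regParam": [0.01, 0.1, 1.0, 10.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
}

combinations = expand_grid(grid)
print(len(combinations))               # number of parameter combinations
print(count_fits(grid, num_folds=5))   # model fits during cross-validation
```

Because the count is a product, adding one more parameter with a handful of values multiplies the training cost, so it pays to keep grids small and targeted.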

Datasets

Flights, SMS

Collaborators

Hadrien Lacroix

Mona Khalil

Prerequisites

Introduction to PySpark, Supervised Learning with scikit-learn

Andrew Collier

Data Scientist @ Exegetic Analytics

Andrew Collier is a Data Scientist, working mostly in R and Python but also dabbling in a wide range of other technologies. When not in front of a computer he spends time with his family and runs obsessively.
