Machine Learning with PySpark

Learn how to make predictions with Apache Spark.

4 Hours | 16 Videos | 56 Exercises | 14,296 Learners
4550 XP


Course Description

Spark is a powerful, general-purpose tool for working with Big Data. It transparently handles the distribution of compute tasks across a cluster, which makes operations fast and lets you focus on the analysis rather than on technical details. In this course you'll learn how to get data into Spark and then delve into three fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and building pipelines. Along the way you'll analyse a large dataset of flight delays and a dataset of spam text messages. With this background you'll be ready to harness the power of Spark and apply it to your own Machine Learning projects!

  1. Chapter 1

    Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

    Machine Learning & Spark
    50 xp
    Characteristics of Spark
    50 xp
    Components in a Spark Cluster
    50 xp
    Connecting to Spark
    50 xp
    Location of Spark master
    50 xp
    Creating a SparkSession
    100 xp
    Loading Data
    50 xp
    Loading flights data
    100 xp
    Loading SMS spam data
    100 xp
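
The connection and loading steps listed in this chapter can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the course's own code: the helper names `spark_master_url` and `load_csv`, the application name, the default standalone port 7077, and the choice of `"NA"` as the missing-value marker are all illustrative. Only `SparkSession.builder` and `spark.read.csv` come from PySpark itself.

```python
from typing import Optional


def spark_master_url(host: Optional[str] = None, port: int = 7077,
                     cores: Optional[int] = None) -> str:
    """Build the string that tells Spark where its master is located.

    Hypothetical helper. With no host it returns a local master
    ('local[*]' for all cores, 'local[n]' for n cores); with a host it
    returns a standalone cluster URL of the form 'spark://host:port'.
    """
    if host is None:
        return "local[*]" if cores is None else f"local[{cores}]"
    return f"spark://{host}:{port}"


def load_csv(path: str, master: str = "local[*]", app_name: str = "example"):
    """Create (or reuse) a SparkSession and read a CSV file into a DataFrame."""
    # Imported lazily so spark_master_url can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master(master)
             .appName(app_name)
             .getOrCreate())

    # header=True: the first row holds column names.
    # inferSchema=True: let Spark guess column types from the data.
    # nullValue="NA": assumed marker for missing values in the file.
    return spark.read.csv(path, sep=",", header=True,
                          inferSchema=True, nullValue="NA")
```

A call such as `load_csv("flights.csv", master=spark_master_url())` would then read a local file using all available cores; the file name here is a placeholder, not a path taken from the course.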

In the following tracks

Big Data with PySpark, Machine Learning Scientist




Hadrien Lacroix, Mona Khalil
Andrew Collier

Data Scientist @ Exegetic Analytics

Andrew Collier is a Data Scientist, working mostly in R and Python but also dabbling in a wide range of other technologies. When not in front of a computer he spends time with his family and runs obsessively.
