Interactive Course

Machine Learning with Apache Spark

Learn how to make predictions with Apache Spark.

  • 4 hours
  • 16 Videos
  • 56 Exercises
  • 2,791 Participants
  • 4,550 XP

Loved by learners at thousands of top companies, including PayPal, AXA, Intel, 3M, Siemens, and Whole Foods.

Course Description

Spark is a powerful, general-purpose tool for working with Big Data. Spark transparently handles the distribution of compute tasks across a cluster. This means that operations are fast, and it also lets you focus on the analysis rather than worry about technical details. In this course you'll learn how to get data into Spark and then delve into three fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and pipelines. Along the way you'll analyse large datasets of flight delays and spam text messages. With this background you'll be ready to harness the power of Spark and apply it to your own Machine Learning projects!

  1. Introduction (Free)

    Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

  2. Classification

    Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.

  3. Regression

    Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors, and learn a robust approach to selecting only the most relevant ones.

  4. Ensembles & Pipelines

    Finally you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. To finish, you'll dabble in two types of ensemble model.

What do other learners have to say?

“I've used other sites, but DataCamp's been the one that I've stuck with.”

Devon Edwards Joseph

Lloyds Banking Group

“DataCamp is the top resource I recommend for learning data science.”

Louis Maiden

Harvard Business School

“DataCamp is by far my favorite website to learn from.”

Ronald Bowers

Decision Science Analytics @ USAA

Andrew Collier

Data Scientist @ Exegetic Analytics

Andrew Collier is a Data Scientist, working mostly in R and Python but also dabbling in a wide range of other technologies. When not in front of a computer he spends time with his family and runs obsessively.

Collaborators
  • Hadrien Lacroix
  • Mona Khalil
