Course

Building Recommendation Engines with PySpark

Skill Level: Advanced

4.8+

Updated 04/2026

Learn tools and techniques to leverage your own big data to facilitate positive experiences for your users.

Spark, Machine Learning

4 hr

15 videos

56 Exercises

4,550 XP

14,110

Statement of Accomplishment

Course Description

This course will show you how to build recommendation engines using Alternating Least Squares in PySpark. Using the popular MovieLens dataset and the Million Songs dataset, this course will take you step by step through the intuition of the Alternating Least Squares algorithm as well as the code to train, test and implement ALS models on various types of customer data.

Prerequisites

Supervised Learning with scikit-learn

Introduction to PySpark

1

Recommendations Are Everywhere

This chapter shows how powerful recommendation engines can be and draws the key distinctions between collaborative-filtering and content-based engines, as well as between the implicit and explicit data that recommendation engines can use. You will also learn a powerful way to uncover hidden (latent) features that you may not even know exist in customer datasets.

Why learn how to build recommendation engines?

See the power of a recommendation engine

Power of recommendation engines

Recommendation engine types and data types

Collaborative vs content-based filtering

Collaborative vs content-based filtering part II

Implicit vs explicit data

Ratings data types

Uses for recommendation engines

Alternate uses of recommendation engines

Confirm understanding of latent features

2

How does ALS work?

In this chapter you will review basic concepts of matrix multiplication and matrix factorization, and dive into how the Alternating Least Squares algorithm works and what arguments and hyperparameters it uses to return the best recommendations possible. You will also learn important techniques for properly preparing your data for ALS in Spark.

Overview of matrix multiplication

Matrix multiplication

Matrix multiplication part II

Overview of matrix factorization

Matrix factorization

Non-negative matrix factorization

How ALS alternates to generate predictions

Estimating recommendations

RMSE as ALS alternates

Data preparation for Spark ALS

Correct format and distinct users

Assigning integer IDs to movies

ALS parameters and hyperparameters

Build out an ALS model

Build RMSE evaluator
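
The alternation this chapter describes can be illustrated with a minimal sketch. It is plain Python, not Spark's ALS implementation, and it assumes a single latent feature (rank 1) so that each least-squares step has a simple closed form. The toy ratings, the `reg` value, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy ratings in long (user, item, rating) format.
# Only observed ratings are listed; every other cell is treated as missing.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 2.0),
    (2, 1, 2.0), (2, 2, 1.0),
]
n_users, n_items = 3, 3
reg = 0.01  # assumed regularization strength, for illustration only


def rmse(u, v):
    """Root mean squared error over the observed ratings only."""
    squared = [(r - u[i] * v[j]) ** 2 for i, j, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


def solve_side(fixed, own_index, n_own):
    """Hold one factor vector fixed and solve the regularized least-squares
    problem for each entry of the other vector (closed form for rank 1)."""
    updated = []
    for k in range(n_own):
        num = den = 0.0
        for triple in ratings:
            if triple[own_index] == k:
                f = fixed[triple[1 - own_index]]
                num += triple[2] * f
                den += f * f
        updated.append(num / (den + reg))
    return updated


def als_rank1(n_iters=10):
    u = [1.0] * n_users  # user factors
    v = [1.0] * n_items  # item factors
    history = [rmse(u, v)]
    for _ in range(n_iters):
        u = solve_side(v, own_index=0, n_own=n_users)  # fix items, solve users
        v = solve_side(u, own_index=1, n_own=n_items)  # fix users, solve items
        history.append(rmse(u, v))
    return u, v, history


u, v, history = als_rank1()
print([round(h, 3) for h in history])
```

Printing `history` shows the RMSE on the observed ratings after each full alternation. A real ALS model uses more than one latent feature and solves a small linear system per user and per item instead of a single division.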

3

Recommending Movies

In this chapter you will be introduced to the MovieLens dataset. You will walk through how to assess its suitability for ALS, build a full cross-validated ALS model on it, and learn how to evaluate its performance. This will be the foundation for all subsequent ALS models you build using PySpark.

Introduction to the MovieLens dataset

Viewing the MovieLens Data

Calculate sparsity

The GroupBy and Filter methods

MovieLens Summary Statistics

View Schema

ALS model buildout on MovieLens Data

Create test/train splits and build your ALS model

Tell Spark how to tune your ALS model

Build your cross validation pipeline

Best Model and Best Model Parameters

Model Performance Evaluation

Generate predictions and calculate RMSE

Interpreting the RMSE

Do recommendations make sense
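
Two of the calculations named in this chapter, sparsity and RMSE, are simple enough to sketch in plain Python. This is not the course's PySpark code: the rows, column order, and function names below are assumptions made for illustration.

```python
import math

# Hypothetical (userId, movieId, rating) rows in long format.
ratings = [
    (1, 10, 4.0), (1, 20, 3.5),
    (2, 10, 5.0),
    (3, 20, 2.0), (3, 30, 4.5),
]


def sparsity(rows):
    """Fraction of the user-by-movie matrix that has no rating."""
    users = {user for user, _, _ in rows}
    movies = {movie for _, movie, _ in rows}
    possible = len(users) * len(movies)
    return 1.0 - len(rows) / possible


def rmse(actual, predicted):
    """Root mean squared error between held-out ratings and predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


print(f"{sparsity(ratings):.2%} of the ratings matrix is empty")
print(f"RMSE: {rmse([4.0, 3.0], [3.0, 5.0]):.3f}")
```

An RMSE is read in the units of the rating scale: it is roughly the typical distance between a predicted rating and the actual one.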

4

What if you don't have customer ratings?

In most real-life situations, you won't have "perfect" customer data available to build an ALS model. This chapter will teach you how to use your customer behavior data to "infer" customer ratings and use those inferred ratings to build an ALS recommendation engine. Using the Million Songs Dataset as well as another version of the MovieLens dataset, it will show you how to build a recommendation engine with the data available to you and evaluate its performance.

Introduction to the Million Songs Dataset

Confirm understanding of implicit rating concepts

MSD summary statistics

Grouped summary statistics

Evaluating implicit ratings models

Specify ALS hyperparameters

Build implicit models

Running a cross-validated implicit ALS model

Extracting parameters

Overview of binary, implicit ratings

Binary model performance

Recommendations from binary data

Course recap
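
One way to picture the "inferred" ratings this chapter describes is to count behavior events per user and item, then optionally collapse the counts to binary values. The sketch below is plain Python under assumed data: the listening log, the names, and the choice to fill unseen pairs with 0 are illustrative assumptions, not the course's exact procedure.

```python
from collections import Counter

# Hypothetical listening log: one (user, song) pair per play event.
plays = [
    ("ana", "song_a"), ("ana", "song_a"), ("ana", "song_b"),
    ("ben", "song_b"), ("ben", "song_c"), ("ben", "song_c"), ("ben", "song_c"),
]


def implicit_ratings(events):
    """Use the number of plays as the inferred rating and fill in 0 for
    every user/song pair that never occurred in the log."""
    counts = Counter(events)
    users = sorted({user for user, _ in events})
    songs = sorted({song for _, song in events})
    return [(u, s, counts.get((u, s), 0)) for u in users for s in songs]


def binarize(rows):
    """Collapse counts to 1 (interacted) or 0 (did not interact)."""
    return [(u, s, 1 if n > 0 else 0) for u, s, n in rows]


count_ratings = implicit_ratings(plays)
binary_ratings = binarize(count_ratings)
for row in count_ratings:
    print(row)
```

Count-based rows keep the strength of the signal (three plays versus one), while binary rows record only whether an interaction happened.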

4.8 from 233 reviews (rating distribution: 86%, 12%, 1%, 0%, 0%)

FAQs

What recommendation algorithm does this PySpark course focus on?

The course focuses on the Alternating Least Squares (ALS) algorithm for collaborative filtering, covering its mathematical foundation, hyperparameters, and implementation in PySpark.

What datasets are used for building recommendation engines?

You will work with the MovieLens dataset to build and evaluate a cross-validated ALS model, and the Million Songs dataset to practice with implicit feedback data.

Does the course cover recommendations when explicit ratings are not available?

Yes. The final chapter teaches you how to infer ratings from customer behavior data and build ALS recommendation engines using implicit feedback.

What PySpark and Python prerequisites should I have?

You need experience with pandas, Intermediate Python, Introduction to PySpark, basic SQL, and supervised learning with scikit-learn. This is an advanced-level course.

What is matrix factorization and why does it matter for recommendations?

Matrix factorization decomposes a large user-item matrix into smaller matrices to uncover latent features. It is the mathematical core of ALS and helps predict missing ratings.
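
As a small illustration of that answer, the sketch below multiplies two hypothetical factor matrices back together. The values and the two latent features are invented for the example. Every cell of the product is a predicted rating, including cells that were blank in the original user-item matrix.

```python
# Hypothetical factor matrices with two latent features.
# Each row of user_factors is a user; each column of item_factors is an item.
user_factors = [
    [2.0, 0.5],
    [0.5, 2.0],
]
item_factors = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [sum(row[t] * b[t][j] for t in range(k)) for j in range(len(b[0]))]
        for row in a
    ]


predicted = matmul(user_factors, item_factors)
for row in predicted:
    print(row)
```

Here the first user loads heavily on the first latent feature, so the predicted rating is highest for the item that also loads on that feature.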
