
Building a Recommender System in R

Welcome to this code-along, where we will build a recommender system that recommends movies to users! You'll learn how to prepare your data, explore it, and create the recommender system itself using recommenderlab. There will be time to answer questions, so please add them!

recommenderlab is an R package that provides a framework for developing and testing recommender algorithms. It supports various algorithms, including user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), association rule-based recommenders (AR), and many more. It was developed in 2016 by Michael Hahsler. Details can be found via this link.

Load packages

library(aws.s3)
library(tidyverse)
library(qdapTools)
library(recommenderlab)

The dataset

Acknowledgements

The datasets are collected by MovieLens, a research site run by GroupLens Research at the University of Minnesota. MovieLens uses "collaborative filtering" technology to make recommendations of movies that you might enjoy and to help you avoid the ones that you won't.

There are several datasets for different purposes. For this demo, we use the full dataset from this webpage, containing user ratings and tags for 62,000 movies. The dataset was updated in 2018.

The dataset for this demo is split into three CSV files: movies, ratings, and tags. All three can be joined on the movieId key.

Data Dictionary

movies

variable | class | description
movieId | numeric | The unique id of the movie
title | character | The title of the movie
genres | character | The genres the movie can be categorized in

ratings

Table with movies and their average rating. Movies that received fewer than two ratings were removed.

variable | class | description
movieId | numeric | The unique id of the movie
avg_rating | numeric | Average rating received

tags

variable | class | description
userId | numeric | A unique identifier for the user who applied the tag
movieId | numeric | The unique id of the movie
tag | character | The tag that was given to the movie
timestamp | numeric | The timestamp at which the user applied the tag

Load your Data

# ratings = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/ratings_by_movie.csv")
# movies = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/movies.csv")
# tags = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/tags.csv")

Data preprocessing

Create two tables:

  • Matrix for recommender model (only numeric values)
  • Cleaned, fully joined table with all data for final output

1. Split genres into one column per genre for each movie, keeping only numeric values
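A minimal sketch of this step in base R, on a toy stand-in for the movies table. It assumes the genres column holds pipe-separated strings such as "Adventure|Comedy", which is not stated above, and the names genre_list, genre_flags and movies_numeric are illustrative. Since qdapTools is loaded, its mtabulate() function applied to the split list would likely be a shorter route to the same indicator table.

```r
# Toy stand-in for the movies table (values are illustrative, not from the dataset)
movies <- data.frame(
  movieId = c(1, 2, 3),
  title = c("Movie A", "Movie B", "Movie C"),
  genres = c("Adventure|Comedy", "Comedy", "Drama|Adventure"),
  stringsAsFactors = FALSE
)

# Assumption: genres are separated by a literal "|" character
genre_list <- strsplit(movies$genres, "|", fixed = TRUE)

# All distinct genres, used as column names
all_genres <- sort(unique(unlist(genre_list)))

# One 0/1 indicator row per movie, one column per genre
genre_flags <- do.call(
  rbind,
  lapply(genre_list, function(g) as.integer(all_genres %in% g))
)
colnames(genre_flags) <- all_genres

# Keep only numeric columns: the movie id plus the genre indicators
# (check.names = FALSE preserves genre names that are not valid R names)
movies_numeric <- data.frame(
  movieId = movies$movieId,
  genre_flags,
  check.names = FALSE
)
print(movies_numeric)
```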

2. Add the average rating to movies and filter out movies without a rating
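One way to do this, sketched on toy data, is an inner join on movieId: movies with no matching row in the ratings table drop out automatically. The data values and the name movies_rated are illustrative. With the tidyverse loaded, dplyr::inner_join(movies, ratings, by = "movieId") should give the same rows.

```r
# Toy stand-ins for the movies and ratings tables (illustrative values)
movies <- data.frame(
  movieId = c(1, 2, 3),
  title = c("Movie A", "Movie B", "Movie C"),
  stringsAsFactors = FALSE
)
ratings <- data.frame(
  movieId = c(1, 3),
  avg_rating = c(3.5, 4.2)
)

# Inner join: merge() keeps only movieIds present in both tables,
# so movie 2 (no rating) is dropped
movies_rated <- merge(movies, ratings, by = "movieId")

# Also drop rows whose average rating is missing
movies_rated <- movies_rated[!is.na(movies_rated$avg_rating), ]
print(movies_rated)
```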

3. Prepare the dataset for the recommender engine as a matrix
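A possible shape for this step, assuming the numeric table from the earlier steps has a movieId column plus genre indicators and avg_rating. The toy values and the name feature_matrix are illustrative. The movie id is moved into the row names so that every cell of the matrix is a feature value.

```r
# Toy stand-in for the numeric movie table (illustrative values)
movies_numeric <- data.frame(
  movieId = c(1, 3),
  Adventure = c(1L, 1L),
  Comedy = c(1L, 0L),
  avg_rating = c(3.5, 4.2)
)

# Every column except the id becomes a matrix column
feature_cols <- setdiff(names(movies_numeric), "movieId")
feature_matrix <- as.matrix(movies_numeric[, feature_cols])

# Keep rows identifiable by movie id
rownames(feature_matrix) <- movies_numeric$movieId

# The recommender input should contain numeric values only
stopifnot(is.numeric(feature_matrix))
print(feature_matrix)
```

recommenderlab can coerce a numeric matrix like this with as(feature_matrix, "realRatingMatrix"); whether that is the class used later in this code-along is not stated here.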

4. Retrieve the full list of genres as a vector
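If the wide table already has one column per genre, the genre vector can be read off its column names. This sketch assumes movieId and avg_rating are the only non-genre columns; the table and variable names are illustrative.

```r
# Toy stand-in for the wide movie table (illustrative values)
movie_features <- data.frame(
  movieId = c(1, 3),
  Adventure = c(1L, 1L),
  Comedy = c(1L, 0L),
  Drama = c(0L, 1L),
  avg_rating = c(3.5, 4.2)
)

# Assumption: these are the only columns that are not genres
non_genre_cols <- c("movieId", "avg_rating")

# Everything else is a genre name
genres <- setdiff(colnames(movie_features), non_genre_cols)
print(genres)
```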

5. Retrieve the top 15 movie tags to filter out rarely used tags
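A minimal sketch of the tag filter: count how often each tag occurs, keep the n most frequent, and filter the tags table to those. The helper top_tag_names() is hypothetical, and the toy data has fewer than 15 distinct tags, so the usage below calls it with n = 2 to make the effect visible.

```r
# Toy stand-in for the tags table (illustrative values)
tags <- data.frame(
  userId = c(1, 2, 3, 4, 5, 6),
  movieId = c(1, 1, 2, 3, 3, 2),
  tag = c("funny", "funny", "classic", "funny", "classic", "dark"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n most frequently used tags
top_tag_names <- function(tags, n = 15) {
  tag_counts <- sort(table(tags$tag), decreasing = TRUE)
  names(head(tag_counts, n))
}

# With the real data, n = 15 would be used; n = 2 suits the toy table
top_tags <- top_tag_names(tags, n = 2)

# Keep only rows whose tag is among the most frequent ones
tags_filtered <- tags[tags$tag %in% top_tags, ]
print(top_tags)
print(tags_filtered)
```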