Building a Recommender System in R
Welcome to this code-along, where we will build a recommender system that recommends movies to users! Along the way, you'll learn how to prepare your data, explore it, and create the recommender system itself using recommenderlab. There will be time to answer any questions, so please add them!
recommenderlab is an R package that provides a framework for developing and testing recommender algorithms. Various algorithms are supported, including user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), association rule-based recommendation (AR), and many more. It was developed in 2016 by Michael Hahsler. Details can be found via this link.
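To make the idea of user-based collaborative filtering concrete, here is a minimal sketch on a made-up rating matrix. The user and movie names, the ratings, and the choice of `nn = 2` neighbours are all assumptions for illustration and are not taken from this code-along's data.

```r
library(recommenderlab)

# Toy user-by-movie rating matrix (NA = not rated); all values are made up
m <- matrix(
  c( 5,  4, NA,  1,
     4, NA,  2,  1,
     1,  2,  5, NA,
    NA,  1,  4,  5,
     5,  5, NA,  2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("u", 1:5), paste0("m", 1:4))
)

# recommenderlab works on its own sparse rating-matrix class
r <- as(m, "realRatingMatrix")

# Train a user-based model on the first four users
# (nn = 2 is an illustrative assumption, chosen because the toy data is tiny)
rec <- Recommender(r[1:4], method = "UBCF", parameter = list(nn = 2))

# Ask for at most one recommendation for the held-out fifth user
pred <- predict(rec, r[5], n = 1)
top <- as(pred, "list")
```

Swapping `method = "UBCF"` for `"IBCF"` would train an item-based model on the same matrix. With data this small, the returned list may well be empty for a given user.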
Load packages
library(aws.s3)
library(tidyverse)
library(qdapTools)
library(recommenderlab)
The dataset
Acknowledgements
The datasets are collected by MovieLens, a research site run by GroupLens Research at the University of Minnesota. MovieLens uses "collaborative filtering" technology to make recommendations of movies that you might enjoy and to help you avoid the ones that you won't.
There are several datasets for different purposes. For this demo, we use the full dataset from this webpage, which contains user ratings and tags for 62,000 movies. The dataset was updated in 2018.
The dataset for this demo is split into three csv files: movies, ratings, and tags. All three can be joined by the movieId key.
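As a minimal sketch of how the three tables relate through `movieId`, the following uses small made-up data frames that mirror the data dictionary. The titles, ratings, and tags are illustrative assumptions, not rows from the real files.

```r
library(dplyr)

# Toy stand-ins for the three files (values are made up for illustration)
movies <- data.frame(
  movieId = c(1, 2, 3),
  title   = c("Movie A (1995)", "Movie B (1995)", "Movie C (1995)"),
  genres  = c("Adventure|Comedy", "Drama", "Comedy")
)
ratings <- data.frame(movieId = c(1, 2), avg_rating = c(3.9, 3.2))
tags <- data.frame(
  userId    = c(10, 11, 10),
  movieId   = c(1, 1, 3),
  tag       = c("funny", "classic", "funny"),
  timestamp = c(1139045764, 1139045800, 1139045900)
)

# inner_join keeps only the movies that have an average rating
movies_rated <- inner_join(movies, ratings, by = "movieId")

# left_join keeps every rated movie and fills tag columns with NA
# when a movie has no tags; movies with several tags appear once per tag
movies_full <- left_join(movies_rated, tags, by = "movieId")
```

Whether to use an inner or a left join depends on whether movies without a match should be dropped or kept with missing values.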
Data Dictionary
movies
| variable | class | description |
|---|---|---|
| movieId | numeric | The unique id of the movie |
| title | character | The title of the movie |
| genres | character | The genres the movie can be categorized in |
ratings
Table with movies and their average rating. Movies that received fewer than two ratings were removed.
| variable | class | description |
|---|---|---|
| movieId | numeric | The unique id of the movie |
| avg_rating | numeric | Average rating received |
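The file used here already contains the averages, so the following is only a sketch of how such a table could be derived. It assumes a hypothetical per-user table `raw_ratings` with columns `userId`, `movieId`, and `rating`, which is not one of the files loaded in this code-along.

```r
library(dplyr)

# Hypothetical per-user ratings (made-up values)
raw_ratings <- data.frame(
  userId  = c(1, 2, 3, 1, 2),
  movieId = c(1, 1, 1, 2, 3),
  rating  = c(4, 5, 3, 2, 4)
)

# Average per movie, then drop movies with fewer than two ratings
ratings_by_movie <- raw_ratings %>%
  group_by(movieId) %>%
  summarise(avg_rating = mean(rating), n_ratings = n()) %>%
  filter(n_ratings >= 2) %>%
  select(movieId, avg_rating)
```

In this toy data only movie 1 has at least two ratings, so it is the only row that survives the filter.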
tags
| variable | class | description |
|---|---|---|
| userId | numeric | A unique identifier for the user that gave the tag |
| movieId | numeric | The unique id of the movie |
| tag | character | The tag that was given to the movie |
| timestamp | numeric | The timestamp at which the user gave the tag |
Load your Data
# ratings = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/ratings_by_movie.csv")
# movies = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/movies.csv")
# tags = s3read_using(FUN = read.csv, bucket = "datacamp-workspacedemo-workspacedemos3-prod", object = "lca-rec-sys/tags.csv")
Data preprocessing
Create two tables:
- Matrix for recommender model (only numeric values)
- Cleaned, fully joined table with all data for final output
1. Split genres to one genre per column per movie, only keep numeric values
2. Add average rating to movies, filter out movies without rating
3. Prepare dataset for recommender engine as matrix
4. Retrieve full list of genres as a vector
5. Retrieve the 15 most-used movie tags to filter out rarely used tags
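The five steps above can be sketched on toy data as follows. This is one possible approach under stated assumptions: the data frames are made up, and the genre split uses `tidyr::separate_rows()` and `pivot_wider()` as a stand-in rather than the qdapTools package loaded earlier.

```r
library(dplyr)
library(tidyr)

# Toy inputs mirroring the data dictionary (values are made up)
movies <- data.frame(
  movieId = c(1, 2, 3),
  title   = c("Movie A", "Movie B", "Movie C"),
  genres  = c("Adventure|Comedy", "Drama", "Comedy"),
  stringsAsFactors = FALSE
)
ratings <- data.frame(movieId = c(1, 2), avg_rating = c(3.9, 3.2))
tags <- data.frame(
  userId  = c(10, 11, 10, 12),
  movieId = c(1, 1, 2, 2),
  tag     = c("funny", "classic", "funny", "slow"),
  stringsAsFactors = FALSE
)

# Step 1: one 0/1 column per genre, one row per movie
genre_flags <- movies %>%
  select(movieId, genres) %>%
  separate_rows(genres, sep = "\\|") %>%
  mutate(flag = 1) %>%
  pivot_wider(names_from = genres, values_from = flag,
              values_fill = list(flag = 0))

# Step 2: add the average rating; inner_join drops unrated movies
model_data <- inner_join(genre_flags, ratings, by = "movieId")

# Step 3: numeric matrix for the recommender, with movieId as row names
model_matrix <- as.matrix(select(model_data, -movieId))
rownames(model_matrix) <- model_data$movieId

# Step 4: full list of genres as a character vector
genre_list <- setdiff(names(genre_flags), "movieId")

# Step 5: the 15 most-used tags, then keep only rows with those tags
top_tags <- tags %>%
  count(tag, sort = TRUE) %>%
  head(15) %>%
  pull(tag)
tags_top <- filter(tags, tag %in% top_tags)
```

The toy data has only three distinct tags, so the top-15 filter keeps all of them here; on the full dataset it would drop the long tail of rarely used tags.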