Code-along 2025-01-15 Build an AI Movie Night Recommendation Tool

In this code-along, we will be building an AI Movie Night Recommendation Tool!

To do this, we will be using two data sets:

  • Movies metadata: A data set containing metadata of about 9000 movies (title, description, etc.)
  • User ratings: A data set containing ratings of how much someone liked a movie.

We will be building towards our end goal by covering the following tasks:

  • Understanding the data set by doing some basic exploratory analysis
  • Building a first recommender based on movie popularity or movie ratings
  • Personalising recommendations by exploiting user ratings
  • Leveraging LLMs to calculate similarity between movies
  • Generating a recommendation by writing what kind of movies you'd like to see
  • Putting it all together into one single recommendation tool

This code-along is aimed at anyone just starting to code: it shows how you can build something useful simply by writing prompts to analyse data sets. The generated code is nonetheless challenging enough for more seasoned data practitioners to play around with.

Task 1: Import the ratings and movie metadata and explore it.

The data is contained in two CSV files named movies_metadata.csv and ratings.csv.

movies_metadata contains the following columns:

  • movie_id: Unique identifier of each movie.
  • title: Title of the movie.
  • overview: Short description of the movie.
  • vote_average: Average score the movie got.
  • vote_count: Total number of votes the movie got.

ratings contains the following columns:

  • user_id: Unique identifier of the person who rated the movie.
  • movie_id: Unique identifier of the movie.
  • rating: Value between 0 and 10 indicating how much the person liked the movie.
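
Before exploring the real files, it can help to confirm that a file has the columns and value ranges listed above. The following is a minimal sketch under stated assumptions: the rows in toy_csv are made up purely for illustration and only mimic the ratings.csv layout; with the real data you would call pd.read_csv("ratings.csv") instead.

```python
import io

import pandas as pd

# Made-up rows that mimic the ratings.csv layout described above (illustrative only).
toy_csv = io.StringIO(
    "user_id,movie_id,rating\n"
    "1,10,8.0\n"
    "1,11,6.5\n"
    "2,10,9.0\n"
)
toy_ratings = pd.read_csv(toy_csv)

# Check that every documented column is present.
expected_columns = {"user_id", "movie_id", "rating"}
missing = expected_columns - set(toy_ratings.columns)

# Check that every rating falls inside the documented 0 to 10 range.
in_range = toy_ratings["rating"].between(0, 10).all()

print("Missing columns:", missing)
print("All ratings within 0-10:", in_range)
```

The same two checks can be applied to movies_metadata by swapping in its column names.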

Prompt

Read the movies_metadata file and count how many unique movies there are, visualise the vote_average column and visualise the vote_count column. Next read the ratings file, and count how many unique users have rated how many unique movies. Visualise the distribution of the rating column.

import pandas as pd
import matplotlib.pyplot as plt

# Load both data sets and report their shapes as (rows, columns)
movies_metadata = pd.read_csv('movies_metadata.csv')
print("Movies:", movies_metadata.shape)

ratings = pd.read_csv('ratings.csv')
print("Ratings:", ratings.shape)

# Count unique movies by identifier and by title
# (the two counts differ if several movie_ids share a title)
unique_movies = movies_metadata['movie_id'].nunique()
titles_movies = movies_metadata['title'].nunique()

print(movies_metadata.columns)
print("Unique:", unique_movies)
print("Titles:", titles_movies)

# Distribution of the average score per movie
plt.figure(figsize=(10, 5))
plt.hist(movies_metadata['vote_average'].dropna(), bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Vote Average")
plt.xlabel("Vote Average")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

# Distribution of the number of votes per movie
plt.figure(figsize=(10, 5))
plt.hist(movies_metadata['vote_count'].dropna(), bins=20, edgecolor='k', alpha=0.7)
plt.title("Distribution of Vote Count")
plt.xlabel("Vote Count")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

# Inspect the distinct rating values, then count unique users and rated movies
print("Rating values:", sorted(ratings['rating'].dropna().unique()))
unique_users = ratings['user_id'].nunique()
unique_ratings_movies = ratings['movie_id'].nunique()
print("unique_users:", unique_users)
print("unique_ratings_movies:", unique_ratings_movies)

# Distribution of the individual user ratings
plt.figure(figsize=(10, 5))
plt.hist(ratings['rating'].dropna(), bins=10, edgecolor='k', alpha=0.7)
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

Task 2: Simple recommender based on popularity or highest rating

In short, a recommender is any system that generates suggestions for an end user. We will start by creating the simplest recommender: one that ranks all movies by either the highest average score or the highest number of votes.

This kind of recommender generates the same output for anyone using it.

Prompt

Based on movies_metadata, generate a simple recommender that generates recommended movies by either the vote_average or the vote_count. The recommender should be configurable in how many movies it recommends and based on which criterion.

def simple_recommender(movies_metadata, criterion='vote_average', top_n=10):
    """Return the top_n movies ranked by vote_average or vote_count."""
    if criterion not in ('vote_average', 'vote_count'):
        raise ValueError("Criterion must be either 'vote_average' or 'vote_count'")

    # Sort from highest to lowest on the chosen criterion
    recommended_movies = movies_metadata.sort_values(by=criterion, ascending=False)
    # Keep only the columns needed to present the recommendation
    recommended_movies = recommended_movies[['movie_id', 'title', 'overview', criterion]].head(top_n)

    return recommended_movies


highest_count = simple_recommender(movies_metadata, criterion='vote_count', top_n=10)

highest_average = simple_recommender(movies_metadata, criterion='vote_average', top_n=10)

highest_count

highest_average
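
Ranking purely on vote_average may put movies with only a handful of votes at the top. One possible refinement, which is not part of the original code-along, is to require a minimum number of votes before ranking. The sketch below is an assumption-laden illustration: filtered_recommender, its min_votes parameter, and the toy_movies rows are all hypothetical names and made-up values.

```python
import pandas as pd


def filtered_recommender(movies, criterion="vote_average", top_n=10, min_votes=0):
    """Hypothetical variant: drop movies with fewer than min_votes votes, then rank."""
    if criterion not in ("vote_average", "vote_count"):
        raise ValueError("Criterion must be either 'vote_average' or 'vote_count'")
    eligible = movies[movies["vote_count"] >= min_votes]
    return eligible.sort_values(by=criterion, ascending=False).head(top_n)


# Made-up movies for illustration only.
toy_movies = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "title": ["Obscure Film", "Popular Film", "Average Film"],
        "vote_average": [10.0, 8.5, 6.0],
        "vote_count": [1, 5000, 300],
    }
)

# Without a threshold, the single-vote movie ranks first.
unfiltered = filtered_recommender(toy_movies, top_n=2)["title"].tolist()
# With a threshold of 100 votes, it is excluded before ranking.
filtered = filtered_recommender(toy_movies, top_n=2, min_votes=100)["title"].tolist()

print(unfiltered)
print(filtered)
```

A sensible value for min_votes depends on the vote_count distribution explored in Task 1.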

Task 3: Generate recommendations based on user ratings

We have already created a very simple first recommender, but we haven't touched our user data yet! How can it help us? After watching a movie you liked, you might want to know which other movies were liked by users who also watched it. This is where the user data comes in: we can use the ratings to infer which movies are similar to one you have already watched!
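
To make "similar" concrete, one common choice is the cosine similarity between two movies' rating vectors, where each vector holds one rating per user. The sketch below computes it by hand with NumPy on made-up ratings; the movie names, users, and values are assumptions for illustration, and 0 is used to mean "not rated".

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies, 0 means "not rated".
toy = pd.DataFrame(
    {
        "Movie A": [9, 8, 0, 1],
        "Movie B": [8, 9, 0, 2],
        "Movie C": [1, 0, 9, 8],
    },
    index=["user_1", "user_2", "user_3", "user_4"],
)

# Transpose so that each row is one movie's vector of user ratings.
vectors = toy.to_numpy(dtype=float).T
# Scale each movie vector to unit length; dot products then equal cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=toy.columns, columns=toy.columns)

print(similarity.round(2))
```

Movie A and Movie B were rated highly by the same users, so their similarity is close to 1, while Movie C appeals to a different group and scores much lower against both.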

Prompt

Create a recommender that uses the ratings data and generates movie recommendations when you put in a specific movie title.

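from sklearn.metrics.pairwise import cosine_similarity

def create_user_based_recommender(movies_metadata, ratings, movie_title, top_n=10):
    """Return the top_n movies whose user-rating patterns are most similar to movie_title."""
    # Attach movie titles to the individual ratings
    movies_ratings = pd.merge(ratings, movies_metadata, on='movie_id')

    # Rows are users, columns are movie titles; unrated movies are filled with 0
    user_movie_matrix = movies_ratings.pivot_table(index='user_id', columns='title', values='rating').fillna(0)

    # Fail with a clear message instead of a KeyError when the title has no ratings
    if movie_title not in user_movie_matrix.columns:
        raise ValueError(f"No ratings found for '{movie_title}'")

    # Transpose so that similarity is computed between movies rather than users
    movie_similarity = cosine_similarity(user_movie_matrix.T)
    movie_similarity_df = pd.DataFrame(movie_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)

    # Drop the movie itself, then keep the most similar titles
    similar_movies = movie_similarity_df[movie_title].drop(movie_title).sort_values(ascending=False).head(top_n)

    return similar_movies

movie_title = 'The Godfather'
recommended_movies = create_user_based_recommender(movies_metadata, ratings, movie_title, top_n=10)
recommended_movies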

Task 4: Generate embeddings based on the movie descriptions