
Christmas Movie Gross Predictor: Thanks, Tom Hanks!

A DALL·E 2 generated image of our movie's description. Movie poster, maybe?

Recommendation Sneak Peek and Project Scope

"The Magic of Bellmonte Lane" should receive a theatrical release! Our model predicts a domestic box office gross of ~$89M if it is released. That is a strong return on its $25M budget, which is itself below the median Christmas movie budget of $31M. The low end of our prediction, $52M in gross earnings, still puts our movie above the $50M median gross for other theatrically released Christmas movies and would net $27M over our $25M budget. At the high end, our movie could gross up to $127M, for potential net earnings of $102M!
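The net figures above are simply predicted gross minus production budget. As a quick check of that arithmetic, here is a minimal sketch; the dictionary keys and the `net_earnings` helper are illustrative names, not part of the project code.

```python
# Figures quoted in the recommendation, in millions of US dollars.
budget = 25
predicted_gross = {"low": 52, "expected": 89, "high": 127}


def net_earnings(gross, budget):
    """Net earnings as gross minus production budget (hypothetical helper)."""
    return gross - budget


# Net earnings for each scenario of the prediction range.
net = {name: net_earnings(gross, budget) for name, gross in predicted_gross.items()}

for name, value in net.items():
    print(f"{name}: gross ${predicted_gross[name]}M -> net ${value}M")
```

This treats the production budget as the only cost, which is the same simplification the recommendation makes.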

Below, we will see how we reach this conclusion: how the data is analyzed, how the features are engineered, and how the model is developed. Along the way, we'll also quantify Tom Hanks's involvement in our movie and build a star power calculator. It is a long journey ahead, so let's get started!

First, we'll take a quick look at the competition scenario, data, and our hypothetical Christmas movie's details below.

(Please note that a lot of the commentary in this project refers to the code output, so it may not make much sense in reader mode.)

Predicting Christmas Movie Grossings

📖 Background

Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

The data

We're providing you with a dataset of 788 Christmas movies, with the following columns:

  • christmas_movies.csv
| Variable | Description |
| --- | --- |
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |

You may also use an additional dataset of 1000 high-rated movies, with the following columns:

  • imdb_top1k.csv
| Variable | Description |
| --- | --- |
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |

Finally, you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:

  • movie_budgets.csv
| Variable | Meaning |
| --- | --- |
| year | year the movie was released |
| date | date the movie was released |
| title | title of the movie |
| production budget | production budget in US dollars |

Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.
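One way to augment the Christmas movies with budgets is a left join on title and release year. The sketch below uses tiny made-up frames in place of the real CSV files, and it assumes that titles and years match exactly across the two datasets, which may not hold for the real data. The `budget` column name is introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for christmas_movies.csv and movie_budgets.csv (values are made up).
xmas = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [2003, 2010, 2018],
    "gross": [1.0e8, None, 5.0e7],
})
budgets = pd.DataFrame({
    "year": [2003, 2018, 1999],
    "title": ["Movie A", "Movie C", "Movie D"],
    "production budget": [3.0e7, 2.0e7, 1.0e7],
})

# Align the column names, then left-join so that every Christmas movie is kept
# even when no budget is available for it.
merged = xmas.merge(
    budgets.rename(columns={"year": "release_year", "production budget": "budget"}),
    on=["title", "release_year"],
    how="left",
)

print(merged)
```

A left join keeps the Christmas movies as the unit of analysis; movies without a matching budget simply get a missing `budget` value.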

Our Movie

Title: The Magic of Bellmonte Lane

Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

Director: Greta Gerwig

Cast:

  • Emma Thompson as Emily, a kind-hearted and curious woman
  • Ian McKellen as Mr. Grayson, the stern corporate developer
  • Tom Hanks as George, the wise and elderly owner of the local cafe
  • Zoe Saldana as Sarah, Emily's supportive best friend
  • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

Runtime: 105 minutes

Genres: Family, Fantasy, Romance, Holiday

Production budget: $25M

Exploring the Target

Since our model is aiming to predict the domestic gross box office earnings of our movie, the gross feature of the xmas_movies dataset will be our target.

import pandas as pd
import numpy as np
xmas_movies = pd.read_csv('data/christmas_movies.csv')
xmas_movies.sample(25, random_state=0)

Lots of nulls in our target, gross! Looking at the rating feature, I would assume that the movies missing a gross were made for TV or did not make it to the box office.
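The assumption about ratings could be checked by computing the share of missing gross values per rating. This is a minimal sketch on made-up rows, not the actual dataset; the `toy` frame and the `gross_missing` column are hypothetical names.

```python
import pandas as pd

# Made-up rows that mimic the pattern seen in the sample above.
toy = pd.DataFrame({
    "rating": ["PG", "TV-G", "TV-G", "PG", "TV-PG"],
    "gross": [7.1e7, None, None, 1.2e8, None],
})

# Share of rows with a missing gross for each rating.
# dropna=False would keep movies with no rating as their own group.
missing_share = (
    toy.assign(gross_missing=toy["gross"].isna())
       .groupby("rating", dropna=False)["gross_missing"]
       .mean()
)

print(missing_share)
```

If the TV ratings show a share close to 1 on the real data, that would support the made-for-TV explanation.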

df = xmas_movies.copy()
# info() prints its summary directly and returns None, so no print() is needed
df.gross.info()

Only ~10% of the movies have a gross value. Sadly, let's drop the others for now and take a closer look at the remaining gross values.

# Keep only rows with a gross value; a list for subset works across pandas versions
df = df.dropna(subset=['gross']).reset_index(drop=True)
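The share of usable rows and the effect of the drop can be computed directly. The sketch below runs on a made-up ten-row frame where one row has a gross, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Ten made-up movies, only one of which has a gross value.
toy = pd.DataFrame({
    "title": [f"Movie {i}" for i in range(10)],
    "gross": [np.nan] * 9 + [5.0e7],
})

# Fraction of rows where gross is present.
share_with_gross = toy["gross"].notna().mean()

# Drop rows without a gross and renumber the index from zero.
kept = toy.dropna(subset=["gross"]).reset_index(drop=True)

print(f"{share_with_gross:.0%} of rows have a gross; {len(kept)} of {len(toy)} rows kept")
```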