Merry Metrics - Predicting Christmas Movie Grossings
Table of contents:
- Summary
- Introduction and Background
- Data and Challenge
- Imported libraries
- Functions
- Data Examination
- Data Cleaning
- Exploratory Analysis
- Feature Engineering
- Model selection, evaluation, prediction
Executive Summary: Predictive Modeling for Movie Grossing
Prediction for "The Magic of Bellmonte Lane":
Predicted Gross: $577.24 million
Evaluation Metrics: Test RMSE: $49.94 million, Test R-squared: 0.77
Machine learning model: Voting Regressor (Random Forest Regressor + Gradient Boosting Regressor)
The predicted gross is roughly 23 times the $25M production budget and well above the average gross of movies.
The model therefore suggests that the new movie would be a blockbuster.
Preprocessing, feature engineering, and predictive modeling for handling missing values by imputation:
- To improve the predictive models, a diverse set of features was engineered. These include features related to the stars, directors, genres, and language processing.
- Applied the Yeo-Johnson transformation and RobustScaler to preprocess the input features. Used a logarithmic transformation and RobustScaler for the target feature.
- Used a fine-tuned Random Forest Regressor to impute missing values: for gross, with a Test RMSE of 0.15 and R-squared of 0.94; for production budget, with a Test R-squared of 0.95 and Mean Squared Error of 0.05.
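The preprocessing and ensemble named in this summary can be illustrated with a minimal scikit-learn sketch. It is not the notebook's actual pipeline: the data is synthetic, the hyperparameters are untuned, and `log1p` stands in for the logarithmic target transformation (an assumption, chosen so that zero values would not break the transform).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer, RobustScaler

# Synthetic stand-in data (an assumption): two skewed numeric features and a
# positive, right-skewed target playing the role of gross.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(300, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.lognormal(mean=1.0, sigma=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Voting ensemble that averages a random forest and a gradient boosting model.
ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
])

# Input side: Yeo-Johnson transformation, then RobustScaler, then the ensemble.
# standardize=False leaves all scaling to RobustScaler.
regressor = make_pipeline(
    PowerTransformer(method="yeo-johnson", standardize=False),
    RobustScaler(),
    ensemble,
)

# Target side: log1p, then RobustScaler. TransformedTargetRegressor applies
# the inverse steps to predictions, so they come back on the original scale.
target_steps = make_pipeline(
    FunctionTransformer(np.log1p, inverse_func=np.expm1),
    RobustScaler(),
)
model = TransformedTargetRegressor(regressor=regressor, transformer=target_steps)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"Test RMSE: {rmse:.2f}, Test R-squared: {r2:.2f}")
```

Wrapping the target transformation in `TransformedTargetRegressor` keeps the reported RMSE in the target's original units, which is what makes a figure such as "$49.94 million" interpretable.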
Limitations and Future Improvements:
- Missing Data: A significant percentage of gross and production budget data is missing, impacting prediction accuracy.
- Additional Features: Incorporating features like production companies, countries, languages, and popularity scores may enhance predictions.
Exploratory Data Analysis:
Gross Statistics:
Average gross for Christmas movies:
Average gross for top 1k movies:
Star and Director Analysis:
- All featured stars appeared at least twice in the top 1000 films.
- The average gross of the films in which the featured stars appeared exceeds the production budget of the new film.
- Tom Hanks is the most frequently featured star, appearing in 13 top films.
- Zoe Saldana starred in the top 3 highest-grossing movies with a maximum gross of $760.51 million.
- Greta Gerwig directed one film on the top list, grossing $108.10 million.
Genres Analysis:
- Family, Fantasy, and Romance are the top frequent genres for Christmas movies.
- Romance shows an increasing trend until the 2000s.
- Romance and Fantasy are among the top 5 highest-grossing genres in recent years.
Correlation and Insights:
- A strong positive correlation (0.76) exists between the gross and the production budget.
- There is a strong correlation between the gross and the average gross of films featuring the same stars or directed by the same director.
- There is no linear relationship between IMDb rating and gross, suggesting that critical acclaim does not always translate into a higher gross.
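For readers who want to see what a correlation figure such as 0.76 measures, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The numbers are made up for illustration and are not taken from the datasets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical budgets and grosses (in $ millions) that rise together.
budget = [10, 20, 30, 40, 50]
gross = [25, 45, 60, 95, 110]
r_budget = pearson_r(budget, gross)

# Hypothetical ratings paired with grosses that have no linear trend.
rating = [1, 2, 3, 4, 5]
gross_vs_rating = [4, 1, 0, 1, 4]
r_rating = pearson_r(rating, gross_vs_rating)

print(f"budget vs gross: {r_budget:.2f}")
print(f"rating vs gross: {r_rating:.2f}")
```

A coefficient near zero only rules out a linear relationship; the second toy series is clearly patterned even though its coefficient is zero.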
Conclusion:
- The predictive models show promise in estimating gross based on available features. Addressing missing data and incorporating additional features could further enhance model accuracy.
- Christmas movies, particularly in Romance and Fantasy genres, remain a robust industry trend.
- The new Christmas movie appears well placed to succeed, since its genres and stars have consistently contributed to high-grossing films.
Introduction
In the enchanting realm of Christmas movies, where every frame exudes festive magic, have you ever found yourself completely immersed in the captivating tales that unfold on the silver screen? The joy of snuggling up in a cinema seat, surrounded by twinkling lights and the promise of heartwarming narratives, is a cherished tradition during the holiday season.
Have you ever wondered about the financial side behind the scenes? That magical movie that transports you into a winter wonderland - how much revenue does it generate at the box office?
In the dynamic realm of cinema, financial success depends on a delicate interplay of factors, ranging from onscreen chemistry to strategic release timing. Since the inception of the industry, the search for a winning formula has persisted. Can past movie data help identify patterns and create a predictive model for revenue?
In this project, we explore the factors that contribute to the success of Christmas movies and develop a machine-learning model. We then evaluate the model's performance with metrics such as the R² score and Root Mean Square Error (RMSE) to demystify the art of predicting a movie's commercial success.
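As a reference for the two metrics named above, the following plain-Python sketch computes RMSE and the R² score from their definitions. The values are toy numbers chosen for illustration, not results from this project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy actual and predicted grosses (in $ millions).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"R2:   {r2_score(actual, predicted):.3f}")
```

RMSE is expressed in the same units as the target, while R² is unitless and equals 1 for perfect predictions.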
📖 Background
Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.
Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?
💪 Competition challenge
Create a report that covers the following:
- Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
    - Analysis of the genres
    - Descriptive statistics and histograms of the grossings
    - Word clouds
- Develop a model to predict the movie's domestic gross based on the available features.
    - Remember to preprocess and clean the data first.
    - Think about what features you could define (feature engineering), e.g.:
        - number of times a director appeared in the top 1000 movies list,
        - highest grossing for lead actor(s),
        - decade released
- Evaluate your model using appropriate metrics.
- Explain some of the limitations of the models you have developed. What other data might help improve the model?
- Use your model to predict the grossing of the following fictitious Christmas movie:
Title: The Magic of Bellmonte Lane
Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.
Director: Greta Gerwig
Cast:
- Emma Thompson as Emily, a kind-hearted and curious woman
- Ian McKellen as Mr. Grayson, the stern corporate developer
- Tom Hanks as George, the wise and elderly owner of the local cafe
- Zoe Saldana as Sarah, Emily's supportive best friend
- Jacob Tremblay as Timmy, a young boy with a special Christmas wish
Runtime: 105 minutes
Genres: Family, Fantasy, Romance, Holiday
Production budget: $25M
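The feature ideas suggested in the challenge (director appearances in the top 1000 list, highest gross per lead actor, release decade) could be derived along these lines. This is a sketch on a toy table with placeholder names and invented figures; it assumes that `stars` is stored as a comma-separated string, and the `featurize` helper is hypothetical.

```python
import pandas as pd

# Toy stand-in for imdb_top1k.csv; every value here is invented.
top1k = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [1994, 2003, 2011, 2019],
    "director": ["Dir One", "Dir Two", "Dir One", "Dir Three"],
    "stars": ["Star X, Star Y", "Star Y, Star Z", "Star X", "Star Z, Star X"],
    "gross": [100.0, 250.0, 80.0, 400.0],
})

# Decade released: floor the year to the nearest ten.
top1k["decade"] = (top1k["release_year"] // 10) * 10

# Number of times each director appears in the list.
director_counts = top1k["director"].value_counts()

# Highest gross per star: one row per (movie, star), then a grouped maximum.
star_rows = top1k.assign(star=top1k["stars"].str.split(", ")).explode("star")
star_max_gross = star_rows.groupby("star")["gross"].max()


def featurize(director, stars):
    """Hypothetical helper: look up the engineered features for one movie."""
    return {
        "director_top1k_count": int(director_counts.get(director, 0)),
        "max_star_gross": max(
            (float(star_max_gross.get(s, 0.0)) for s in stars), default=0.0
        ),
    }


new_movie = featurize("Dir One", ["Star Y", "Unknown Star"])
print(new_movie)
```

Unknown directors or stars fall back to zero here; whether zero or a missing value is the better default depends on the model being trained.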
💾 The data
We're providing you with a dataset of 788 Christmas movies, with the following columns:
christmas_movies.csv
Variable | Description |
---|---|
title | the title of the movie |
release_year | year the movie was released |
description | short description of the movie |
type | the type of production e.g. Movie, TV Episode |
rating | the rating/certificate e.g. PG |
runtime | the movie runtime in minutes |
imdb_rating | the IMDB rating |
genre | list of genres e.g. Comedy, Drama etc. |
director | the director of the movie |
stars | list of actors in the movie |
gross | the domestic gross of the movie in US dollars (what we want to predict) |
You may also use an additional dataset of 1000 high-rated movies, with the following columns:
imdb_top1k.csv
Variable | Description |
---|---|
title | the title of the movie |
release_year | year the movie was released |
description | short description of the movie |
type | the type of production e.g. Movie, TV Episode |
rating | the rating/certificate e.g. PG |
runtime | the movie runtime in minutes |
imdb_rating | the IMDB rating |
genre | list of genres e.g. Comedy, Drama etc. |
director | the director of the movie |
stars | list of actors in the movie |
gross | the domestic gross of the movie in US dollars (what we want to predict) |
Finally you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:
movie_budgets.csv
Variable | Meaning |
---|---|
year | year the movie was released |
date | date the movie was released |
title | title of the movie |
production budget | production budget in US dollars |
Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.
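One way to augment the Christmas movies with production budgets is a left join on title and release year. The sketch below uses toy tables that mirror the column names documented above; the join keys and the renamed `production_budget` column are assumptions, and real titles may need normalization before they match.

```python
import pandas as pd

# Toy stand-ins for christmas_movies.csv and movie_budgets.csv.
xmas = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [2001, 2010, 2015],
    "gross": [50_000_000.0, None, 12_000_000.0],
})
budgets = pd.DataFrame({
    "year": [2001, 2015],
    "title": ["Movie A", "Movie C"],
    "production budget": [20_000_000.0, 5_000_000.0],
})

# Align the column names, then keep every Christmas movie (how="left").
budgets_renamed = budgets.rename(
    columns={"year": "release_year", "production budget": "production_budget"}
)
merged = xmas.merge(budgets_renamed, on=["title", "release_year"], how="left")

# Share of movies that still lack a budget after the join.
missing_share = merged["production_budget"].isna().mean()
print(merged)
print(f"Missing budget share: {missing_share:.0%}")
```

A left join preserves all Christmas movies, so the share of missing budgets after the merge shows how much imputation would be needed.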
import pandas as pd
xmas_movies = pd.read_csv('data/christmas_movies.csv')
xmas_movies