Predicting Christmas Movie Grossings

📖 Background

Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

💾 The data

We're providing you with a dataset of 788 Christmas movies, with the following columns:

  • christmas_movies.csv
| Variable | Description |
|---|---|
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |

You may also use an additional dataset of 1000 high-rated movies, with the following columns:

  • imdb_top1k.csv
| Variable | Description |
|---|---|
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |

Finally, you have access to a dataset of production budgets for over 6,000 movies, with the following columns:

  • movie_budgets.csv
| Variable | Meaning |
|---|---|
| year | year the movie was released |
| date | date the movie was released |
| title | title of the movie |
| production budget | production budget in US dollars |

Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the domestic gross of Christmas movies only.
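One way to augment the Christmas movies with budget information is a left join on title and release year. The sketch below uses tiny made-up frames that mirror the column names listed above. The helper `normalise_title`, the `title_key` column, and all values are assumptions for illustration and are not part of the provided data.

```python
import pandas as pd

# Toy stand-ins for christmas_movies.csv and movie_budgets.csv (values are made up).
xmas = pd.DataFrame({
    "title": ["Snowy Lane", "A Winter Wish ", "Holiday Lights"],
    "release_year": [2001, 2010, 2015],
    "gross": [12_000_000.0, None, 3_500_000.0],
})
budgets = pd.DataFrame({
    "year": [2001, 2010, 1999],
    "title": ["snowy lane", "A Winter Wish", "Unrelated Film"],
    "production budget": [5_000_000, 8_000_000, 1_000_000],
})


def normalise_title(titles: pd.Series) -> pd.Series:
    """Hypothetical helper: strip whitespace and lower-case so titles match across files."""
    return titles.str.strip().str.lower()


xmas_keyed = xmas.assign(title_key=normalise_title(xmas["title"]))
budgets_keyed = (
    budgets.assign(title_key=normalise_title(budgets["title"]))
    .rename(columns={"year": "release_year", "production budget": "production_budget"})
    .drop(columns="title")
)

# A left join keeps every Christmas movie; unmatched rows get a missing budget.
merged = xmas_keyed.merge(
    budgets_keyed, on=["title_key", "release_year"], how="left", indicator=True
)
print(merged[["title", "release_year", "production_budget", "_merge"]])
```

The `indicator=True` flag adds a `_merge` column, which makes it easy to count how many Christmas movies found no budget and will need imputation or exclusion later.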

# Install XGBoost, an open-source library for gradient boosting.
!pip install xgboost
# Install Altair Viewer, an offline viewer for Altair charts.
!pip install altair_viewer
# Upgrade PyArrow to the latest release to pick up new features and bug fixes.
!pip install --upgrade pyarrow
# Install Prince, a library for multivariate exploratory data analysis (e.g. PCA, MCA, MFA).
!pip install prince
# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
from matplotlib import gridspec  # customized subplot layouts
import seaborn as sns
import altair as alt  # interactive charts
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image  # Pillow, for working with images

# Preprocessing: imputation, scaling, encoding, feature selection
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_selection import RFE  # recursive feature elimination
from sklearn.compose import ColumnTransformer  # per-column transformations
from sklearn.pipeline import Pipeline

# Dimensionality reduction: PCA (scikit-learn) and MCA (prince)
from sklearn.decomposition import PCA
import prince
from prince import MCA

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree  # visualizing decision trees
from xgboost import XGBRegressor
import statsmodels.api as sm

# Model selection and evaluation
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score



xmas_movies = pd.read_csv('data/christmas_movies.csv')
xmas_movies
top1k_movies = pd.read_csv('data/imdb_top1k.csv')
top1k_movies
movie_budgets = pd.read_csv('data/movie_budgets.csv')
movie_budgets

💪 Competition challenge

Create a report that covers the following:

  1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
    • Analysis of the genres
    • Descriptive statistics and histograms of the grossings
    • Word clouds
  2. Develop a model to predict the movie's domestic gross based on the available features.
    • Remember to preprocess and clean the data first.
    • Think about what features you could define (feature engineering), e.g.:
      • number of times a director appeared in the top 1000 movies list,
      • highest grossing for lead actor(s),
      • decade released
  3. Evaluate your model using appropriate metrics.
  4. Explain some of the limitations of the models you have developed. What other data might help improve the model?
  5. Use your model to predict the grossing of the following fictitious Christmas movie:

Title: The Magic of Bellmonte Lane

Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

Director: Greta Gerwig

Cast:

  • Emma Thompson as Emily, a kind-hearted and curious woman
  • Ian McKellen as Mr. Grayson, the stern corporate developer
  • Tom Hanks as George, the wise and elderly owner of the local cafe
  • Zoe Saldana as Sarah, Emily's supportive best friend
  • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

Runtime: 105 minutes

Genres: Family, Fantasy, Romance, Holiday

Production budget: $25M
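The feature-engineering ideas from step 2 (director appearances in the top 1000 list, highest gross for lead actors, decade released) can be sketched in pandas and applied to the fictitious movie. The frame below is a toy stand-in with made-up values. The comma-separated format of `stars`, the release year of the new movie, and the feature names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for imdb_top1k.csv (values are made up; `stars` assumed comma-separated).
top1k = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "director": ["Greta Gerwig", "Greta Gerwig", "Someone Else"],
    "stars": ["Tom Hanks, Actor X", "Emma Thompson, Actor Y", "Tom Hanks, Actor Z"],
    "gross": [100.0, 40.0, 250.0],
})

# Feature 1: number of times each director appears in the top 1000 list.
director_counts = top1k["director"].value_counts()

# Feature 2: highest gross recorded for each actor in the reference data.
stars_long = top1k.assign(star=top1k["stars"].str.split(",")).explode(
    "star", ignore_index=True
)
stars_long["star"] = stars_long["star"].str.strip()
actor_max_gross = stars_long.groupby("star")["gross"].max()


def decade(year: int) -> int:
    """Feature 3: decade of release, e.g. 2023 -> 2020."""
    return (year // 10) * 10


# The fictitious movie from the brief; release_year is an assumption (the brief gives none).
new_movie = {
    "title": "The Magic of Bellmonte Lane",
    "director": "Greta Gerwig",
    "stars": ["Emma Thompson", "Ian McKellen", "Tom Hanks", "Zoe Saldana", "Jacob Tremblay"],
    "runtime": 105,
    "production_budget": 25_000_000,
    "release_year": 2023,
}

# Actors missing from the reference data are ignored; fall back to 0.0 if none match.
lead_max = actor_max_gross.reindex(new_movie["stars"]).max()

features = {
    "director_top1k_count": int(director_counts.get(new_movie["director"], 0)),
    "lead_max_gross": 0.0 if pd.isna(lead_max) else float(lead_max),
    "decade": decade(new_movie["release_year"]),
}
print(features)
```

Computing these lookups from the reference data only, and then applying them to new rows, keeps the same transformation usable for both the training set and the movie to be predicted.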

Executive Summary:

This project combines descriptive, predictive, and prescriptive analytics to study what drives the domestic gross of Christmas movies. We used Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA) for dimensionality reduction, and Random Forest and XGBoost as predictive models.

Recommendations:

- Descriptive Analytics: Extend the descriptive analysis to cover historical trends, industry patterns, and key performance indicators, and present the findings with clear visualizations.

- Diversification of Predictive Models: Compare Random Forest against other algorithms such as XGBoost. Different model families suit different data structures, and comparing them may improve predictive accuracy.

- Prescriptive Analytics Implementation: Turn predictions into actionable recommendations that guide decision-making for stakeholders in the film industry.

- Dimensionality Reduction Techniques: Continue exploring PCA and MCA to uncover latent structure in the dataset, reduce the number of features, and make the models easier to interpret.

- Model Validation: Use cross-validation to check that the predictive models are robust and generalize to unseen data, rather than relying on a single train/test split.

Together, these recommendations aim to strengthen the analytical framework and give decision-makers in the film industry more accurate predictions and more actionable insights for planning.
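The cross-validation recommendation could look like the following minimal sketch. It runs on synthetic data, so the scores say nothing about the real Christmas movies. The feature names (`runtime`, `production_budget`, `rating`), the log transform of the target, and the hyperparameters are illustrative assumptions, not the configuration used in this report.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption): gross loosely tracks budget, with noise.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "runtime": rng.normal(100, 15, n),
    "production_budget": rng.lognormal(16, 1, n),
    "rating": rng.choice(["G", "PG", "PG-13"], n),
})
X.loc[rng.choice(n, 20, replace=False), "runtime"] = np.nan  # simulate missing runtimes
y = np.log1p(X["production_budget"] * rng.lognormal(0, 0.5, n))  # log of synthetic gross

# Preprocessing lives inside the pipeline so it is re-fit on each training fold only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["runtime", "production_budget"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["rating"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
mae = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

print(f"R^2 per fold: {np.round(r2, 3)}")
print(f"Mean R^2: {r2.mean():.3f}, mean MAE (log scale): {mae.mean():.3f}")
```

Keeping imputation and encoding inside the `Pipeline` avoids leaking information from the validation fold into the training fold, which would otherwise make the cross-validated scores look better than they are.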

Exploratory Data Analysis: Exploring the Dataset through Informative Visualizations