Competition - Christmas movie grossings
  • Predicting Christmas Movie Grossings

    📖 Background

    Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

    Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

    💾 The data

    We're providing you with a dataset of 788 Christmas movies, with the following columns:

    • christmas_movies.csv
    | Variable | Description |
    |---|---|
    | title | the title of the movie |
    | release_year | year the movie was released |
    | description | short description of the movie |
    | type | the type of production e.g. Movie, TV Episode |
    | rating | the rating/certificate e.g. PG |
    | runtime | the movie runtime in minutes |
    | imdb_rating | the IMDB rating |
    | genre | list of genres e.g. Comedy, Drama etc. |
    | director | the director of the movie |
    | stars | list of actors in the movie |
    | gross | the domestic gross of the movie in US dollars (what we want to predict) |

    You may also use an additional dataset of 1000 high-rated movies, with the following columns:

    • imdb_top1k.csv
    | Variable | Description |
    |---|---|
    | title | the title of the movie |
    | release_year | year the movie was released |
    | description | short description of the movie |
    | type | the type of production e.g. Movie, TV Episode |
    | rating | the rating/certificate e.g. PG |
    | runtime | the movie runtime in minutes |
    | imdb_rating | the IMDB rating |
    | genre | list of genres e.g. Comedy, Drama etc. |
    | director | the director of the movie |
    | stars | list of actors in the movie |
    | gross | the domestic gross of the movie in US dollars |

    Finally, you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:

    • movie_budgets.csv
    | Variable | Meaning |
    |---|---|
    | year | year the movie was released |
    | date | date the movie was released |
    | title | title of the movie |
    | production budget | production budget in US dollars |

    Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.
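
    Before the analysis, a minimal loading sketch in Python, assuming the three CSVs sit in the working directory under the filenames above:

    ```python
    import pandas as pd

    # Load the three provided datasets
    xmas_movies = pd.read_csv("christmas_movies.csv")
    top1k_movies = pd.read_csv("imdb_top1k.csv")
    movie_budgets = pd.read_csv("movie_budgets.csv")

    print(xmas_movies.shape)         # expect 788 rows
    print(xmas_movies.isna().sum())  # per-column null counts, discussed below
    ```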

    Dealing With Missing Values

    We observe from the summary count of null values that there are 707 rows missing a reported gross. This is a serious issue for our model, which has gross as its response variable.

    We cannot impute values that will be used as training targets unless we confidently know the nature and spread of the distribution of expected gross values.

    Hence, we choose to augment the xmas_movies data with the top1k_movies data, so that we can train a general predictor to estimate gross earnings for any movie.

    In order to distinguish whether the movies are xmas_movies, we'll add Christmas as a genre option, so that when we one-hot encode that column, xmas_movies will have a binary identifier. In this way we'll be able to (at least partially) predict success based on that characteristic.
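
    A minimal sketch of that augmentation, assuming genre holds comma-separated strings and using the frames loaded earlier:

    ```python
    import pandas as pd

    # Tag each Christmas movie by appending 'Christmas' to its genre list.
    xmas = xmas_movies.copy()
    xmas["genre"] = (xmas["genre"].fillna("") + ", Christmas").str.strip(", ")

    # Stack the two datasets to train a general gross predictor.
    # Note: titles appearing in both frames may need de-duplication.
    combined = pd.concat([xmas, top1k_movies], ignore_index=True)

    # Rows with no reported gross cannot supervise the model; drop them.
    combined = combined.dropna(subset=["gross"])

    # Expand the multi-label genre column; 'Christmas' becomes a binary flag.
    genres = combined["genre"].str.get_dummies(sep=", ")
    combined = combined.join(genres)
    ```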

    💪 Competition challenge

    Create a report that covers the following:

    1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
      • Analysis of the genres
      • Descriptive statistics and histograms of the grossings
      • Word clouds
    2. Develop a model to predict the movie's domestic gross based on the available features.
      • Remember to preprocess and clean the data first.
      • Think about what features you could define (feature engineering), e.g.:
        • number of times a director appeared in the top 1000 movies list,
        • highest grossing for lead actor(s),
        • decade released
    3. Evaluate your model using appropriate metrics.
    4. Explain some of the limitations of the models you have developed. What other data might help improve the model?
    5. Use your model to predict the grossing of the following fictitious Christmas movie:

    Title: The Magic of Bellmonte Lane

    Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

    Director: Greta Gerwig

    Cast:

    • Emma Thompson as Emily, a kind-hearted and curious woman
    • Ian McKellen as Mr. Grayson, the stern corporate developer
    • Tom Hanks as George, the wise and elderly owner of the local cafe
    • Zoe Saldana as Sarah, Emily's supportive best friend
    • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

    Runtime: 105 minutes

    Genres: Family, Fantasy, Romance, Holiday

    Production budget: $25M
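
    For reference, a hypothetical sketch of how this movie's specs might be assembled into a single prediction row; release_year and rating are assumptions not given in the brief, and preprocess/model stand in for the pipeline developed below:

    ```python
    import pandas as pd

    new_movie = pd.DataFrame([{
        "title": "The Magic of Bellmonte Lane",
        "release_year": 2024,   # assumption: upcoming release
        "type": "Movie",
        "rating": "PG",         # assumption: family-friendly certificate
        "runtime": 105,
        "genre": "Family, Fantasy, Romance, Holiday, Christmas",
        "director": "Greta Gerwig",
        "stars": "Emma Thompson, Ian McKellen, Tom Hanks, Zoe Saldana, Jacob Tremblay",
        "description": "A heartwarming tale set in the charming town of Bellmonte...",  # truncated for brevity
        "production_budget": 25_000_000,
    }])

    # predicted_gross = model.predict(preprocess.transform(new_movie))
    ```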

    Notes on use of BLS CPI data

    Given that we are evaluating 'success' in terms of box office gross, we will adjust our response variable for inflation prior to training the model. This should mitigate the effects of changing prices over time and allow us to fairly compare gross across time.

    Because this is a significant transformation, we will be sure to convert predictions back to nominal dollars before reporting them.

    The data are taken from the Bureau of Labor Statistics.

    Metadata and details below:

    From 1913 to 2023

    Data extracted on: January 14, 2024 (6:26:11 PM)

    Consumer Price Index for All Urban Consumers (CPI-U)

    Series Id: CUUR0000SA0

    Not Seasonally Adjusted

    Series Title: All items in U.S. city average, all urban consumers, not seasonally adjusted

    Area: U.S. city average

    Item: All items

    Base Period: 1982-84=100
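
    A minimal sketch of the adjustment, assuming a cpi DataFrame with year and cpi columns (annual averages of the series above); the 2023 base year is an assumption:

    ```python
    def adjust_for_inflation(df, cpi, base_year=2023):
        """Convert nominal gross to base-year dollars via annual average CPI."""
        cpi_by_year = cpi.set_index("year")["cpi"]
        factor = cpi_by_year[base_year] / df["release_year"].map(cpi_by_year)
        return df["gross"] * factor

    def to_nominal(adjusted, release_year, cpi, base_year=2023):
        """Invert the adjustment: base-year dollars back to nominal dollars."""
        cpi_by_year = cpi.set_index("year")["cpi"]
        return adjusted * cpi_by_year[release_year] / cpi_by_year[base_year]

    # combined["inf_adj_gross"] = adjust_for_inflation(combined, cpi)
    ```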

    Comments on Gross by Year and Volume

    Based on the plot above, there initially appears to be a cluster of high-grossing films in the early-to-mid 2000s. Prior to that point, release volumes were climbing slowly; afterward there is a marked increase in release volume but no corresponding boom in average gross.

    We use the mean gross here because, during initial exploration only, we want to preserve outlier effects on the aggregate. In this way we can survey for periods that contain blockbusters while also accounting for the effects of increased production volume and inflation on the overall profitability of the industry.
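
    The aggregation behind that view, as a sketch over the combined frame from earlier:

    ```python
    # Mean gross keeps blockbuster outliers visible; size tracks release volume.
    yearly = (
        combined.groupby("release_year")
        .agg(avg_gross=("gross", "mean"), n_releases=("gross", "size"))
        .reset_index()
    )
    yearly.plot(x="release_year", y=["avg_gross", "n_releases"], subplots=True)
    ```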

    So why are we looking at this?

    Because we're using historical data to predict whether the current film will succeed in the current market, we must normalize performance data across time. This is a challenge, but it can be accomplished by normalizing the data within these time windows. If the model's notion of 'performing well' is biased toward the boom years, where most of the data were collected, we may reject a release that would be quite profitable under current conditions.

    We will examine distributions across the years to see whether we are able to accurately normalize all the data to the parameters of a representative time.
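
    One way to run that check, sketched using the inf_adj_gross column from the adjustment above:

    ```python
    # Compare location and spread of adjusted gross decade by decade; large
    # shifts would argue against normalizing to a single representative time.
    combined["decade"] = (combined["release_year"] // 10) * 10
    print(combined.groupby("decade")["inf_adj_gross"].describe())
    combined.boxplot(column="inf_adj_gross", by="decade", rot=45)
    ```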

    Comments on Adjusted Gross

    The distribution of gross after adjusting for inflation appears sufficiently free of trend, and the relevant correlation coefficients are sufficiently low, for us to conclude that the inflation adjustment should be applied prior to training our predictive model.

    The output of this model will be an inflation-adjusted figure, but the transformation is invertible, so we can easily convert predictions back to nominal dollars using the imported CPI data.
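
    A quick sanity check of both claims, as a sketch:

    ```python
    # Adjusted gross should show little linear trend with release year.
    corr = combined["release_year"].corr(combined["inf_adj_gross"])
    print(f"corr(release_year, inf_adj_gross) = {corr:.3f}")

    # Invertibility: convert a base-year-dollar prediction back to nominal
    # dollars for the target release year using the helper defined earlier.
    # nominal_pred = to_nominal(adj_pred, release_year=2024, cpi=cpi)
    ```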

    Feature Transformation and Encoding

    At this point, our dataset is clean and organized. Additionally, we have adjusted the response variable, gross, to account for the effect that inflation has on revenues over time. Now we can proceed to examining the remaining features and engineering new ones as relationships are discovered and/or proposed.

    Next Steps
    1. Create a bag-of-words model, TF-IDF weighted, to determine imputation values for description;

    2. One-hot encode 'title', 'genre', 'director', 'rating' and 'stars';

    3. Split the data into train and test sets prior to normalizing;

    4. Use robust normalization on 'runtime' and 'inf_adj_gross', as well as the imputed values for 'description' (a sketch of steps 2–4 follows).
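
    A minimal sketch of steps 2–4, simplified to a few columns and assuming the inf_adj_gross column created earlier; for brevity it treats each full genre string as one category, whereas the multi-label expansion shown above is the better treatment:

    ```python
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, RobustScaler

    # Split first so scaler statistics are learned from training data only.
    X = combined.drop(columns=["gross", "inf_adj_gross"])
    y = combined["inf_adj_gross"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    preprocess = ColumnTransformer([
        ("cats", OneHotEncoder(handle_unknown="ignore"),
         ["genre", "director", "rating"]),
        ("nums", RobustScaler(), ["runtime"]),
    ])
    X_train_t = preprocess.fit_transform(X_train)  # fit on train only
    X_test_t = preprocess.transform(X_test)
    ```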

    Construct a TF-IDF matrix from a text column in your DataFrame

    We'll wrap three functions into a method here to completely transform a text column into a TF-IDF-weighted feature matrix.
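
    A sketch of that wrapper, assuming scikit-learn's TfidfVectorizer:

    ```python
    import re
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def clean_text(text):
        """Lowercase and strip non-alphabetic characters."""
        return re.sub(r"[^a-z\s]", " ", str(text).lower())

    def tfidf_matrix(df, column, max_features=500):
        """Clean a text column, fit TF-IDF, return a weighted feature frame."""
        docs = df[column].fillna("").map(clean_text)
        vec = TfidfVectorizer(stop_words="english", max_features=max_features)
        mat = vec.fit_transform(docs)
        return pd.DataFrame(
            mat.toarray(), columns=vec.get_feature_names_out(), index=df.index
        )

    # desc_features = tfidf_matrix(combined, "description")
    ```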

    Notes on Tokenizing and Text Frequency

    In order to effectively quantify the relationship between the text entries in description and the response variable, we must tokenize and partially lemmatize each description. We will use the TF-IDF method to weight each word and create a sparse matrix of weighted term frequencies.
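
    For the partial lemmatization, one option is a custom tokenizer passed to the vectorizer; this sketch assumes NLTK's WordNet lemmatizer is available:

    ```python
    import re
    import nltk
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    nltk.download("wordnet", quiet=True)   # lookup tables for the lemmatizer
    nltk.download("omw-1.4", quiet=True)

    lemmatizer = WordNetLemmatizer()

    def lemma_tokenize(text):
        # Default pos='n' lemmatizes nouns only, i.e. a partial lemmatization.
        return [lemmatizer.lemmatize(t) for t in re.findall(r"[a-z]+", text.lower())]

    vec = TfidfVectorizer(tokenizer=lemma_tokenize)
    # weights = vec.fit_transform(combined["description"].fillna(""))
    ```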

    🧑‍⚖️ Judging criteria

    Recommendations (35%)
    • Clarity of recommendations - how clear and well presented the recommendation is.
    • Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?
    • Number of relevant insights found for the target audience.

    Storytelling (35%)
    • How well the data and insights are connected to the recommendation.
    • How the narrative and whole report connect together.
    • Balancing making the report in-depth enough but also concise.

    Visualizations (20%)
    • Appropriateness of visualization used.
    • Clarity of insight from visualization.

    Votes (10%)
    • Up voting - most upvoted entries get the most points.

    ✅ Checklist before publishing into the competition

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
    • Make sure the workbook reads well and explains how you found your insights.
    • Try to include an executive summary of your recommendations at the beginning.
    • Check that all the cells run without error.

    ⌛️ Time is ticking. Good luck!