Competition - Christmas movies analysis
  • Predicting Christmas Movie Grossings

    📖 Background

    Imagine harnessing the power of data science to unveil the hidden potential of movies before they even hit the silver screen! As a data scientist at a forward-thinking cinema, you're at the forefront of an exhilarating challenge: crafting a cutting-edge system that doesn't just predict movie revenues, but reshapes the entire landscape of cinema profitability. This isn't just about numbers; it's about blending art with analytics to revolutionize how movies are marketed, chosen, and celebrated.

    Your mission? To architect a predictive model that dives deep into the essence of a movie - from its title and running time to its genre, captivating description, and star-studded cast. And what better way to sprinkle some festive magic on this project than by focusing on a dataset brimming with Christmas movies? A highly-anticipated Christmas movie is due to launch soon, but the cinema has some doubts. It wants you to predict its success, so it can decide whether to go ahead with the screening or not. It's a unique opportunity to blend the cheer of the holiday season with the rigor of data science, creating insights that could guide the success of tomorrow's blockbusters. Ready to embark on this cinematic adventure?

    💾 The data

    We're providing you with a dataset of 788 Christmas movies, with the following columns:

    • christmas_movies.csv
    Variable | Description
    title | the title of the movie
    release_year | year the movie was released
    description | short description of the movie
    type | the type of production e.g. Movie, TV Episode
    rating | the rating/certificate e.g. PG
    runtime | the movie runtime in minutes
    imdb_rating | the IMDB rating
    genre | list of genres e.g. Comedy, Drama etc.
    director | the director of the movie
    stars | list of actors in the movie
    gross | the domestic gross of the movie in US dollars (what we want to predict)

    You may also use an additional dataset of 1000 high-rated movies, with the following columns:

    • imdb_top1k.csv
    Variable | Description
    title | the title of the movie
    release_year | year the movie was released
    description | short description of the movie
    type | the type of production e.g. Movie, TV Episode
    rating | the rating/certificate e.g. PG
    runtime | the movie runtime in minutes
    imdb_rating | the IMDB rating
    genre | list of genres e.g. Comedy, Drama etc.
    director | the director of the movie
    stars | list of actors in the movie
    gross | the domestic gross of the movie in US dollars (what we want to predict)

    Finally, you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:

    • movie_budgets.csv
    Variable | Meaning
    year | year the movie was released
    date | date the movie was released
    title | title of the movie
    production budget | production budget in US dollars

    Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.

    import pandas as pd
    xmas_movies = pd.read_csv('data/christmas_movies.csv')
    xmas_movies
    top1k_movies = pd.read_csv('data/imdb_top1k.csv')
    top1k_movies
    movie_budgets = pd.read_csv('data/movie_budgets.csv')
    movie_budgets
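Since the brief allows augmenting the Christmas movies with the other tables, one possible way to attach a production budget to each Christmas movie is a left join on title and release year. The sketch below uses small made-up tables (`xmas_demo`, `budgets_demo`) and a hypothetical helper `normalise_title`; it assumes titles match after trimming and lower-casing, which may not hold for the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration)
xmas_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [2003, 2010, 2018],
})
budgets_demo = pd.DataFrame({
    "year": [2003, 2010, 1999],
    "title": ["Movie A", "movie b ", "Movie Z"],
    "production budget": [3.3e7, 1.0e7, 5.0e6],
})

def normalise_title(titles):
    """Hypothetical helper: trim whitespace and lower-case titles before joining."""
    return titles.str.strip().str.lower()

xmas_demo["title_key"] = normalise_title(xmas_demo["title"])
budgets_demo["title_key"] = normalise_title(budgets_demo["title"])

# Left join keeps every Christmas movie; unmatched movies get NaN for the budget
merged = xmas_demo.merge(
    budgets_demo[["title_key", "year", "production budget"]],
    how="left",
    left_on=["title_key", "release_year"],
    right_on=["title_key", "year"],
)
print(merged[["title", "release_year", "production budget"]])
```

A left join is used here so that Christmas movies without a budget record stay in the table and can be handled explicitly later.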

    💪 Competition challenge

    Create a report that covers the following:

    1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
      • Analysis of the genres
      • Descriptive statistics and histograms of the grossings
      • Word clouds
    2. Develop a model to predict the movie's domestic gross based on the available features.
      • Remember to preprocess and clean the data first.
      • Think about what features you could define (feature engineering), e.g.:
        • number of times a director appeared in the top 1000 movies list,
        • highest grossing for lead actor(s),
        • decade released
    3. Evaluate your model using appropriate metrics.
    4. Explain some of the limitations of the models you have developed. What other data might help improve the model?
    5. Use your model to predict the grossing of the following fictitious Christmas movie:

    Title: The Magic of Bellmonte Lane

    Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

    Director: Greta Gerwig

    Cast:

    • Emma Thompson as Emily, a kind-hearted and curious woman
    • Ian McKellen as Mr. Grayson, the stern corporate developer
    • Tom Hanks as George, the wise and elderly owner of the local cafe
    • Zoe Saldana as Sarah, Emily's supportive best friend
    • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

    Runtime: 105 minutes

    Genres: Family, Fantasy, Romance, Holiday

    Production budget: $25M
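Two of the feature ideas suggested in the challenge above (how often a director appears in the top 1000 list, and the decade of release) could be sketched roughly as follows. The tables `top1k_demo` and `xmas_demo` and the new column names `director_top1k_count` and `decade` are illustrative assumptions with made-up values, not part of the provided data.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration)
top1k_demo = pd.DataFrame({"director": ["Dir X", "Dir X", "Dir Y"]})
xmas_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "release_year": [1994, 2007],
    "director": ["Dir X", "Dir Q"],
})

# Count how many times each director appears in the high-rated list
director_counts = top1k_demo["director"].value_counts()

# Map the counts onto the Christmas movies; directors not in the list get 0
xmas_demo["director_top1k_count"] = (
    xmas_demo["director"].map(director_counts).fillna(0).astype(int)
)

# Floor the release year to its decade, e.g. 1994 -> 1990
xmas_demo["decade"] = (xmas_demo["release_year"] // 10) * 10

print(xmas_demo)
```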

    ✅ Checklist before publishing into the competition

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
    • Make sure the workbook reads well and explains how you found your insights.
    • Try to include an executive summary of your recommendations at the beginning.
    • Check that all the cells run without error.

    ⌛️ Time is ticking. Good luck!

    📑 Executive Summary

    This report analyses Christmas movies by release year, genre, production budget, and gross revenue. It uses a correlation matrix to identify relationships between variables and draws on two supplementary datasets: the movie production budgets and the 1000 high-rated movies. The main findings are summarized below:

    • Genres: Comedy, Drama and Romance were the most common genres. Family, in combination with other genres, was among the genres with the highest gross, so 'The Magic of Bellmonte Lane' would have a better chance of earning a large gross. Some genres, such as Action and Adventure, tended to gross more than others.
    • Production budget: The scatter graph of production budget against gross for Christmas movies showed a positive correlation between the two.
    • Release year: I looked briefly at release year. There was no visible linear relationship in the plot, although the correlation coefficient was 0.37, which is probably an anomaly.
    • Predictive variables: If I were to build a predictive machine learning model, the variables I would choose (based on the correlations and the graphs below) are genre, production budget, release year, directors and stars. Although I did not get to directors and stars, I would have explored them in detail, because some Christmas movies share directors or stars with the highest-rated movies in the world.

    In my opinion, based on the scatter graph with the regplot (whose line of best fit acts as the prediction), the movie would earn a gross of around $50 million. If I had more time (and knowledge), I would also have looked at the type of Christmas movie, and I would definitely have tried to build a predictive model.
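Reading a prediction off a line of best fit can also be done numerically. The sketch below fits a straight line to budget and gross with `numpy.polyfit` and evaluates it at a $25M budget. The numbers are made up purely to show the mechanics, so the result says nothing about the real Christmas movies data or the estimate above.

```python
import numpy as np

# Made-up budget/gross pairs in millions of US dollars (illustration only)
budgets_m = np.array([10.0, 20.0, 40.0, 60.0])
gross_m = np.array([20.0, 35.0, 65.0, 95.0])

# Degree-1 polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(budgets_m, gross_m, 1)

# Evaluate the fitted line at the fictitious movie's production budget
new_budget_m = 25.0
predicted_m = slope * new_budget_m + intercept
print(f"Predicted gross at ${new_budget_m:.0f}M budget: ${predicted_m:.1f}M")
```

A single-variable line like this ignores genre, cast and release year, so it is best read as a rough baseline.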

    🔍 Analysing the data

    Whenever I use data I am not familiar with, I always check its structure, missing values, etc. This is what we will do with the Christmas movies and the 1000 high-rated movies tables in this part. Below, I check the data type of each column and the dimensionality of xmas_movies.

    Note:

    • I will only be doing the exploratory data analysis part because I am a beginner in Python and have not learnt data science yet. I will still try to complete this competition.
    • When I write xmas_movies I am referring to the Christmas movies table, and when I write something along the lines of high rated movies I am referring to the 1000 high-rated movies table.
    # Import necessary libraries for project
    import numpy as np
    import pandas as pd
    from os import path
    from PIL import Image
    from wordcloud import WordCloud, STOPWORDS
    import warnings
    import matplotlib.pyplot as plt
    # Find out what the data types of our columns are
    print(xmas_movies.dtypes)
    print()
    # Find out the dimensionality
    print(xmas_movies.shape)

    Most of our columns are objects (the object dtype can hold any Python value, but here it usually means strings) and the rest are floats (float64, i.e. 64-bit floating-point numbers). I use the shape attribute of the DataFrame, which returns a tuple containing the number of rows and columns.

    Thankfully, the description at the start of the project matches our results: 788 rows of movies and 11 columns. In the cell below we check each column for missing values.