Predicting Christmas Movie Grossings

📖 Executive Summary

The data sets cover a century of Christmas movies, the one thousand top-rated movies, and movie production budgets. We examined them to find out what makes a movie successful. Here is a summary of our approach and findings:

  1. Pre-processing of the Data Sets: The data set containing production budgets was merged with both the xmas_movies and top1k_movies data sets. The first challenge was handling the large number of null values, especially in the gross and budget columns. We used a predictive model to fill them in based on the other available features.

  2. Exploratory Analysis: We analysed the data by features such as genres, movie stars, directors and grossings. Plots and charts are included to make the patterns easier to see.

  3. Extraction of New Ideas: We added some new features to our data, like how diverse the movie's genres are, how long a movie's title is or how popular the director is. This helps us understand what makes a movie do well.

  4. Words Matter: We used word clouds to visualize how often words appear in the movie titles and descriptions. This let us gauge how distinctive the upcoming Christmas movie is.

  5. Model Development and Evaluation: After examining the relevant features of the previous movies, we compared them with the features of the upcoming movie. This gave a first guess about how successful it might be. Finally, we used a random forest regressor to predict the movie's gross from the available features. We evaluated the model by plotting actual against predicted values, and we used the Mean Squared Error to measure how good the predictions are.
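The modelling step in point 5 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the feature names (`budget`, `runtime`, `imdb_rating`), the made-up target and the hyperparameters are all assumptions chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: columns and values are illustrative assumptions,
# not the features or results of the real analysis.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "budget": rng.uniform(1e6, 1e8, n),
    "runtime": rng.integers(80, 150, n),
    "imdb_rating": rng.uniform(4.0, 9.0, n),
})
# Hypothetical target: gross loosely tied to budget, plus noise.
gross = 2.0 * features["budget"] + rng.normal(0, 1e7, n)

# Hold out 20% of the rows to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    features, gross, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# MSE is in squared dollars; its square root (RMSE) is back in dollars.
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.3e}  RMSE: {rmse:.3e}")
```

Because the Mean Squared Error is expressed in squared dollars, reporting its square root alongside it tends to be easier to interpret when the target is a gross figure.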

💾 The data

We're providing you with a dataset of 788 Christmas movies, with the following columns:

  • christmas_movies.csv
Variable     | Description
title        | the title of the movie
release_year | year the movie was released
description  | short description of the movie
type         | the type of production e.g. Movie, TV Episode
rating       | the rating/certificate e.g. PG
runtime      | the movie runtime in minutes
imdb_rating  | the IMDB rating
genre        | list of genres e.g. Comedy, Drama etc.
director     | the director of the movie
stars        | list of actors in the movie
gross        | the domestic gross of the movie in US dollars (what we want to predict)

You may also use an additional dataset of 1000 high-rated movies, with the following columns:

  • imdb_top1k.csv
Variable     | Description
title        | the title of the movie
release_year | year the movie was released
description  | short description of the movie
type         | the type of production e.g. Movie, TV Episode
rating       | the rating/certificate e.g. PG
runtime      | the movie runtime in minutes
imdb_rating  | the IMDB rating
genre        | list of genres e.g. Comedy, Drama etc.
director     | the director of the movie
stars        | list of actors in the movie
gross        | the domestic gross of the movie in US dollars (what we want to predict)

Finally, you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:

  • movie_budgets.csv
Variable          | Meaning
year              | year the movie was released
date              | date the movie was released
title             | title of the movie
production budget | production budget in US dollars

Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the gross of Christmas movies only.

# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from wordcloud import WordCloud, STOPWORDS
xmas_movies = pd.read_csv('data/christmas_movies.csv')
xmas_movies
top1k_movies = pd.read_csv('data/imdb_top1k.csv')
top1k_movies
movie_budgets = pd.read_csv('data/movie_budgets.csv')
movie_budgets

💪 Competition challenge

Create a report that covers the following:

  1. Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:

    • Analysis of the genres
    • Descriptive statistics and histograms of the grossings
    • Word clouds
  2. Develop a model to predict the movie's domestic gross based on the available features.

    • Remember to preprocess and clean the data first.
    • Think about what features you could define (feature engineering), e.g.:
      • number of times a director appeared in the top 1000 movies list,
      • highest grossing for lead actor(s),
      • decade released
  3. Evaluate your model using appropriate metrics.

  4. Explain some of the limitations of the models you have developed. What other data might help improve the model?

  5. Use your model to predict the grossing of the following fictitious Christmas movie:

Title: The Magic of Bellmonte Lane

Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.

Director: Greta Gerwig

Cast:

  • Emma Thompson as Emily, a kind-hearted and curious woman
  • Ian McKellen as Mr. Grayson, the stern corporate developer
  • Tom Hanks as George, the wise and elderly owner of the local cafe
  • Zoe Saldana as Sarah, Emily's supportive best friend
  • Jacob Tremblay as Timmy, a young boy with a special Christmas wish

Runtime: 105 minutes

Genres: Family, Fantasy, Romance, Holiday

Production budget: $25M
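One way to turn the brief above into model input is to encode the fictitious movie as a one-row table of numeric features. The sketch below is only an illustration: the helper `engineer_features`, the feature names and the tiny `top1k_stub` table are hypothetical stand-ins (in the notebook the real top-1000 data would be used), while the movie details themselves are copied from the brief.

```python
import pandas as pd

# Details copied from the competition brief.
new_movie = {
    "title": "The Magic of Bellmonte Lane",
    "director": "Greta Gerwig",
    "stars": ["Emma Thompson", "Ian McKellen", "Tom Hanks",
              "Zoe Saldana", "Jacob Tremblay"],
    "runtime": 105,
    "genre": "Family, Fantasy, Romance, Holiday",
    "budget": 25_000_000,
}

# Made-up stand-in for the top-1000 table so the sketch runs on its own.
top1k_stub = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "director": ["Director X", "Director Y", "Director X"],
})

def engineer_features(movie, top1k):
    """Hypothetical helper: one movie dict -> one-row frame of numeric features."""
    genres = [g.strip() for g in movie["genre"].split(",")]
    director_counts = top1k["director"].value_counts()
    return pd.DataFrame([{
        "runtime": movie["runtime"],
        "budget": movie["budget"],
        "n_genres": len(genres),             # genre diversity
        "title_length": len(movie["title"]), # characters in the title
        "n_stars": len(movie["stars"]),
        # How often the director appears in the (stub) top-1000 list.
        "director_top1k_count": int(director_counts.get(movie["director"], 0)),
    }])

new_movie_features = engineer_features(new_movie, top1k_stub)
print(new_movie_features.T)
```

The same helper would need to be applied to the training rows so that the model sees identical columns at fit time and at prediction time.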

🕵️‍♂️ Dataset Overview

Brief information about the data sets.

def info_df(df):
    """Summarise each column: dtype, null counts and number of unique values."""
    data = []
    for column in df.columns:
        data.append({
            'Column_Name': column,
            'Data_Type': df[column].dtype,
            'Non-Null_Count': df[column].count(),
            'Null_Count': df[column].isna().sum(),
            'Percentage_NA': df[column].isna().mean() * 100,
            'Unique_Values_Count': df[column].nunique(),
        })
    return pd.DataFrame(data)

display(info_df(xmas_movies))
display(info_df(top1k_movies))
display(info_df(movie_budgets))

Some numeric columns, such as gross in both the xmas_movies and top1k_movies datasets, are stored as object type. We need to clean these columns and convert them to float. Each dataset also has a large proportion of missing values that must be handled. However, I will start by joining the movie_budgets dataset with the xmas_movies and top1k_movies datasets, because production budget is likely to be an important feature.
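The cleaning and joining described here might look something like the sketch below. It runs on small made-up frames, and the string format of `gross` (a leading `$`, commas, an optional `M` suffix) is an assumption rather than something confirmed by the data. The helper `parse_gross` and the renamed `budget` column are likewise hypothetical.

```python
import pandas as pd

# Made-up rows: the real string format of `gross` is an assumption here.
xmas_stub = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [1990, 2003, 2015],
    "gross": ["$285.76M", "$1.20M", None],
})
budgets_stub = pd.DataFrame({
    "year": [1990, 2003],
    "title": ["Movie A", "Movie B"],
    "production budget": [18_000_000, 33_000_000],
})

def parse_gross(value):
    """Convert strings such as '$285.76M' or '1,234,567' to float dollars."""
    if pd.isna(value):
        return float("nan")
    text = str(value).replace("$", "").replace(",", "").strip()
    multiplier = 1.0
    if text.endswith("M"):
        multiplier, text = 1e6, text[:-1]
    return float(text) * multiplier

xmas_stub["gross"] = xmas_stub["gross"].apply(parse_gross)

# Align column names, then left-join so every Christmas movie is kept
# and movies without a matching budget get NaN.
merged = xmas_stub.merge(
    budgets_stub.rename(columns={"year": "release_year",
                                 "production budget": "budget"}),
    on=["title", "release_year"],
    how="left",
)
print(merged)
```

Joining on both title and year, rather than title alone, reduces the chance of matching a remake or an unrelated film that happens to share a name.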
