Predicting Christmas Movie Grossings
Executive Summary
The datasets cover a century of Christmas movies, the 1,000 top-rated movies, and movie production budgets. We examined them to work out what makes a movie successful. Here is a summary of our approach and findings:
- Pre-processing of the Data Sets: The production budget dataset was merged with both the xmas_movies and top1k_movies datasets. The first challenge was the large number of null values, especially in the gross and budget columns. We used a predictive model to fill them in based on the other available features.
- Exploratory Analysis: We analysed the data by the given features, such as genres, movie stars, directors and grossings. Plots and charts are included for better visualization.
- Extraction of New Ideas: We added new features to the data, such as how diverse a movie's genres are, how long its title is, and how popular its director is. These help us understand what makes a movie do well.
- Words Matter: We used word clouds to visualize how often words appear in the movie titles and descriptions. This let us gauge how distinctive the upcoming Christmas movie is.
- Model Development and Evaluation: After examining the relevant features of past movies, we compared them with the features of the upcoming movie to form an initial guess about its success. We then trained a random forest regressor on the available features to predict the movie's gross. We evaluated the model by plotting actual against predicted values and used the mean squared error to quantify the prediction error.
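The engineered features mentioned in the summary can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the toy rows, the helper name `add_simple_features`, the new column names, and the assumption that `genre` is a comma-separated string are all hypothetical.

```python
import pandas as pd

# Toy rows standing in for the real data; the values are illustrative only.
movies = pd.DataFrame({
    "title": ["A Snowy Tale", "Holiday Rush"],
    "genre": ["Comedy, Family, Fantasy", "Drama"],
    "director": ["Director A", "Director B"],
})
top1k = pd.DataFrame({"director": ["Director A", "Director A", "Director C"]})


def add_simple_features(df: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three hypothetical engineered features."""
    out = df.copy()
    # Genre diversity: number of comma-separated genres listed for the movie
    # (assumes genres are stored as one comma-separated string).
    out["genre_count"] = out["genre"].fillna("").apply(
        lambda s: len([g for g in s.split(",") if g.strip()])
    )
    # Title length in characters.
    out["title_length"] = out["title"].fillna("").str.len()
    # Director popularity: how often the director appears in the reference list.
    counts = reference["director"].value_counts()
    out["director_top1k_count"] = out["director"].map(counts).fillna(0).astype(int)
    return out


features = add_simple_features(movies, top1k)
print(features[["genre_count", "title_length", "director_top1k_count"]])
```

Counting appearances in the top 1000 list is only one possible proxy for director popularity; average past gross would be another.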
The data
We're providing you with a dataset of 788 Christmas movies, with the following columns:
christmas_movies.csv
| Variable | Description |
|---|---|
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |
You may also use an additional dataset of 1000 high-rated movies, with the following columns:
imdb_top1k.csv
| Variable | Description |
|---|---|
| title | the title of the movie |
| release_year | year the movie was released |
| description | short description of the movie |
| type | the type of production e.g. Movie, TV Episode |
| rating | the rating/certificate e.g. PG |
| runtime | the movie runtime in minutes |
| imdb_rating | the IMDB rating |
| genre | list of genres e.g. Comedy, Drama etc. |
| director | the director of the movie |
| stars | list of actors in the movie |
| gross | the domestic gross of the movie in US dollars (what we want to predict) |
Finally you have access to a dataset of movie production budgets for over 6,000 movies, with the following columns:
movie_budgets.csv
| Variable | Meaning |
|---|---|
| year | year the movie was released |
| date | date the movie was released |
| title | title of the movie |
| production budget | production budget in US dollars |
Note: while you may augment the Christmas movies with the general movie data, the model should be developed to predict the domestic gross of Christmas movies only.
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from wordcloud import WordCloud, STOPWORDS

xmas_movies = pd.read_csv('data/christmas_movies.csv')
xmas_movies

top1k_movies = pd.read_csv('data/imdb_top1k.csv')
top1k_movies

movie_budgets = pd.read_csv('data/movie_budgets.csv')
movie_budgets

Competition challenge
Create a report that covers the following:
- Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
  - Analysis of the genres
  - Descriptive statistics and histograms of the grossings
  - Word clouds
- Develop a model to predict the movie's domestic gross based on the available features.
  - Remember to preprocess and clean the data first.
  - Think about what features you could define (feature engineering), e.g.:
    - number of times a director appeared in the top 1000 movies list,
    - highest grossing for lead actor(s),
    - decade released
- Evaluate your model using appropriate metrics.
- Explain some of the limitations of the models you have developed. What other data might help improve the model?
- Use your model to predict the grossing of the following fictitious Christmas movie:
Title: The Magic of Bellmonte Lane
Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.
Director: Greta Gerwig
Cast:
- Emma Thompson as Emily, a kind-hearted and curious woman
- Ian McKellen as Mr. Grayson, the stern corporate developer
- Tom Hanks as George, the wise and elderly owner of the local cafe
- Zoe Saldana as Sarah, Emily's supportive best friend
- Jacob Tremblay as Timmy, a young boy with a special Christmas wish
Runtime: 105 minutes
Genres: Family, Fantasy, Romance, Holiday
Production budget: $25M
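To feed the fictitious movie into a model later, its details can be held in a single-row DataFrame. The sketch below assumes the column names from christmas_movies.csv; `production_budget` is a hypothetical column name, and storing the cast and genres as comma-separated strings is an assumption about the data layout.

```python
import pandas as pd

# Details copied from the challenge brief above.
# "production_budget" is a hypothetical column name for illustration.
new_movie = pd.DataFrame([{
    "title": "The Magic of Bellmonte Lane",
    "director": "Greta Gerwig",
    "stars": "Emma Thompson, Ian McKellen, Tom Hanks, Zoe Saldana, Jacob Tremblay",
    "runtime": 105,
    "genre": "Family, Fantasy, Romance, Holiday",
    "production_budget": 25_000_000,
}])

# Transpose for a readable one-movie view.
print(new_movie.T)
```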
Dataset Overview
Brief information about the data sets.
def info_df(df):
    # Summarise dtype, null counts and unique values for each column of df.
    data = []
    for column in df.columns:
        data.append({
            'Column_Name': column,
            'Data_Type': df[column].dtype,
            'Non-Null_Count': df[column].count(),
            'Null_Count': df[column].isna().sum(),
            'Percentage_NA': df[column].isna().mean() * 100,
            'Unique_Values_Count': df[column].nunique(),
        })
    result_df = pd.DataFrame(data)
    return result_df

display(info_df(xmas_movies))
display(info_df(top1k_movies))
display(info_df(movie_budgets))

Some numeric columns, such as gross in both the xmas_movies and top1k_movies datasets, are stored as object type. We need to clean these columns and convert them to float. Each dataset also has a large proportion of missing values that must be handled. However, I will start by joining the movie_budgets dataset with the xmas_movies and top1k_movies datasets, because production budget is likely to be an important feature.
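The cleaning and joining steps described here might look roughly like the sketch below. It is an illustration under stated assumptions, not the code used in this report: the helper `parse_money` is hypothetical, the string formats it handles ('$173.40M' and '28,341,469') are guesses about how gross is stored, and the toy rows are made up. The join keys (title plus release year) follow the column tables above.

```python
import numpy as np
import pandas as pd


def parse_money(value):
    """Convert strings such as '$173.40M' or '28,341,469' to dollars as float.

    The accepted formats are assumptions; adjust them to the actual data.
    """
    if pd.isna(value):
        return np.nan
    text = str(value).strip().replace("$", "").replace(",", "")
    multiplier = 1.0
    if text.upper().endswith("M"):
        multiplier = 1e6
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        # Anything unparseable is treated as missing.
        return np.nan


# Toy stand-ins for xmas_movies and movie_budgets (illustrative values only).
xmas = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_year": [2003, 2010, 2015],
    "gross": ["$173.40M", "28,341,469", None],
})
budgets = pd.DataFrame({
    "title": ["Movie A", "Movie C"],
    "year": [2003, 2015],
    "production budget": [33_000_000, 5_000_000],
})

# Step 1: turn the object-typed gross column into floats.
xmas["gross"] = xmas["gross"].apply(parse_money)

# Step 2: left join so every Christmas movie is kept, with or without a budget.
merged = xmas.merge(
    budgets[["title", "year", "production budget"]],
    how="left",
    left_on=["title", "release_year"],
    right_on=["title", "year"],
).drop(columns="year")
print(merged)
```

Joining on title and year together, rather than title alone, reduces false matches between remakes that share a name. Titles that differ in punctuation or casing would still fail to match and may need normalising first.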