
Project:

The purpose of this project is to predict the expected grossing of our Christmas movie titled "The Magic of Bellmonte Lane".

After cleaning and validating the data, EDA and predictive modelling produced the insights below about the expected gross for the movie. An executive summary of the findings and recommendations comes first, followed by the process that led to them.

Findings:

1, Our predicted grossing for "The Magic of Bellmonte Lane" is $79,638,336.

2, Greta Gerwig is a good choice for director, based on the data we have.

3, Outside of Christmas movies, the average grossing for movies with a 25-million-dollar budget is $81,801,123.

4, Our predicted grossing of $79,638,336 is a decent figure, but we can always look to make more. I recommend we flavor the trailer of "The Magic of Bellmonte Lane" with a lot of romance and drama.

However, our prediction might change if we're able to build a more accurate model. To achieve this, we will need better features in the original Christmas movie dataset, e.g., a column for budgets of Christmas movies, a complete column with no (or very few) missing values for grossings of Christmas movies, and the type of rating our movie will have.

THE PROCESS ↓

Data cleaning, validation and EDA

# Let's import the necessary libraries to read, clean and visualize the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Next, let's read the main data and see what it looks like, dropping duplicated rows and empty columns along the way

xmas = pd.read_csv('data/christmas_movies.csv').drop_duplicates().dropna(how = 'all', axis = 1)
display(xmas.head(), xmas.tail())

# also, let's see the data types of the columns in the dataframe, while also checking the columns with missing values

# info() prints its summary directly and returns None, so it doesn't need to be wrapped in print()
xmas.info()
# Data cleaning.
# Let's strip the dollar sign ($) and the letter M off the figures in the "gross" column, then multiply by 1,000,000 to convert the values from millions to whole dollars, rounding off the decimals
# regex=False makes it explicit that "$" is a literal character and not a regex anchor

xmas['gross'] = xmas['gross'].str.replace("$", "", regex=False).str.replace("M", "", regex=False).astype('float') * 1000000
xmas['gross'] = xmas['gross'].round(2)

# Let's now drop rows in the "gross" column that have null values, since we won't need those null values while training our model. We don't want to replace these null values in the column with the median value, so as not to have a false sense of grossing figures.

xmas = xmas.dropna(subset = 'gross')

# In the info() output above, we see that the "release_year" column is a float type. We're now converting the column to a datetime type and extracting the year.

xmas['release_year'] = pd.to_datetime(xmas['release_year'], format='%Y').dt.year

# a quick check to see what the "release_year" column data type is now
print ("release_year data type is", xmas['release_year'].dtypes, ", so it's now an integer.")

# and finally, let's see what our "xmas" dataframe looks like at the moment

display(xmas.head())
# Checking columns, their data types and missing values for each column
xmas.info()
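
Before moving on, the "gross" conversion above can be sanity-checked on a few hand-made values. This is only a minimal sketch: the sample strings are made up to resemble the "$...M" format and are not taken from the dataset.

```python
import pandas as pd

# Made-up values in the same "$12.3M" style as the "gross" column, plus one missing entry
sample = pd.Series(["$1.5M", "$173.40M", float("nan")])

# regex=False treats "$" and "M" as literal characters
cleaned = (
    sample.str.replace("$", "", regex=False)
          .str.replace("M", "", regex=False)
          .astype("float") * 1_000_000
).round(2)

# Missing entries stay missing, so they can still be dropped with dropna() afterwards
print(cleaned)
```

If the real column contained other suffixes (for example "K" or "B"), this approach would raise on the float conversion, which is a useful signal that the format assumption needs revisiting.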

Data validation.

After cleaning the data, I validated the data by checking each column individually to ensure the following:

'title' column, contains the title of each movie. The data type is 'object' and has no missing values.

'release_year' column, contains the year the movie was released. The data type is integer and has no missing values.

'description' column, a short description of the movie. The data type is 'object' and has no missing values.

'type' column, the type of production. The data type is 'object' and has no missing values.

'rating' column, the rating/certificate e.g. PG. The data type is 'object' and has no missing values.

'runtime' column, the movie runtime in minutes. The data type is 'float' and has no missing values.

'imdb_rating' column, the IMDB rating. The data type is 'object' and has no missing values.

'genre' column, the list of genres e.g. Comedy, Drama etc. The data type is 'object' and has no missing values.

'director' column, the director of the movie. The data type is 'object' and has no missing values.

'stars' column, a list of actors in the movie. The data type is 'object' and has no missing values.

'gross' column, the domestic gross of the movie in US dollars. The data type is 'float' and has no missing values.

All columns are as they should be.
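
The column-by-column checks above can also be done in one pass. The sketch below uses a hypothetical helper, `summarize_columns`, and a toy frame with made-up rows; only some of the column names mirror the real dataset.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and its count of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })

# Toy frame with made-up rows, standing in for the cleaned "xmas" dataframe
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "release_year": [2003, 2018],
    "runtime": [97.0, 105.0],
    "gross": [1_500_000.0, 173_400_000.0],
})

report = summarize_columns(toy)
print(report)

# Fail loudly if any column still contains missing values
assert report["missing"].sum() == 0
```

Running the same helper on the cleaned dataframe would give a single table to compare against the expectations listed above.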

Let's now check the next dataset and see how we can make use of it. First step is to view the dataset, clean it and validate the data.

top1000 = pd.read_csv("data/imdb_top1k.csv")
movie_budgets = pd.read_csv("data/movie_budgets.csv")
# How many null values do we have in the top1000 dataset, and what are the data types?

top1000.info()

# let's also visualize what the dataframe looks like

display(top1000.head())
# Data cleaning.
# Starting with our point of interest, let's remove the commas in the "Gross" column of top1000 and convert the data type to float for easier computation. We will also drop the rows that have null values in this column, so that they don't affect our model's predictions

top1000['Gross'] = top1000['Gross'].str.replace("," , "").astype('float')
top1000 = top1000.dropna(subset = 'Gross')


# the "Released_Year" column is an "object", let's change it to an integer

top1000['Released_Year'] = pd.to_numeric(top1000['Released_Year'], errors='coerce').astype('Int64')

# let's convert the "Runtime" column from object to float, after removing the "min" suffix from each value

top1000['Runtime'] = top1000['Runtime'].str.replace("min", "").astype('float')

# let's now see what our dataframe looks like so far

display(top1000.head())

# and let's see the data types of each column and how many null values we have
top1000.info()
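
One thing worth checking after the "Released_Year" conversion is whether `errors='coerce'` silently turned any entries into missing values. The sketch below shows the behaviour on made-up values; the stray "PG" entry is purely illustrative.

```python
import pandas as pd

# Made-up values: two valid years and one stray non-numeric entry
years = pd.Series(["1994", "2008", "PG"])

# errors="coerce" turns unparseable entries into missing values instead of raising;
# the nullable "Int64" dtype keeps the column integer-typed despite the gap
parsed = pd.to_numeric(years, errors="coerce").astype("Int64")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```

Counting the missing values before and after the conversion on the real column would show how many rows, if any, were affected.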

We have a third dataset, "movie_budgets", which is the only one that contains movie budgets. We will try to use it by merging it with the top1000 dataset so that we can see the budgets of movies ranked in the top 1000.

# the "Series_Title" column in the top1000 dataset is an object type but the "Released_Year" column is an integer.

top1000[['Series_Title', 'Released_Year']].dtypes
# the "title" column in the movie_budgets dataset is also of the "object" type, but the "year" column is a float type

movie_budgets[['title', 'year']].dtypes
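
Since the year columns have different types, they need to be aligned before the two datasets can be joined on title and year. Here is a minimal sketch of one way to do that on toy data. All rows are made up, and the "budget" column name is an assumption about what the movie_budgets file contains.

```python
import pandas as pd

# Toy stand-ins for the two datasets; rows and the "budget" column name are hypothetical
top_toy = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B"],
    "Released_Year": pd.array([1994, 2008], dtype="Int64"),
    "Gross": [28_000_000.0, 534_000_000.0],
})
budgets_toy = pd.DataFrame({
    "title": ["Movie A", "Movie C"],
    "year": [1994.0, 2001.0],
    "budget": [25_000_000.0, 10_000_000.0],
})

# Align the key dtypes first: cast the float "year" to the same nullable integer type
budgets_toy["year"] = budgets_toy["year"].astype("Int64")

# Left join keeps every top-1000 row; indicator=True records whether a budget was found
merged = top_toy.merge(
    budgets_toy,
    left_on=["Series_Title", "Released_Year"],
    right_on=["title", "year"],
    how="left",
    indicator=True,
)

print(merged[["Series_Title", "Released_Year", "budget", "_merge"]])
```

The "_merge" column makes it easy to count how many top-1000 movies found a matching budget. An inner join would be the alternative if only matched rows are needed.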