💪 Competition challenge
Create a report that covers the following:
- Exploratory data analysis of the dataset with informative plots. It's up to you what to include here! Some ideas could include:
  - Analysis of the genres
  - Descriptive statistics and histograms of the grossings
  - Word clouds
- Develop a model to predict the movie's domestic gross based on the available features.
  - Remember to preprocess and clean the data first.
  - Think about what features you could define (feature engineering), e.g.:
    - number of times a director appeared in the top 1000 movies list,
    - highest grossing for lead actor(s),
    - decade released
- Evaluate your model using appropriate metrics.
- Explain some of the limitations of the models you have developed. What other data might help improve the model?
- Use your model to predict the grossing of the following fictitious Christmas movie:
Title: The Magic of Bellmonte Lane
Description: "The Magic of Bellmonte Lane" is a heartwarming tale set in the charming town of Bellmonte, where Christmas isn't just a holiday, but a season of magic. The story follows Emily, who inherits her grandmother's mystical bookshop. There, she discovers an enchanted book that grants Christmas wishes. As Emily helps the townspeople, she fights to save the shop from a corporate developer, rediscovering the true spirit of Christmas along the way. This family-friendly film blends romance, fantasy, and holiday cheer in a story about community, hope, and magic.
Director: Greta Gerwig
Cast:
- Emma Thompson as Emily, a kind-hearted and curious woman
- Ian McKellen as Mr. Grayson, the stern corporate developer
- Tom Hanks as George, the wise and elderly owner of the local cafe
- Zoe Saldana as Sarah, Emily's supportive best friend
- Jacob Tremblay as Timmy, a young boy with a special Christmas wish
Runtime: 105 minutes
Genres: Family, Fantasy, Romance, Holiday
Production budget: $25M
1. Exploratory data analysis of the dataset with informative plots
import math
import re
from collections import Counter
from typing import Set, Dict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud
from PIL import Image
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from pycaret.regression import *
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
Input
xmas_movies = pd.read_csv('data/christmas_movies.csv')
movie_budgets = pd.read_csv('data/movie_budgets.csv')
xmas_movies.head(3)
1.1 Analysis of the genres in Christmas movies
The Christmas movies in the dataset span 26 unique genres. This range shows that holiday films are not confined to a single cinematic style and can appeal to a wide variety of audience preferences.
The genre counts show a distinct preference for heartwarming and light-hearted themes during the holiday season. Because a movie can carry several genre labels, the counts overlap. Comedy is the most common genre with 452 occurrences, followed closely by Drama at 414 and Romance at 385, both of which reflect the seasonal lean towards emotive stories about love and family. The Family genre, with 282 occurrences, is also well represented, suggesting demand for films that all ages can watch together. Fantasy and Adventure offer escapism, with counts of 91 and 47 respectively, while Sci-Fi, Western, and War each appear fewer than 10 times, indicating that more intense or niche genres are rare among Christmas movies.
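The genre counts above can be reproduced along these lines. This is a minimal sketch, assuming the `genre` column holds comma-separated labels; `sample_genres` and `count_genres` are hypothetical names, and the sample values are made up for illustration.

```python
from collections import Counter

# Hypothetical sample standing in for xmas_movies['genre'] (not real data).
sample_genres = [
    "Comedy, Family, Fantasy",
    "Drama, Romance",
    "Comedy, Romance",
    None,  # missing genre, skipped below
]

def count_genres(genre_strings):
    """Count every genre label across comma-separated genre strings."""
    counts = Counter()
    for entry in genre_strings:
        if not isinstance(entry, str):
            continue  # skip None / NaN entries
        counts.update(label.strip() for label in entry.split(",") if label.strip())
    return counts

genre_counts = count_genres(sample_genres)
print(len(genre_counts))            # number of unique genres
print(genre_counts.most_common(3))  # most frequent genres first
```

On the real data, the same function could be called as `count_genres(xmas_movies['genre'])`, since iterating a pandas Series yields its values.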
# Keep only rows with a genre; copy so the new column is not set on a slice
xmas_movies = xmas_movies[xmas_movies['genre'].notna()].copy()
# Use the first listed genre as the movie's main genre
xmas_movies['main_genre'] = xmas_movies['genre'].str.split(',').str[0].str.strip()
# Average gross per main genre
genre_gross_mean = xmas_movies.groupby('main_genre')['gross'].mean().reset_index()
# 'AVG' is the unweighted mean of the per-genre averages, added as a reference bar
average_gross = genre_gross_mean['gross'].mean()
average_row = pd.DataFrame({'main_genre': ['AVG'], 'gross': [average_gross]})
genre_gross_mean = pd.concat([genre_gross_mean, average_row], ignore_index=True)
# Plot the nine highest bars and highlight the 'AVG' reference bar
top_genres = genre_gross_mean.sort_values(by='gross', ascending=False).head(9)
colors = ['chartreuse' if genre == 'AVG' else '#440154' for genre in top_genres['main_genre']]
fig_gross = go.Figure(data=[go.Bar(x=top_genres['main_genre'], y=top_genres['gross'], marker_color=colors, text=top_genres['gross'].round(0))])
fig_gross.update_layout(title='Top 9 Average Gross by Genre', xaxis_title='Main Genre', yaxis_title='Average Gross')
fig_gross.show()
The 'Action' genre leads by a significant margin, indicating that action-led Christmas movies earn the most per film on average. 'Animation' and 'Adventure' also perform well, likely due to their family-friendly appeal, which aligns with holiday viewing habits. The bar labelled 'AVG' is the unweighted mean of the per-genre averages. It ranks among the top performers, which mainly reflects how strongly the highest-grossing genres pull that mean up rather than uniformly strong performance across genres. 'Drama' and 'Comedy' maintain a solid presence, reflecting their traditional appeal, whereas 'Biography' and 'Horror' have lower average gross figures, suggesting a more niche audience. Note that each movie is assigned only its first listed genre here, and a genre with few movies can have its average driven by a single hit. With those caveats, the chart points to the highest average grosses going to genres that offer escapism and fit the spirited atmosphere of the season.
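To check how much data each average rests on, it can help to look at the number of movies per main genre alongside the mean. The sketch below uses made-up toy numbers (the `toy` frame and its values are hypothetical, not from the dataset) and also contrasts the unweighted mean of per-genre averages with the mean over all movies.

```python
import pandas as pd

# Hypothetical toy data (not from the dataset), using the same column names.
toy = pd.DataFrame({
    "main_genre": ["Action", "Comedy", "Comedy", "Comedy", "Drama", "Drama"],
    "gross": [300.0, 40.0, 60.0, 50.0, 20.0, 40.0],
})

# Mean, median and movie count per main genre, highest mean first.
summary = (
    toy.groupby("main_genre")["gross"]
    .agg(mean_gross="mean", median_gross="median", n_movies="count")
    .sort_values("mean_gross", ascending=False)
)

unweighted_avg = summary["mean_gross"].mean()  # mean of the per-genre means
overall_avg = toy["gross"].mean()              # mean over all movies

print(summary)
print(unweighted_avg, overall_avg)
```

In this toy example a single Action movie sets the top bar, and the unweighted average (about 126.7) sits well above the all-movies mean (85), which illustrates why the movie count per genre is worth reporting next to the averages.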