Project: Predicting Movie Rental Durations

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you to try out some regression models for this. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
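To make the target concrete, here is a minimal sketch of how the MSE criterion works. The numbers are made up for illustration and do not come from rental_info.csv.

```python
import numpy as np

# Hypothetical rental durations (days) and model predictions
y_true = np.array([3, 5, 2, 7])
y_hat = np.array([4, 5, 1, 5])

# MSE is the mean of the squared prediction errors
mse_value = np.mean((y_true - y_hat) ** 2)

# The brief asks for an MSE of 3 or less on the test set
meets_target = mse_value <= 3
print(mse_value, meets_target)
```

Because the errors are squared, one prediction that is off by several days costs far more than several that are off by one day.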

The data they provided is in the CSV file rental_info.csv. It has the following features:

"rental_date": The date (and time) the customer rents the DVD.
"return_date": The date (and time) the customer returns the DVD.
"amount": The amount paid by the customer for renting the DVD.
"amount_2": The square of "amount".
"rental_rate": The rate at which the DVD is rented.
"rental_rate_2": The square of "rental_rate".
"release_year": The year the movie being rented was released.
"length": Length of the movie being rented, in minutes.
"length_2": The square of "length".
"replacement_cost": The amount it will cost the company to replace the DVD.
"special_features": Any special features, for example trailers or deleted scenes, that the DVD also has.
"NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

# Load the data and inspect column types and missing values
df = pd.read_csv('rental_info.csv')
df.info()

print(df.isna().sum())
print(df['rental_date'].head())
print(df['return_date'].head())

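# Parse both timestamp columns so they can be subtracted
df['rental_date'] = pd.to_datetime(df['rental_date'])
df['return_date'] = pd.to_datetime(df['return_date'])

# Target: rental duration in whole days (.dt.days drops any partial day)
df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.days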
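The subtraction above yields a timedelta column, and `.dt.days` keeps only its whole-day component. A minimal sketch on toy data shows the difference from a fractional-day duration. The timestamps are invented for illustration, and the fractional variant is an alternative, not something the project requires.

```python
import pandas as pd

# Hypothetical timestamps, not taken from rental_info.csv
toy = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33", "2005-06-15 23:19:16"]),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33", "2005-06-18 19:24:16"]),
})

delta = toy["return_date"] - toy["rental_date"]

# Whole days only: 3 days 20:46:00 becomes 3
whole_days = delta.dt.days

# Fractional days keep the partial day (86400 seconds per day)
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```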

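df['rental_length_days'].head(20)
# .dt.days already returns whole days; make the integer dtype explicit
# (passing the day counts through pd.to_datetime is not needed)
df['rental_length_days'] = df['rental_length_days'].astype('int64')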


# Binary flags for two of the values found in the special_features column
df["deleted_scenes"] = np.where(df["special_features"].str.contains("Deleted Scenes"), 1, 0)
df["behind_the_scenes"] = np.where(df["special_features"].str.contains("Behind the Scenes"), 1, 0)
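The two flag columns above rely on substring matching. The sketch below shows the same idea on toy data, using `regex=False` for a literal match and `na=False` so that a missing value becomes 0. The exact string format of "special_features" is an assumption here, since the raw values are not shown.

```python
import pandas as pd

# Hypothetical values; the real format of special_features may differ
toy = pd.DataFrame({
    "special_features": [
        '{Trailers,"Deleted Scenes"}',
        '{"Behind the Scenes"}',
        "{Trailers}",
        None,
    ]
})

# Literal substring match; missing values count as "feature absent"
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", regex=False, na=False).astype(int)
)
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", regex=False, na=False).astype(int)
)

print(toy[["deleted_scenes", "behind_the_scenes"]])
```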

# Avoid columns that leak data about the target
X = df.drop(['rental_length_days', 'special_features', 'rental_date', 'return_date'], axis=1)
# Target column
y = df['rental_length_days']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

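# Baseline: ordinary least squares on all remaining features
model = LinearRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Keep the baseline as the current best model, with its test MSE
best_model = model
best_mse = mse

print(best_model)
print(best_mse)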
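The brief asks for several regression models, while the code above fits only a linear regression. One possible way to continue is to fit a few candidates and keep the one with the lowest test MSE. The sketch below uses synthetic data so that it runs without rental_info.csv. The choice of tree-based models, their hyperparameters, and the synthetic target are all assumptions, not part of the original project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rental features (assumption, not real data)
rng = np.random.default_rng(9)
X = rng.uniform(0, 5, size=(500, 4))
y = 2 * X[:, 0] + 2 * np.sin(3 * X[:, 1]) + rng.normal(0, 0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# Candidate regressors; hyperparameters are illustrative, not tuned
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=6, random_state=9),
    "forest": RandomForestRegressor(n_estimators=100, random_state=9),
}

# Fit each candidate and record its MSE on the held-out test set
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, estimator.predict(X_test))

# Keep the model with the lowest test MSE
best_name = min(results, key=results.get)
best_model = candidates[best_name]
best_mse = results[best_name]
print(best_name, round(best_mse, 3))
```

On the real data, the same loop could be run on the X_train and X_test split created earlier. Choosing hyperparameters with cross-validation on the training set, rather than on the test set, would keep the reported test MSE an honest estimate.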