
A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have approached you to try out some regression models for this task. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the CSV file rental_info.csv. It has the following features:

  • "rental_date": The date (and time) the customer rents the DVD.
  • "return_date": The date (and time) the customer returns the DVD.
  • "amount": The amount paid by the customer for renting the DVD.
  • "amount_2": The square of "amount".
  • "rental_rate": The rate at which the DVD is rented.
  • "rental_rate_2": The square of "rental_rate".
  • "release_year": The year the movie being rented was released.
  • "length": Length of the movie being rented, in minutes.
  • "length_2": The square of "length".
  • "replacement_cost": The amount it will cost the company to replace the DVD.
  • "special_features": Any special features the DVD also has, for example trailers or deleted scenes.
  • "NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
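To make the "reference dummy" remark concrete, here is a minimal sketch of how such rating columns could be produced with pandas. The `ratings` series is hypothetical, and the assumption that "G" is the dropped reference level is mine; the source only says that a reference dummy was dropped.

```python
import pandas as pd

# Hypothetical ratings column; rental_info.csv ships only the finished dummies.
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "PG"], name="rating")

# drop_first=True drops the first category in sorted order ("G" here),
# which then serves as the reference level (assumed, not stated in the source).
rating_dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(rating_dummies.columns.tolist())
print(rating_dummies.iloc[0].tolist())  # the "G" row has 0 in every dummy column
```

A row of all zeros therefore stands for the reference rating, which is why no separate column is needed for it.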

The necessary imports and the preprocessing needed to prepare the data for modelling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Seed reused for every random_state so that results are reproducible
SEED = 9

# Load the rental data
df_rental = pd.read_csv('rental_info.csv')

# The dependent variable that will be used for the prediction:
# rental duration as a timedelta, then as whole days (.dt.days drops any partial day)
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days

# Dummy flags for two of the special features listed in "special_features"
df_rental["deleted_scenes"] = np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
df_rental["behind_the_scenes"] = np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
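The target and flag construction above can be checked on a couple of rows. The sketch below uses a hypothetical two-row frame (the dates and feature strings are invented for illustration, not taken from rental_info.csv) to show that `.dt.days` keeps only whole days and how the `str.contains` flags behave.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; real values come from rental_info.csv.
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
})

# Timedelta between return and rental, then its whole-day component.
rental_length = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])
toy["rental_length_days"] = rental_length.dt.days

# 1 if the feature name appears anywhere in the string, else 0.
toy["deleted_scenes"] = np.where(toy["special_features"].str.contains("Deleted Scenes"), 1, 0)
toy["behind_the_scenes"] = np.where(toy["special_features"].str.contains("Behind the Scenes"), 1, 0)

print(toy[["rental_length_days", "deleted_scenes", "behind_the_scenes"]])
```

The first rental lasts 3 days and roughly 21 hours, yet it is recorded as 3 days, so the target is a floor of the true duration rather than a rounded value.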

Performing Lasso Regression to get the best features for prediction:

lasso = Lasso(alpha=0.3, random_state=SEED)

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train = X_train.iloc[:, lasso_coef > 0]
X_lasso_test = X_test.iloc[:, lasso_coef > 0]
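The selection rule above keeps only columns whose Lasso coefficient is strictly positive. As a small sketch of what that implies, the synthetic example below (hypothetical columns `x0`, `x1`, `x2`; not the rental data) compares the `> 0` mask with a `!= 0` mask. Which rule suits the rental data is a judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: x0 has a positive effect, x1 a negative one, x2 none.
rng = np.random.default_rng(9)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x0", "x1", "x2"])
y_demo = 2 * X_demo["x0"] - 3 * X_demo["x1"] + rng.normal(scale=0.1, size=500)

lasso_demo = Lasso(alpha=0.3).fit(X_demo, y_demo)

# Mask used in the text: strictly positive coefficients only.
positive_cols = X_demo.columns[lasso_demo.coef_ > 0].tolist()
# Alternative mask: every coefficient that Lasso did not shrink to zero.
nonzero_cols = X_demo.columns[lasso_demo.coef_ != 0].tolist()

print("coef > 0 :", positive_cols)
print("coef != 0:", nonzero_cols)
```

With the positive-only mask, a feature with a strong negative effect (`x1` here) is discarded along with the irrelevant one, whereas the non-zero mask retains it.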

For this project I will be using three models to try to achieve an MSE of 3 or less on the test set.

These models being:

1- Decision Tree Regressor

2- Random Forest Regressor

3- Decision Tree Regressor with SGB (gradient boosting), using randomized search cross-validation to tune the hyperparameters

1. The Decision Tree Regressor Model:

dt = DecisionTreeRegressor(random_state=SEED)
dt.fit(X_train, y_train)

y_dt_pred = dt.predict(X_test)
mse_dt = MSE(y_test, y_dt_pred)
print(mse_dt)

2. The Random Forest Regressor Model

rf = RandomForestRegressor(random_state=SEED)
rf.fit(X_train, y_train)

y_rf_pred = rf.predict(X_test)
mse_rf = MSE(y_test, y_rf_pred)
print(mse_rf)

3. Decision Tree with SGB (SGBR), using Randomized Search Cross-Validation

params_sgbr = {'max_depth': np.arange(1, 11, 1),
               'n_estimators': np.arange(1, 201, 1)}

sgbr = GradientBoostingRegressor(random_state=SEED)

rand_sgbr = RandomizedSearchCV(sgbr,
                               params_sgbr,
                               scoring='neg_mean_squared_error',
                               cv=3,
                               verbose=2,
                               n_jobs=-1,
                               random_state=SEED)

rand_sgbr.fit(X_train, y_train)

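# Best hyperparameters found by the randomized search
best_params = rand_sgbr.best_params_

# Refit a fresh model on the full training set with those hyperparameters
sgbr = GradientBoostingRegressor(max_depth=best_params['max_depth'],
                                 n_estimators=best_params['n_estimators'],
                                 random_state=SEED)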

sgbr.fit(X_train, y_train)

y_sgbr_pred = sgbr.predict(X_test)
mse_sgbr = MSE(y_test, y_sgbr_pred)
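The manual refit above can also be expressed through the search object itself, because `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data from `make_regression`, with a deliberately small search space and `n_iter=3` to keep it fast; none of these settings come from the source.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

SEED = 9
# Synthetic stand-in for the training data (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=SEED)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=SEED),
    {"max_depth": np.arange(1, 4), "n_estimators": np.arange(10, 31)},
    n_iter=3,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=SEED,
)
search.fit(X_demo, y_demo)

# Manual refit with the best hyperparameters, as done in the text.
manual = GradientBoostingRegressor(random_state=SEED, **search.best_params_).fit(X_demo, y_demo)

# The refit model stored on the search object gives the same predictions.
same = np.allclose(search.best_estimator_.predict(X_demo), manual.predict(X_demo))
print(search.best_params_, same)
```

Either route yields the same fitted model here; using `best_estimator_` simply avoids training the winning configuration a second time.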

Final Results of the Best Model and the Best MSE value

best_model = sgbr
best_mse = mse_sgbr
print(best_mse)

Comparison between the various models used for the predictions:
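One way such a comparison could be tabulated is sketched below. It runs on synthetic data from `make_regression`, so the numbers it prints say nothing about the rental results; `compare_test_mse` is a hypothetical helper, and the model settings are chosen only to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

SEED = 9
# Synthetic stand-in for the rental features and target (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=SEED),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=SEED),
    "gradient_boosting": GradientBoostingRegressor(random_state=SEED),
}


def compare_test_mse(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return its test-set MSE, lowest (best) first."""
    scores = {
        name: mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
        for name, model in models.items()
    }
    return pd.Series(scores, name="test_mse").sort_values()


comparison = compare_test_mse(models, X_tr, y_tr, X_te, y_te)
print(comparison)
```

Sorting by test MSE puts the best model in the first row, which makes it easy to check a candidate against the target of 3 or less once real data is used.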