
A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for based on some features and have approached you for help. They want you to try out some regression models to predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the CSV file rental_info.csv. It has the following features:

  • "rental_date": The date (and time) the customer rents the DVD.
  • "return_date": The date (and time) the customer returns the DVD.
  • "amount": The amount paid by the customer for renting the DVD.
  • "amount_2": The square of "amount".
  • "rental_rate": The rate at which the DVD is rented for.
  • "rental_rate_2": The square of "rental_rate".
  • "release_year": The year the movie being rented was released.
  • "length": Lenght of the movie being rented, in minuites.
  • "length_2": The square of "length".
  • "replacement_cost": The amount it will cost the company to replace the DVD.
  • "special_features": Any special features, for example trailers/deleted scenes that the DVD also has.
  • "NC-17", "PG", "PG-13", "R": These columns are dummy variables of the rating of the movie. It takes the value 1 if the move is rated as the column name and 0 otherwise. For your convinience, the reference dummy has already been dropped.


How to approach this project

  1. Getting the number of rental days.

  2. Adding dummy variables using the special features column.

  3. Executing a train-test split.

  4. Performing feature selection.

  5. Choosing models and performing hyperparameter tuning.

  6. Predicting values on the test set.

  7. Computing the mean squared error.
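
Before diving into the steps, here is a minimal sketch of how the modelling part of the workflow could be organized with a scikit-learn Pipeline; it assumes the feature matrix and target produced in steps 1-3 and is only an illustration of the later steps, not part of the required solution.

# Minimal sketch: chain scaling and a Lasso regressor so the scaler is fit on
# the training data alone. The alpha value is illustrative, not tuned.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

sketch_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.3)),
])
# e.g. sketch_pipeline.fit(X_train, y_train) followed by sketch_pipeline.predict(X_test)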

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below
# Investigate the dataset
df = pd.read_csv('rental_info.csv')

display(df.head())
df.info()
display(df.describe())
display(df.isna().sum())
# 1. Getting the number of rental days.
# Convert the columns to datetime
df['rental_date'] = pd.to_datetime(df['rental_date'], format='%Y-%m-%d %H:%M:%S%z', utc=True)
df['return_date'] = pd.to_datetime(df['return_date'], format='%Y-%m-%d %H:%M:%S%z', utc=True)

# Create a new column 'rental_length_days'
df['rental_length_days'] = df['return_date'] - df['rental_date']

# Select only days, excluding other components.
df['rental_length_days'] = df['rental_length_days'].dt.days
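# Note (optional, sketch): .dt.days truncates any partial day, so a rental of
# 1 day 23 hours counts as 1 day. If fractional days were preferred, the
# difference could instead be converted with total_seconds(), e.g.:
# df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.total_seconds() / 86400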
# Create two columns of dummy variables from "special_features", which takes the value of 1 when:
# The value is "Deleted Scenes", storing as a column called "deleted_scenes".
# The value is "Behind the Scenes", storing as a column called "behind_the_scenes".

df['deleted_scenes'] = np.where(df['special_features'].str.contains('Deleted Scenes'), 1, 0)
df['behind_the_scenes'] = np.where(df['special_features'].str.contains('Behind the Scenes'), 1, 0)
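# Optional sanity check (sketch): confirm the new dummy flags look sensible.
print(df[['deleted_scenes', 'behind_the_scenes']].sum())
print(df['special_features'].head())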
# Executing a train-test split
# Split the data into train and test sets, avoiding any features that leak data about the target variable, and include 20% of the total data in the test set.

# Decide which columns to use
cols_to_drop = ["special_features", "rental_length_days", "rental_date", "return_date"]
X = df.drop(cols_to_drop, axis=1)
y = df['rental_length_days'] 

# Split data into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
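# Optional check (sketch): confirm no columns that leak the target remain in X,
# and inspect the split sizes.
assert 'rental_length_days' not in X.columns
assert 'rental_date' not in X.columns and 'return_date' not in X.columns
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)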
# Performing feature selection
# Use a Lasso model for feature selection: its L1 penalty shrinks the
# coefficients of uninformative features towards zero.
from sklearn.linear_model import Lasso

# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9) 

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]
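# Optional sketch: pair each Lasso coefficient with its feature name to see
# which columns survive the selection above.
lasso_coef_by_feature = pd.Series(lasso_coef, index=X_train.columns)
print(lasso_coef_by_feature[lasso_coef_by_feature > 0])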
# Choosing models and performing hyperparameter tuning - Try a variety of regression models.

# Run OLS ("Ordinary Least Squares")
from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)


# Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters
hyper_params = rand_search.best_params_

# {'n_estimators': 51, 'max_depth': 10}

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

# Random forest gives lowest MSE so:
best_model = rf
best_mse = mse_random_forest
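
# Optional extension (sketch): the brief asks for a variety of regression models,
# so a gradient-boosted ensemble could be compared in the same way. The
# hyperparameters below are illustrative, not tuned; if its MSE came out lower,
# best_model and best_mse could be updated accordingly.
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=9)
gbr.fit(X_train, y_train)
mse_gradient_boosting = mean_squared_error(y_test, gbr.predict(X_test))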
# Print and check the dataset
display(df['rental_length_days'])
display(df)
display(hyper_params)
display(best_model)
display(best_mse)
display(mse_lin_reg_lasso)
display(mse_random_forest)
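# Final check (sketch): the company requires an MSE of 3 or less on the test set.
print(f"Best model: {type(best_model).__name__}, test MSE: {best_mse:.3f}")
assert best_mse <= 3, "Test MSE exceeds the required threshold of 3"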