A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you to try out some regression models. The company wants a model which yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file rental_info.csv. It has the following features:

  • "rental_date": The date (and time) the customer rents the DVD.
  • "return_date": The date (and time) the customer returns the DVD.
  • "amount": The amount paid by the customer for renting the DVD.
  • "amount_2": The square of "amount".
  • "rental_rate": The rate at which the DVD is rented.
  • "rental_rate_2": The square of "rental_rate".
  • "release_year": The year the movie being rented was released.
  • "length": Length of the movie being rented, in minutes.
  • "length_2": The square of "length".
  • "replacement_cost": The amount it will cost the company to replace the DVD.
  • "special_features": Any special features the DVD also has, for example trailers or deleted scenes.
  • "NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
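
The dummy columns above can be illustrated with a small sketch. It assumes a hypothetical "rating" column with the values G, PG, PG-13, R and NC-17, and assumes that "G" is the dropped reference level; neither is stated in the data description, so treat both as illustrative only.

```python
import pandas as pd

# Hypothetical ratings column (an assumption: rental_info.csv ships only the dummies).
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "G"], name="rating")

# drop_first=True removes the first category in sorted order ("G" here),
# which becomes the reference level encoded as zeros in every dummy column.
dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(dummies.columns.tolist())   # remaining dummy columns
print(dummies.iloc[0].tolist())   # the "G" row is all zeros
```

A row with zeros in all four columns therefore stands for the reference rating, which is why no fifth column is needed.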
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

01. Data Validation

rental = pd.read_csv('rental_info.csv', parse_dates=['rental_date','return_date'])
# Define view_df to print dataframe info, summary statistics, and value counts of object/category columns
def view_df(df):
    print(df.info())
    print('-'*50)
    print(df.describe())
    print('-'*50)
    for col in df.columns:
        if (df[col].dtype == 'object')|(df[col].dtype == 'category'):
            print(df[col].value_counts())
view_df(rental)

02. Target and possible cause

Since we want to predict the number of days a customer will rent a DVD for, we compute the target as the return date minus the rental date. Special features of a DVD, such as Deleted Scenes and Behind the Scenes, may also affect the number of rental days, but they are not provided as separate columns, so we derive indicator columns for them from "special_features".

# calculate the rental days a DVD has been rented by a customer
rental['rental_length_days'] = (rental['return_date'] - rental['rental_date']).dt.days
rental['deleted_scenes'] = [1 if 'Deleted Scenes' in x else 0 for x in rental['special_features']]
rental['behind_the_scenes'] = [1 if 'Behind the Scenes' in x else 0 for x in rental['special_features']]
rental.info()
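
One detail of the target calculation is worth a quick check: `.dt.days` keeps only the whole-day component of each timedelta, so partial days are truncated. The sketch below uses two made-up timestamp pairs (not rows from rental_info.csv) to compare it with a fractional-day alternative.

```python
import pandas as pd

# Made-up timestamps for illustration only.
demo = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33", "2005-06-15 23:19:16"]),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33", "2005-06-18 19:24:16"]),
})

delta = demo["return_date"] - demo["rental_date"]

# Whole days only: 3 days 20:46:00 becomes 3.
whole_days = delta.dt.days

# Fractional days, if sub-day precision is wanted instead.
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```

Whether truncation is acceptable depends on how the company counts rental days; the notebook uses whole days as the target.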

03. Separate the train and test data and import the necessary models: DecisionTree, RandomForest, Lasso

X = rental.drop(['rental_date', 'return_date', 'special_features', 'rental_length_days'], axis=1)
y = rental.rental_length_days
# import needed Regressor from sklearn
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
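# Standardize the features
# Caution: the scaler is fit on all rows before the split, so test-set statistics leak into training;
# fitting it on X_train only would avoid this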
scaler = StandardScaler()
scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(scaled, columns=X.columns)

seed = 94021419
# train test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.2, random_state=seed)
dtr = DTR(random_state=seed)
rfr = RFR(random_state=seed)
lasso = Lasso(random_state=seed)
lcv = LassoCV(cv=15, random_state=seed)
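
As a rough sketch of how candidate regressors like these could be compared against an MSE target, the block below fits three models on synthetic data and reports their test-set MSE. The data, the `evaluate` helper, the `alpha` value, and the tree settings are all assumptions made for illustration; they are not the notebook's results or tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rental features and target (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = 4 + 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def evaluate(models, X_tr, y_tr, X_te, y_te):
    """Fit each model on the training split and return its test-set MSE."""
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        results[name] = mean_squared_error(y_te, model.predict(X_te))
    return results

candidates = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "lasso": Lasso(alpha=0.1),
}

demo_mse = evaluate(candidates, X_tr, y_tr, X_te, y_te)
best_name = min(demo_mse, key=demo_mse.get)

MSE_TARGET = 3  # threshold stated in the project brief
for name, mse in demo_mse.items():
    print(f"{name}: MSE={mse:.3f}, meets target: {mse <= MSE_TARGET}")
```

The same loop could be applied to `dtr`, `rfr`, `lasso`, and `lcv` with the real `X_train`/`X_test` split, keeping the test set untouched until the final comparison.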