Skip to content

A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for based on some features and has approached you for help. They want you to try out some regression models which will help predict the number of days a customer will rent a DVD for. The company wants a model which yeilds a MSE of 3 or less on a test set. The model you make will help the company become more efficient inventory planning.

The data they provided is in the csv file rental_info.csv. It has the following features:

  • "rental_date": The date (and time) the customer rents the DVD.
  • "return_date": The date (and time) the customer returns the DVD.
  • "amount": The amount paid by the customer for renting the DVD.
  • "amount_2": The square of "amount".
  • "rental_rate": The rate at which the DVD is rented for.
  • "rental_rate_2": The square of "rental_rate".
  • "release_year": The year the movie being rented was released.
  • "length": Lenght of the movie being rented, in minuites.
  • "length_2": The square of "length".
  • "replacement_cost": The amount it will cost the company to replace the DVD.
  • "special_features": Any special features, for example trailers/deleted scenes that the DVD also has.
  • "NC-17", "PG", "PG-13", "R": These columns are dummy variables of the rating of the movie. It takes the value 1 if the move is rated as the column name and 0 otherwise. For your convinience, the reference dummy has already been dropped.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

# Import any additional modules and start coding below

MAIN GOAL

Create supervised model which score 3 MSE at MAX IN PREDICTING THE NUMBER OF DAYS A CUSTOMER RENTS DVDs FOR

Target

Target variable is a 'rental_length_days' which need to be engineered from 'rental_date' and 'return_date'

'rental_date' & 'return_date' need to be EXCLUDED from FEATURES cuz they will leak informations about TARGET

Desired Workflow

  1. Insepct data - look for missing values and inproper data types
  2. Look for a potential FEATURES
  3. Engineer our TARGET variable
  4. Prepare y and X arrays
  5. Choose models we'll try on our dataset
  6. Split for Train and Test set
  7. Instantiate models and CrossValidation (Random for speeding up evaluation process)
  8. Fit and predict data
  9. Compare MSE metrics for all models
  10. Choose model NOTE: Set random_state = 9 for all stochastic processess to allow TESTING when Submiting project

Importing Data and performing EDA

# Import .csv and save as DF
rental = pd.read_csv('rental_info.csv', parse_dates = ['rental_date','return_date'])
# EDA Dataset
rental.head()
rental.info()
rental.describe()
rental['special_features'].value_counts(normalize=True)

EDA insights

  1. Our data set is properly formated (proper datatypes)
  2. We have no null value (No missingness)
  3. We have 1 object type column ('special_features') which need to be encoded
  4. There are no abysmal values like negative or out of "reasonable" range
  5. We have 3 columns containing squared values of other columns. Since there are no need of squaring oryginal values those (squared) columns will be DROPED from Dataset
  6. Since there are wide spread between column values we should STANDARIZE our features

Data preprocessing

# Creating TARGET variable: Difference between 'return_date' and 'rental_date' in DAYS
rental['rental_length_days'] = (rental['return_date'] - rental['rental_date']).dt.days

# Drop unnecessary columns 'return_date' and 'rental_date' to avoid leaking data about TARGET
rental.drop(columns = ['return_date', 'rental_date'], inplace = True)
# Drop unnecessary columns which SQUARED values: 'amount_2', 'length_2', 'rental_rate_2'
#sqrt_cols = ['amount_2', 'length_2', 'rental_rate_2']

#rental.drop(columns = sqrt_cols, inplace = True)
# Preview dataset after drops
rental.info()
# Encode special_features but only for: 'Deleted Scenes' and 'Behind the Scenes' to meet project requirements
str_to_encode = {
    'Deleted Scenes' : 'deleted_scenes',
    'Behind the Scenes' : 'behind_the_scenes'}

for key, value in str_to_encode.items():
    rental[value] = rental['special_features'].str.contains(key).astype(bool)
    assert rental[value].sum() == rental['special_features'].str.contains(key).sum(), \
        f"Something went wrong. Encoded values count {rental[value].sum()} != oryginal values count: {rental['special_features'].str.contains(key).sum()}"

# Drop special_features column since its not need anymore
rental.drop(columns = ['special_features'], inplace = True)