Project: Predicting Movie Rental Durations

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file rental_info.csv. It has the following features:

"rental_date": The date (and time) the customer rents the DVD.
"return_date": The date (and time) the customer returns the DVD.
"amount": The amount paid by the customer for renting the DVD.
"amount_2": The square of "amount".
"rental_rate": The rate at which the DVD is rented.
"rental_rate_2": The square of "rental_rate".
"release_year": The year the movie being rented was released.
"length": Length of the movie being rented, in minutes.
"length_2": The square of "length".
"replacement_cost": The amount it will cost the company to replace the DVD.
"special_features": Any special features, for example trailers/deleted scenes that the DVD also has.
"NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

rentals = pd.read_csv("rental_info.csv")
rentals.head()

# # Convert “return_date” and “rental_date” into a datetime format using the pd.to_datetime() function
# rentals["return_date"] = pd.to_datetime(rentals["return_date"])
# rentals["rental_date"] = pd.to_datetime(rentals["rental_date"])

# Get the number of days by subtracting “rental_date” from “return_date” and store it in a new column called “rental_length_days”.
rentals["rental_length_days"] = (pd.to_datetime(rentals["return_date"]) - pd.to_datetime(rentals["rental_date"])).dt.days

rentals["rental_length_days"]
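
The subtraction above yields a timedelta, and .dt.days keeps only its whole-day part. The following is a minimal sketch of that behaviour on a two-row sample; the timestamps and the "rental_length_exact" column are made up for illustration and are not taken from rental_info.csv.

```python
import pandas as pd

# Hypothetical sample rows; these timestamps are illustrative only.
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
})

# Subtracting two datetime columns gives a timedelta column.
duration = pd.to_datetime(sample["return_date"]) - pd.to_datetime(sample["rental_date"])

# .dt.days drops the partial day (3 days 20:46:00 becomes 3).
sample["rental_length_days"] = duration.dt.days

# Fractional days, shown only for comparison with the truncated value.
sample["rental_length_exact"] = duration.dt.total_seconds() / 86400

print(sample[["rental_length_days", "rental_length_exact"]])
```

Whether truncated or fractional days is the better target depends on how the company counts a rental day; the project uses the truncated value.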

# Create two columns of dummy variables from “special_features” using the np.where() function
rentals["deleted_scenes"] = np.where(rentals["special_features"].str.contains("Deleted Scenes"), 1, 0)
rentals["behind_the_scenes"] = np.where(rentals["special_features"].str.contains("Behind the Scenes"), 1, 0)

rentals["special_features"].unique()

# Count the rentals whose DVD includes deleted scenes
len(rentals[rentals["deleted_scenes"] == 1])
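
The dummy columns above come from a substring test on "special_features". Here is a small sketch of the same pattern on a hand-made Series; the three strings are assumptions about what the column might contain, and na=False is an optional extra that treats missing entries as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real column's formatting may differ.
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers}',
])

# str.contains returns a boolean mask; np.where maps True/False to 1/0.
deleted = np.where(features.str.contains("Deleted Scenes", na=False), 1, 0)
behind = np.where(features.str.contains("Behind the Scenes", na=False), 1, 0)

print(deleted, behind)

# Summing a 0/1 dummy counts the matching rows, like len(df[df[col] == 1]).
print(int(deleted.sum()))
```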

# X = rentals.drop("rental_length_days", axis=1).values
# y = rentals["rental_length_days"].values

X = rentals.drop(['rental_date', 'return_date', 'rental_length_days', 'special_features'], axis=1)
y = rentals['rental_length_days']

# Perform a train-test split using the train_test_split() function from sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

# Perform feature selection using Lasso regression. Instantiate the Lasso() model with random_state=9 and a positive decimal value for the alpha keyword argument.
# Fit the model on the training data and read the fitted coefficients from the .coef_ attribute.
# Subset the training and test features with .iloc[], keeping the columns whose coefficient is non-zero (lasso_coef != 0, so features with negative coefficients are kept as well).
lasso = Lasso(alpha=0.1, random_state=9)
lasso.fit(X_train, y_train)

lasso_coef = lasso.coef_
X_train_lasso = X_train.iloc[:, lasso_coef != 0]
X_test_lasso = X_test.iloc[:, lasso_coef != 0]
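
To see what the coefficient mask does, here is a sketch on synthetic data rather than the rental data. The column names "a" to "d", the sample size, and the coefficients 3 and -2 are all made up; only "a" and "b" influence the target, and "b" does so negatively.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic stand-in for the feature matrix (assumption, not rental_info.csv).
rng = np.random.default_rng(9)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1, random_state=9)
lasso_demo.fit(X_demo, y_demo)

# Lasso shrinks the coefficients of uninformative features to exactly zero.
nonzero_mask = lasso_demo.coef_ != 0
positive_mask = lasso_demo.coef_ > 0

# A boolean array selects columns by position with .iloc.
X_selected = X_demo.iloc[:, nonzero_mask]

print(list(X_selected.columns))              # features kept by != 0
print(list(X_demo.columns[positive_mask]))   # features kept by > 0
```

In this sketch the != 0 mask keeps both informative features, while a > 0 mask would silently drop "b" because its coefficient is negative. Which mask is appropriate for the rental data depends on the signs of the fitted coefficients.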

# Try different regression models such as LinearRegression(), DecisionTreeRegressor(), and RandomForestRegressor() to estimate the target variable based on the features. Train the models on the training data and use the .predict() function of the trained model to get fitted values. Use the mean_squared_error() function from sklearn.metrics to compute mean squared error.

# Linear Regression
lr = LinearRegression()
lr.fit(X_train_lasso, y_train)
lr_mse = mean_squared_error(y_test, lr.predict(X_test_lasso))
print(lr_mse)
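
The other models named in the step above can be tried the same way. Below is one possible sketch, not the project's reference solution: compare_models is a hypothetical helper, the models use default hyperparameters, and the demo runs on synthetic data from make_regression so that it is self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_models(X_train, X_test, y_train, y_test, random_state=9):
    """Fit each candidate model and return its test-set MSE by name."""
    models = {
        "linear_regression": LinearRegression(),
        "decision_tree": DecisionTreeRegressor(random_state=random_state),
        "random_forest": RandomForestRegressor(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_squared_error(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (assumption); swap in the rental features and target.
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=9)
split = train_test_split(X_syn, y_syn, test_size=0.2, random_state=9)

scores = compare_models(*split)
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

On the rental data the call would be compare_models(X_train_lasso, X_test_lasso, y_train, y_test), and the lowest score is the one to check against the company's target of an MSE of 3 or less.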
