A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have approached you to try out regression models for this task. The company wants a model that yields an MSE (mean squared error) of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
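To make the MSE target concrete, here is a minimal sketch of how the metric is computed. The numbers are made up for illustration and are not taken from rental_info.csv; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical values for illustration only; not taken from rental_info.csv.
y_true = np.array([3, 5, 2, 7])  # actual rental lengths, in days
y_pred = np.array([4, 5, 1, 5])  # predicted rental lengths, in days

# MSE is the mean of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 1.5

# The company's requirement: MSE of 3 or less on the test set.
meets_target = mse <= 3
print(meets_target)
```

Because the errors are squared, the single prediction that is off by two days contributes more to the total than the two one-day misses combined.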

The data they provided is in the CSV file rental_info.csv. It has the following features:

  • "rental_date": The date (and time) the customer rents the DVD.
  • "return_date": The date (and time) the customer returns the DVD.
  • "amount": The amount paid by the customer for renting the DVD.
  • "amount_2": The square of "amount".
  • "rental_rate": The rate at which the DVD is rented.
  • "rental_rate_2": The square of "rental_rate".
  • "release_year": The year the movie being rented was released.
  • "length": Length of the movie being rented, in minutes.
  • "length_2": The square of "length".
  • "replacement_cost": The amount it will cost the company to replace the DVD.
  • "special_features": Any special features that the DVD also has, for example trailers or deleted scenes.
  • "NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
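The squared columns and the rating dummies in the list above can be sketched on toy data. This is an illustration under assumptions, not how the company built the file: the `rating` column and the "G" category are hypothetical stand-ins for the dropped reference level, which the source does not name.

```python
import pandas as pd

# Hypothetical toy data; the "rating" column and the "G" level are assumptions.
toy = pd.DataFrame({
    "rating": ["G", "PG", "R", "NC-17", "PG-13"],
    "length": [90, 120, 100, 85, 110],
})

# One dummy column per rating; drop_first=True drops the first level
# in sorted order ("G" here), which then acts as the reference category.
dummies = pd.get_dummies(toy["rating"], drop_first=True, dtype=int)
print(list(dummies.columns))

# A squared feature such as "length_2" is simply the column times itself.
toy["length_2"] = toy["length"] ** 2
print(toy.join(dummies))
```

A row with 0 in all four dummy columns belongs to the reference category, so no information is lost by dropping it.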

0. Import the data and main Python libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("rental_info.csv")
print("Data shape= ", data.shape)
data.head()

1. Exploratory Data Analysis

data.info()
## 1.1 Getting the number of rental days.
data["rental_date"] = pd.to_datetime(data["rental_date"])
data["return_date"] = pd.to_datetime(data["return_date"])
rental_length = data["return_date"] - data["rental_date"]
# .dt.days keeps only the whole-day part of each timedelta (partial days are dropped)
data["rental_length"] = rental_length.dt.days
# Inspect the distinct special-feature combinations before encoding them
data["special_features"].value_counts()
# Flag rows whose special_features text mentions each feature (1 = present, 0 = absent)
data["deleted_scenes"] = np.where(data["special_features"].str.contains("Deleted Scenes"), 1,0)
data["behind_the_scenes"] = np.where(data["special_features"].str.contains("Behind the Scenes"), 1,0)
data.head(1)
numeric_data = data.drop(["rental_date","return_date","special_features"], axis = 1)
numeric_data.head(1)
axes = numeric_data.hist(xlabelsize=6, ylabelsize=6,layout=(3,5), figsize=(10,6));
for ax in axes.flatten():
    ax.title.set_size(10)  # Set the title size to 10, adjust as needed

plt.tight_layout()  # Adjust the layout
plt.show()
import seaborn as sns
corr = numeric_data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)

plt.title('Heatmap of Correlations')
plt.show()
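The feature engineering steps in this section (rental length in whole days and the two special-feature flags) can be checked on a tiny example. This is a minimal sketch: the two rows and the exact text format of "special_features" are made up for illustration and may not match rental_info.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; dates and the special_features format are assumptions.
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
    "special_features": ["Trailers,Behind the Scenes", "Deleted Scenes"],
})

toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

# The difference is a timedelta; .dt.days keeps only the whole days,
# so 3 days 20 hours becomes 3 and 2 days 20 hours becomes 2.
toy["rental_length"] = (toy["return_date"] - toy["rental_date"]).dt.days

# Substring match on the feature text, encoded as 1/0.
toy["deleted_scenes"] = np.where(
    toy["special_features"].str.contains("Deleted Scenes", regex=False), 1, 0)
toy["behind_the_scenes"] = np.where(
    toy["special_features"].str.contains("Behind the Scenes", regex=False), 1, 0)

print(toy[["rental_length", "deleted_scenes", "behind_the_scenes"]])
```

Note that truncating to whole days means a rental returned just short of four days counts as three, which is worth keeping in mind when interpreting the target.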

2. Split the data

seed = 9