Project Background
A DVD rental company is seeking assistance in determining the number of days customers typically rent DVDs, using various features as predictors. They aim to explore regression models that can accurately predict rental durations. The goal is to develop a model that achieves a mean squared error (MSE) of 3 or less on a test dataset. Such a model would enable the company to optimize its inventory planning processes.
The Data
The data they provided is in the csv file rental_info.csv. It has the following features:
- "rental_date": The date (and time) the customer rents the DVD.
- "return_date": The date (and time) the customer returns the DVD.
- "amount": The amount paid by the customer for renting the DVD.
- "amount_2": The square of "amount".
- "rental_rate": The rate at which the DVD is rented for.
- "rental_rate_2": The square of "rental_rate".
- "release_year": The year the movie being rented was released.
- "length": Length of the movie being rented, in minutes.
- "length_2": The square of "length".
- "replacement_cost": The amount it will cost the company to replace the DVD.
- "special_features": Any special features, for example trailers/deleted scenes that the DVD also has.
- "NC-17", "PG", "PG-13", "R": These columns are dummy variables of the rating of the movie. It takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
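To make the idea of a dropped reference dummy concrete, here is a minimal sketch on made-up ratings. It assumes, purely for illustration, that the dropped reference category is "G"; the brief does not name it. The `ratings` Series is hypothetical and is not taken from rental_info.csv.

```python
import pandas as pd

# Hypothetical ratings, for illustration only (not from rental_info.csv).
ratings = pd.Series(["G", "PG", "PG-13", "R", "NC-17", "PG"], name="rating")

# drop_first=True removes the first category in sorted order ("G" here),
# which then acts as the reference level: a "G" row is all zeros.
rating_dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(rating_dummies)
```

Dropping one level avoids perfect collinearity among the dummy columns in a linear model with an intercept, which is presumably why the dataset ships with four rating columns rather than five.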
Project Instructions
In this project, you will use regression models to predict the number of days a customer rents DVDs for.
As with most data science projects, you will need to pre-process the data provided, in this case, a csv file called rental_info.csv. Specifically, you need to:
- Read in the csv file rental_info.csv using pandas.
- Create a column named "rental_length_days" using the columns "return_date" and "rental_date", and add it to the pandas DataFrame. This column should contain information on how many days a DVD has been rented by a customer.
- Create two columns of dummy variables from "special_features", which take the value of 1 when:
  - The value is "Deleted Scenes", storing as a column called "deleted_scenes".
  - The value is "Behind the Scenes", storing as a column called "behind_the_scenes".
- Make a pandas DataFrame called X containing all the appropriate features you can use to run the regression models, avoiding columns that leak data about the target.
- Choose the "rental_length_days" as the target column and save it as a pandas Series called y.
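The preprocessing steps above can be sketched on a tiny made-up frame. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical, and it assumes "special_features" is stored as a string that may list several features, so a substring match is used rather than an equality test.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for rental_info (values are made up).
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-06-18 19:24:16+00:00"],
    "amount": [2.99, 2.99],
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
})

# Parse the timestamps, then take the whole-day part of the difference.
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])
toy["rental_length_days"] = (toy["return_date"] - toy["rental_date"]).dt.days

# Dummy columns: 1 when the feature text mentions the phrase, else 0.
toy["deleted_scenes"] = np.where(
    toy["special_features"].str.contains("Deleted Scenes"), 1, 0
)
toy["behind_the_scenes"] = np.where(
    toy["special_features"].str.contains("Behind the Scenes"), 1, 0
)

# Drop the target itself and the columns it was derived from (they leak it),
# plus the raw text column that the dummies replace.
X = toy.drop(
    columns=["rental_date", "return_date", "special_features", "rental_length_days"]
)
y = toy["rental_length_days"]
```

Using `.dt.days` truncates partial days; whether to truncate or round is a modelling choice the brief leaves open.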
Following the preprocessing you will need to:
- Split the data into training and test sets named X_train, X_test, y_train, and y_test, avoiding any features that leak data about the target variable, and include 20% of the total data in the test set.
- Set random_state to 9 whenever you use a function/method involving randomness, for example, when doing a test-train split.
- Recommend a model yielding a mean squared error (MSE) less than 3 on the test set.
Save the model you would recommend as a variable named best_model, and save its MSE on the test set as best_mse.
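As a rough sketch of the split-and-compare workflow these instructions describe, the block below uses synthetic data so it runs on its own. The names `X_demo`, `y_demo`, `candidates`, and `mse_by_model` are illustrative assumptions, the two candidate models are just examples, and the resulting MSE says nothing about the rental data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the rental data): 200 rows, 3 features.
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for testing, with the required random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9
)

# Example candidates; any regressor with fit/predict could be added.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=9),
}

# Fit each candidate on the training set and score it on the test set.
mse_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse_by_model[name] = mean_squared_error(y_test, model.predict(X_test))

# Keep the fitted model with the lowest test MSE.
best_name = min(mse_by_model, key=mse_by_model.get)
best_model = candidates[best_name]
best_mse = mse_by_model[best_name]
```

Selecting on the test set, as the brief asks, is simple but makes the reported MSE slightly optimistic; a separate validation split or cross-validation would be a more cautious alternative.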
Importation of Dependencies
I import the libraries, classes, and functions needed for data analysis, model development, and evaluation. I import pandas and numpy for data manipulation and numerical computation. From scikit-learn, I import train_test_split for splitting the data into training and test sets, mean_squared_error for evaluating model performance, and the Lasso, LinearRegression, and RandomForestRegressor estimators. I also import RandomizedSearchCV for hyperparameter tuning.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
Initial Data Exploration
Here, I load and inspect the provided csv file. I read the csv file into a pandas DataFrame called rental_info and call the head() method to preview its first few rows. Additionally, I use the info() method to display an overview of the DataFrame, including the data type of each column and the number of non-missing values, which gives a quick assessment of the dataset's structure and completeness.
# Read in 'rental_info.csv'
rental_info = pd.read_csv('rental_info.csv')
Below are the first few rows of the rental_info DataFrame.
# See the first few rows of rental_info
rental_info.head()
# Call info() on the DataFrame to see data types and non-null counts per column
rental_info.info()
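A possible refinement, sketched here on a small in-memory CSV rather than the real file: passing parse_dates to read_csv makes info() report the date columns as datetimes instead of generic objects, and isna().sum() gives the missing-value count per column directly. The `csv_text` contents and the `sample` name are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for rental_info.csv; one return_date is missing.
csv_text = io.StringIO(
    "rental_date,return_date,amount\n"
    "2005-05-25 02:54:33,2005-05-28 23:40:33,2.99\n"
    "2005-06-15 23:19:16,,0.99\n"
)

# Parse the two date columns while reading.
sample = pd.read_csv(csv_text, parse_dates=["rental_date", "return_date"])

# info() reports non-null counts; isna().sum() reports missing counts.
sample.info()
missing_per_column = sample.isna().sum()
print(missing_per_column)
```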