Predicting hotel booking cancellations in Python
In this workspace, we will build a machine learning model to predict whether or not a customer cancelled a hotel booking.
We will use a dataset on hotel bookings from the article "Hotel booking demand datasets", published in the Elsevier journal Data in Brief. The abstract of the article states:
This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.
For convenience, the two datasets have been combined into a single CSV file, data/hotel_bookings.csv. Let us start by importing everything needed to load, visualize, and model the data.
# Data imports
import pandas as pd
import numpy as np
# Visualization imports
import plotly.express as px
# ML Imports and configuration
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import set_config
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix
set_config(display="diagram")
1. Import the data
The first step in any machine learning workflow is to get the data and explore it.
hotel_bookings = pd.read_csv('data/hotel_bookings.csv')
hotel_bookings.head()
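Beyond looking at the first rows, two quick checks are often useful at this stage: how many values are missing per column, and how balanced the target is. The sketch below uses a small hand-made frame (`toy_bookings`, hypothetical values with a few of the real column names) as a stand-in for `hotel_bookings`, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_bookings (hypothetical values, real column names)
toy_bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "children": [0.0, np.nan, 2.0, 0.0],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "is_canceled": [0, 1, 1, 0],
})

# Count missing values per column, keeping only columns that have any
missing_counts = toy_bookings.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Share of bookings that were canceled (the class balance of the target)
cancel_rate = toy_bookings["is_canceled"].mean()
print(f"Cancellation rate: {cancel_rate:.0%}")
```

Running the same two checks on `hotel_bookings` shows which columns will rely on imputation later and what accuracy a trivial "always predict the majority class" model would reach.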
As a quick exploration, let us look at the number of bookings by month.
bookings_by_month = hotel_bookings.groupby('arrival_date_month', as_index=False)[['hotel']].count().rename(columns={"hotel": "nb_bookings"})
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
fig = px.bar(
    bookings_by_month,
    x='arrival_date_month',
    y='nb_bookings',
    title='Hotel Bookings by Month',
    category_orders={"arrival_date_month": months}
)
fig.show(config={"displayModeBar": False})
Our objective is to build a classification model (a classifier) that predicts whether or not a customer cancelled a hotel booking.
2. Split the data into training and test sets
Let us start by defining a split to divide the data into training and test sets. The basic idea is to train the model on one portion of the data and test its performance on the remaining portion, which the model has not seen. Evaluating on held-out data gives an honest estimate of how well the model generalizes and reveals overfitting that a score computed on the training data would hide.
# List all numerical features
features_num = [
    "lead_time", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights",
    "stays_in_week_nights", "adults", "children", "babies", "is_repeated_guest",
    "previous_cancellations", "previous_bookings_not_canceled", "agent", "company",
    "required_car_parking_spaces", "total_of_special_requests", "adr"
]
# List all categorical features
features_cat = [
    "hotel", "arrival_date_month", "meal", "market_segment", "distribution_channel",
    "reserved_room_type", "deposit_type", "customer_type"
]
features = features_num + features_cat
X = hotel_bookings[features]
y = hotel_bookings["is_canceled"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=420)
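To make the effect of the split concrete, here is a minimal sketch on toy data (`X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`). It also passes `stratify`, which the workspace code does not use; this is an optional assumption that keeps the share of cancellations the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30% positive class (hypothetical values)
X_toy = pd.DataFrame({"lead_time": np.arange(100)})
y_toy = pd.Series([1] * 30 + [0] * 70, name="is_canceled")

# Same test_size and random_state as above, plus optional stratification
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=420, stratify=y_toy
)

# 70 rows for training, 30 held out for testing, no row in both
print(X_tr.shape, X_te.shape)
# With stratify, the positive rate matches in both parts
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class proportions in the two parts can differ slightly by chance, which matters little on a dataset of this size but can matter on small or heavily imbalanced ones.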
3. Preprocess the data
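The next step is to set up a pipeline to preprocess the features. We will impute all missing values with a constant (0 for numerical features, which is what SimpleImputer uses by default for numeric data, and "Unknown" for categorical features) and one-hot encode all categorical features.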
transformer_num = SimpleImputer(strategy="constant")
transformer_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ("num", transformer_num, features_num),
    ("cat", transformer_cat, features_cat)
])
preprocessor
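To see what this preprocessing produces, the sketch below builds the same kind of transformer for a toy frame with one numerical and one categorical column. The names `toy_train`, `toy_new`, and `toy_preprocessor` are hypothetical; only the transformer settings mirror the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training frame with one missing value in each column (hypothetical data)
toy_train = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0],
    "meal": pd.Series(["BB", np.nan, "HB"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["lead_time"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["meal"]),
])

def to_dense(matrix):
    # ColumnTransformer may return a sparse matrix; convert for printing
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

# Columns: lead_time, then one indicator per category (BB, HB, Unknown)
encoded = to_dense(toy_preprocessor.fit_transform(toy_train))
print(encoded)

# A category never seen during fit ("FB") becomes an all-zero indicator row
toy_new = pd.DataFrame({"lead_time": [5.0], "meal": pd.Series(["FB"], dtype=object)})
encoded_new = to_dense(toy_preprocessor.transform(toy_new))
print(encoded_new)
```

The second row of `encoded` shows both imputations at work: the missing `lead_time` becomes 0 and the missing `meal` is encoded as the "Unknown" category. Because of `handle_unknown="ignore"`, an unseen category at prediction time does not raise an error.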
4. Fit the models and evaluate performance
Next, we extend the pipeline to fit a Decision Tree model on the training data.
# Compose data preprocessing and model into a single pipeline
steps = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', DecisionTreeClassifier(random_state=1234))
])
steps.fit(X_train, y_train)
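The imports at the top include `KFold` and `cross_val_score`, which the steps shown here do not use. One plausible use is k-fold cross-validation on the training data, sketched below on synthetic data (`X_demo` and `y_demo` come from `make_classification` and are not the hotel data; the choice of 5 folds is an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five shuffled folds; each fold serves once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=1234)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1234),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

In the workspace, the equivalent call would pass the whole `steps` pipeline with `X_train` and `y_train`, so that imputation and encoding are re-fit inside each fold rather than on the full training set.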
To see how well our model performed, we will calculate and visualize a confusion matrix and calculate the accuracy of the model.
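from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(steps, X_test, y_test);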
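The accuracy can be read off the same confusion matrix. As a minimal sketch, the example below uses short hypothetical label lists (`y_true_demo` and `y_pred_demo`) in place of `y_test` and `steps.predict(X_test)`, and relies on the `confusion_matrix` and `accuracy_score` functions imported earlier.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for y_test and steps.predict(X_test)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Accuracy is the share of correct predictions: (TN + TP) / total
acc = accuracy_score(y_true_demo, y_pred_demo)
print(f"Accuracy: {acc:.2f}")
```

For the fitted pipeline, the corresponding call would be `accuracy_score(y_test, steps.predict(X_test))`. Accuracy alone can look good on imbalanced data, so it is worth reading it alongside the false positive and false negative counts.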