Predicting Hotel Cancellations
🏨 Background
You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
| Column | Description |
|---|---|
| Booking_ID | Unique identifier of the booking. |
| no_of_adults | The number of adults. |
| no_of_children | The number of children. |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
| no_of_week_nights | Number of week nights (Monday to Friday). |
| type_of_meal_plan | Type of meal plan included in the booking. |
| required_car_parking_space | Whether a car parking space is required. |
| room_type_reserved | The type of room reserved. |
| lead_time | Number of days before the arrival date the booking was made. |
| arrival_year | Year of arrival. |
| arrival_month | Month of arrival. |
| arrival_date | Date of the month for arrival. |
| market_segment_type | How the booking was made. |
| repeated_guest | Whether the guest has previously stayed at the hotel. |
| no_of_previous_cancellations | Number of previous cancellations. |
| no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
| avg_price_per_room | Average price per day of the booking. |
| no_of_special_requests | Count of special requests made as part of the booking. |
| booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
Setup
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(rpart.plot))
hotels <- readr::read_csv('data/hotel_bookings.csv', show_col_types = FALSE)
Data Clean / Pre-process
# Removing rows with missing values in any column
hotels2 <- hotels %>% drop_na()
# Creating dummy variable 'canceled' indicated by booking_status
hotels2$canceled <- recode(hotels2$booking_status, "Not_Canceled" = 0, "Canceled" = 1)
# Defining dataset being used for analysis
data <- hotels2 %>% select(-Booking_ID, -booking_status)
# Test / Train Split
# Fix the seed so the split is reproducible (the seed value is arbitrary)
set.seed(123)
split <- initial_split(data, prop = 0.7, strata = canceled)
train <- training(split)
test <- testing(split)
Decision Tree Model
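# Specify Decision Tree Model
# Regression mode on the 0/1 outcome: each prediction is the mean of canceled
# in a leaf (the share of cancellations there), used below as a score
tree_spec <- decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('regression')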
# Train model on training data
tree_fit <- tree_spec %>% fit(canceled ~ ., data = train)
# Plot tree
rpart.plot(tree_fit$fit)
Making Predictions using the Decision Tree Model
Plotting the ROC AUC
# Make predictions on test data set and combine
dt_results <- predict(tree_fit, test) %>%
  cbind(test$canceled, .)
colnames(dt_results) <- c('canceled', 'pred_prob')
# Calculate and plot roc auc
dt_results %>%
  mutate(canceled = as.factor(canceled)) %>%
  roc_auc(., truth = canceled, pred_prob, event_level = 'second')
dt_results %>%
  mutate(canceled = as.factor(canceled)) %>%
  roc_curve(., canceled, pred_prob, event_level = 'second') %>%
  autoplot()
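The tree above is fit in regression mode on a 0/1 outcome, and its numeric predictions are then passed to roc_auc as scores. One possible alternative, sketched here on synthetic data rather than hotel_bookings.csv, is to keep the outcome as a factor and fit in classification mode, so that predict() returns class probabilities directly. The toy tibble, its made-up cancellation rule, and the names toy, toy_split, clf_fit, clf_results and clf_auc are all assumptions for illustration; this is not the method used in this analysis.

```r
# Sketch only: synthetic data, not the hotel bookings file.
suppressPackageStartupMessages(library(tidymodels))

set.seed(123)
n <- 500
toy <- tibble(
  lead_time = runif(n, 0, 300),
  no_of_special_requests = sample(0:3, n, replace = TRUE)
) %>%
  mutate(
    # Hypothetical rule: long lead times and few special requests cancel more often
    p_cancel = plogis(-2 + 0.015 * lead_time - 0.8 * no_of_special_requests),
    canceled = factor(
      if_else(runif(n) < p_cancel, "Canceled", "Not_Canceled"),
      levels = c("Not_Canceled", "Canceled")
    )
  ) %>%
  select(-p_cancel)

toy_split <- initial_split(toy, prop = 0.7, strata = canceled)

# Same engine as above, but in classification mode with a factor outcome
clf_fit <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification") %>%
  fit(canceled ~ ., data = training(toy_split))

# type = "prob" returns one column per class: .pred_Not_Canceled and .pred_Canceled
clf_results <- predict(clf_fit, testing(toy_split), type = "prob") %>%
  bind_cols(testing(toy_split) %>% select(canceled))

# "Canceled" is the second factor level, hence event_level = "second"
clf_auc <- roc_auc(clf_results, truth = canceled, .pred_Canceled,
                   event_level = "second")
clf_auc
```

With a factor outcome there is no need to convert canceled with as.factor before calling roc_auc, and the predicted column is an explicit probability of cancellation.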
Decision Tree Ensemble / Random Forest Model
Improving the ROC AUC
set.seed(123)
# Specify Random Forest Model
rf_spec <- rand_forest() %>%
  set_engine('ranger') %>%
  set_mode('regression')
rf_fit <- rf_spec %>% fit(canceled ~ ., data = train)
# Make predictions on test data using ensemble model and combine
rf_results <- predict(rf_fit, test) %>%
  cbind(test$canceled, .)
colnames(rf_results) <- c('canceled', 'pred_prob')
# Calculate and plot roc auc
rf_results %>%
  mutate(canceled = as.factor(canceled)) %>%
  roc_auc(., canceled, pred_prob, event_level = 'second')
rf_results %>%
  mutate(canceled = as.factor(canceled)) %>%
  roc_curve(., canceled, pred_prob, event_level = 'second') %>%
  autoplot()
Making Predictions using the Random Forest Model
Plotting the Final Results
The final ensemble model separates the two classes well on held-out data, identifying both true negatives and true positives. The ROC AUC of its predictions on the withheld (test) data is 0.95, close to the maximum of 1.0. An ROC AUC of 0.50 would indicate the model does no better than a random guess.
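To make the 0.50 and 1.0 reference points concrete, here is a minimal base R sketch of the rank-based (Mann-Whitney) form of ROC AUC, applied to simulated labels and scores. The helper auc_rank and the simulated vectors are hypothetical names introduced for illustration; the analysis itself relies on roc_auc from yardstick.

```r
# Rank-based ROC AUC: the probability that a randomly chosen positive case
# gets a higher score than a randomly chosen negative case.
# `auc_rank` is a hypothetical helper, not part of yardstick.
auc_rank <- function(truth, score) {
  r <- rank(score)            # average ranks, so tied scores count as half
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(123)
truth <- rbinom(2000, 1, 0.3)   # simulated 0/1 outcome, about 30% positives

# Scores unrelated to the outcome: AUC should land near 0.5
auc_random <- auc_rank(truth, runif(2000))

# Every positive scores above every negative: AUC is exactly 1
auc_perfect <- auc_rank(truth, truth + runif(2000))

c(random = auc_random, perfect = auc_perfect)
```

Read this way, an AUC of 0.95 says that a randomly chosen cancelled booking outranks a randomly chosen fulfilled booking in roughly 95 out of 100 such pairs.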
# Making predictions on the entire data set
# (this includes the training rows, so most of these scores are in-sample)
final_results <- predict(rf_fit, data) %>% cbind(data$canceled, .)
colnames(final_results) <- c('canceled', 'pred_prob')
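The pred_prob column holds continuous scores, so counting true positives and true negatives requires a cut-off. The sketch below uses a small made-up data frame with the same two column names and an assumed threshold of 0.5; neither the scores nor the threshold come from this analysis, and toy_results, threshold and cm are illustrative names.

```r
# Made-up scores for illustration; these are not outputs of rf_fit.
toy_results <- data.frame(
  canceled  = c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0),
  pred_prob = c(0.05, 0.20, 0.62, 0.10, 0.91, 0.45, 0.77, 0.30, 0.58, 0.02)
)

threshold <- 0.5  # assumed cut-off; the analysis above does not choose one

# Turn scores into 0/1 class predictions
toy_results$pred_class <- as.integer(toy_results$pred_prob >= threshold)

# Confusion matrix: rows are actual outcomes, columns are predicted classes
cm <- table(actual = toy_results$canceled, predicted = toy_results$pred_class)
cm

# Share of actual cancellations that were flagged
sensitivity <- cm["1", "1"] / sum(cm["1", ])
# Share of fulfilled bookings that were not flagged
specificity <- cm["0", "0"] / sum(cm["0", ])
```

Lowering the threshold flags more bookings as likely cancellations, trading specificity for sensitivity; the ROC curve plotted earlier traces that trade-off across all thresholds.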