Competition - predict hotel cancellation

Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed to increase revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

Column	Description
`Booking_ID`	Unique identifier of the booking.
`no_of_adults`	The number of adults.
`no_of_children`	The number of children.
`no_of_weekend_nights`	Number of weekend nights (Saturday or Sunday).
`no_of_week_nights`	Number of week nights (Monday to Friday).
`type_of_meal_plan`	Type of meal plan included in the booking.
`required_car_parking_space`	Whether a car parking space is required.
`room_type_reserved`	The type of room reserved.
`lead_time`	Number of days before the arrival date the booking was made.
`arrival_year`	Year of arrival.
`arrival_month`	Month of arrival.
`arrival_date`	Date of the month for arrival.
`market_segment_type`	How the booking was made.
`repeated_guest`	Whether the guest has previously stayed at the hotel.
`no_of_previous_cancellations`	Number of previous cancellations.
`no_of_previous_bookings_not_canceled`	Number of previous bookings that were canceled.
`avg_price_per_room`	Average price per day of the booking.
`no_of_special_requests`	Count of special requests made as part of the booking.
`booking_status`	Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

The Challenge

Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.

Judging Criteria

CATEGORY	WEIGHTING	DETAILS
Recommendations	35%	Clarity of recommendations - how clear and well presented the recommendation is. Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid? Number of relevant insights found for the target audience.
Storytelling	35%	How well the data and insights are connected to the recommendation. How the narrative and whole report connects together. Balancing making the report in-depth enough but also concise.
Visualizations	20%	Appropriateness of visualization used. Clarity of insight from visualization.
Votes	10%	Up voting - most upvoted entries get the most points.

Checklist before publishing

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your work.
Check that all the cells run without error.

Time is ticking. Good luck!

source('https://raw.githubusercontent.com/ribeiropvb/Util/main/install_packages_if_not_exist.R')
source('https://raw.githubusercontent.com/ribeiropvb/Util/main/comb_lv1.R')

packages <- c(
	'VIM', 'fields', 'janitor', 'mice', 'magrittr', 'missForest'
    , 'MASS', 'corrplot', 'car', 'hnp', 'statmod', 'rsample'
	, 'pacman', 'caret', 'pROC', 'mltools', 'gridExtra'
	, 'patchwork', 'ggstatsplot', 'tornado'
)
install_packages_if_not_exist(packages)

pacman::p_load(
	tidyverse, VIM, fields, janitor, mice, magrittr
	, MASS, corrplot, car, hnp, statmod, rsample
	, caret, pROC, mltools, patchwork, tornado
)

theme_default <- theme_light()+
  theme(
    panel.grid.major = element_blank()
    , panel.grid.minor = element_blank()
  )

hotels <- readr::read_csv('data/hotel_bookings.csv', show_col_types = FALSE)
hotels <- hotels[,-1]

hotels %<>% mutate_if(is.character, as.factor) 
hotels %>% str()

NA Analysis

First of all will check the data's cleanliness and completeness. To address the issue of missing values, the analysis began by checking for the presence of "NA". Once confirmed, the prevalence of missing values was evaluated to determine their impact on the data.

The percentage of missing values for each variable was calculated, and a visual examination of the distribution of missing values was conducted across the dataset to identify patterns and the randomness of the NA's. Based on the analysis, a strategy for handling missing values was developed, taking into consideration their prevalence and the nature of the data, with the use of the package VIM.

We choose to use the knn algorith to input the data. This strategy was employed to ensure the validity and reliability of subsequent analyses.

fields::stats(hotels) %>% t() %>%
	.[,c(1:3,9)] %>% as.data.frame() %>%
	rownames_to_column() %>%
	as_tibble() %>%
	janitor::clean_names() %>% drop_na() %>% 
	mutate( na_prevalence = missing_values/n*100 )
hotels %>% VIM::aggr()

Based on the plot above, it can be observed that there is a relatively small proportion of missing values in each column, and there doesn't seem to be any discernible pattern or rule governing the presence of missing values. Therefore, it is reasonable to assume that the missing values follow a random pattern.

To impute the missing values in the data, we can use the KNN (K-Nearest Neighbors) method, which is implemented in the VIM package. By using the KNN method, we predicted the missing values and impute them in the data.

t <- Sys.time()
data <- hotels %>% 
  VIM::kNN() %>% 
  dplyr::select(-ends_with('imp'))
Sys.time()-t

fields::stats(data) %>% t() %>%
	.[,c(1:3,9)] %>% as.data.frame() %>%
	rownames_to_column() %>%
	as_tibble() %>%
	janitor::clean_names() %>%
	mutate( na_prevalence = missing_values/n ) %>% 
	na.omit()
data %>% VIM::aggr()

The imputed values didn't have a significative diference on the observed values, therefore we'll use the data without NA's.

percent_cancel <- inner_join(
    data %>% group_by(arrival_month, booking_status) %>% 
        summarise( n_cancel = n()) %>% 
        filter(booking_status == 'Canceled'),
    data %>% group_by(arrival_month) %>% summarise( n_booking = n())
    , by = 'arrival_month'
) %>% mutate(
	percent_cancel = round(n_cancel/n_booking*100, 2)
    , abbr_month = month.abb[arrival_month]
)

‌
‌
‌