Predicting Hotel Cancellations
🏨 Background
You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Challenge
- Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.
Executive summary
Goal
Identify what contributes to a booking being fulfilled or cancelled.
Analysis
The hotel bookings dataset was analysed in several steps to identify insights and the factors contributing to booking cancellations. These steps were:
- Cursory dataset analysis - check for missing values and the types of values available
- Exploratory data analysis - identify the features most associated with booking cancellations
- Missing values and feature engineering - impute missing values from the available data to limit how many remain, and engineer new features from existing ones, such as the date on which a guest made the booking
- Modelling hotel cancellations - fit a classification model to confirm the features that contribute most to cancellations
Results and recommendations
Features contributing to a booking being cancelled:
- Lead time from the booking date to the arrival date. The longer the lead time, the higher the chance of the booking being cancelled
- Price per room
- Date of booking - bookings made earlier in the year have a higher chance of being cancelled
- Bookings made online have a higher chance of cancellation
Features contributing to a booking being fulfilled:
- Number of special requests above 3
- Repeated guest
- Corporate and complementary bookings have the lowest proportion of cancellation, but also are not a significant market segment.
Recommendations
- For longer lead time to arrival, offer more attractive room prices and/or additional perks (such as free parking)
- Consider seasonality when offering discounts and perks - offer better deals in the first part of the year than later.
- Expand into the corporate market segment. Corporate customers tend not to cancel their reservations.
- Send a questionnaire when a booking is cancelled, to gather more concrete information about the reasons for cancellation
The Dataset
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
Column | Description |
---|---|
Booking_ID | Unique identifier of the booking. |
no_of_adults | The number of adults. |
no_of_children | The number of children. |
no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
no_of_week_nights | Number of week nights (Monday to Friday). |
type_of_meal_plan | Type of meal plan included in the booking. |
required_car_parking_space | Whether a car parking space is required. |
room_type_reserved | The type of room reserved. |
lead_time | Number of days before the arrival date the booking was made. |
arrival_year | Year of arrival. |
arrival_month | Month of arrival. |
arrival_date | Date of the month for arrival. |
market_segment_type | How the booking was made. |
repeated_guest | Whether the guest has previously stayed at the hotel. |
no_of_previous_cancellations | Number of previous cancellations. |
no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
avg_price_per_room | Average price per day of the booking. |
no_of_special_requests | Count of special requests made as part of the booking. |
booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
# import common modules
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
hotels.info()
Only Booking_ID and booking_status do not have missing values. Booking_ID does not bring additional information as it is the unique ID, so let's drop it from further analysis. In addition, the first record has null in every feature, so let's also remove any other records in which all features are null.
Note: once we start exploring the dataset, we may identify more features with limited impact on our label, which could be dropped before modeling starts.
hotels.drop('Booking_ID', axis = 1, inplace=True)
hotels = hotels[~hotels.drop(['booking_status'], axis=1).isnull().all(axis=1)]
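To make the effect of the row filter concrete, here is a minimal sketch on a toy frame. The toy values are invented for illustration and are not taken from the hotel data. A row is removed only when every feature other than booking_status is null, so partially filled rows are kept for later imputation.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "lead_time": [np.nan, 10.0, np.nan],
    "avg_price_per_room": [np.nan, 95.5, 120.0],
    "booking_status": ["Not_Canceled", "Canceled", "Canceled"],
})

# True for rows where every feature (everything except the label) is null.
all_features_null = toy.drop(columns="booking_status").isnull().all(axis=1)

# Keep the remaining rows; partially filled rows survive.
toy_clean = toy[~all_features_null]
print(toy_clean)
```

Only the first toy row is dropped, because the third row still carries a price.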
General association of all features
Before changing anything in the dataset, let's explore any associations which may exist, keeping in mind that the label booking_status is the one of interest for our prediction.
The dataset contains both numerical and categorical variables, so let's use the dython package, which can quantify associations amongst all types of features together. For more information about the dython package, see its documentation.
The associations function returns a dictionary in which the 'corr' entry holds the associations as a dataframe. We can therefore extract the values directly from this dataframe and compare the associations of different features with the booking_status label.
%%capture
!pip install dython
from dython.nominal import associations
import matplotlib.patches as patches
associations_result= associations(hotels,figsize=(10,8), fmt='.1f', multiprocessing=True, hide_rows='booking_status', nan_strategy='drop_samples');
#rect = patches.Rectangle((0, 0), 40, 40, linewidth=3, edgecolor='r', facecolor='r')
#ax.add_patch(rect)
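As a sketch of how the values for the label could be pulled out and ranked, the helper below takes an association matrix and a label name. The helper name rank_label_associations and the toy matrix are hypothetical, and it assumes the matrix has a column named after the label, as a 'corr' dataframe would here.

```python
import pandas as pd

def rank_label_associations(corr: pd.DataFrame, label: str) -> pd.Series:
    """Return each feature's association with `label`, strongest first."""
    col = corr[label]
    # Drop the label's own entry if it is present in the index.
    col = col.drop(labels=label, errors="ignore")
    # Order by absolute strength while keeping the original sign.
    order = col.abs().sort_values(ascending=False).index
    return col.loc[order]

# Toy association matrix, invented for illustration only.
toy_corr = pd.DataFrame(
    {"booking_status": [0.4, -0.1, 0.2]},
    index=["lead_time", "repeated_guest", "avg_price_per_room"],
)
ranked = rank_label_associations(toy_corr, "booking_status")
print(ranked)
```

With the objects from the cell above, the equivalent call would be rank_label_associations(associations_result['corr'], 'booking_status').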
The associations heatmap above was computed after dropping rows that contain NaN values (nan_strategy='drop_samples'). The following associations can be identified:
- Strong association between room_type_reserved and no_of_children
- Association between room_type_reserved and avg_price_per_room
- Negative association between arrival_month and arrival_year
- Association amongst no_of_previous_bookings_not_canceled, repeated_guest and no_of_previous_cancellations
- booking_status (our label of interest) seems to be associated with lead_time
These relationships could also be used during imputation of missing values.
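One way such a relationship might be used for imputation is to fill a missing value with the median of rows that share an associated feature. The sketch below is an assumption about how this could be done, not necessarily the method used later in this analysis. The helper name impute_by_group_median and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_by_group_median(df: pd.DataFrame, target: str, by: str) -> pd.DataFrame:
    """Fill missing `target` values with the median of rows sharing the same `by` value."""
    out = df.copy()
    # Median of `target` within each `by` group, aligned to the original rows.
    group_median = out.groupby(by)[target].transform("median")
    out[target] = out[target].fillna(group_median)
    return out

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "room_type_reserved": ["Room_Type 1", "Room_Type 1", "Room_Type 1",
                           "Room_Type 4", "Room_Type 4"],
    "avg_price_per_room": [80.0, 100.0, np.nan, 150.0, np.nan],
})
filled = impute_by_group_median(toy, "avg_price_per_room", "room_type_reserved")
print(filled)
```

Rows whose grouping value is itself missing are left unfilled by this approach and would need a fallback, such as the overall median.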
Value datatypes
From the table of value types above, it is possible to identify which variables should probably be treated as categorical and which as numerical. Several features already have the datatype float64. However, that type represents floating point numbers, and we can assume that no_of_adults or no_of_children can only be whole numbers. In fact, every numerical feature other than avg_price_per_room should be represented by an integer.
Before we proceed, let's convert categorical variables to the categorical datatype.
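A minimal sketch of such a conversion on a toy frame follows. Which columns are treated as categorical or as whole-number counts is an assumption made for illustration. Using pandas' nullable Int64 type so that integer columns can still hold missing values is also an assumption, not something stated above.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "no_of_adults": [2.0, np.nan, 1.0],
    "type_of_meal_plan": ["Meal Plan 1", None, "Not Selected"],
    "avg_price_per_room": [95.5, 120.0, np.nan],
})

categorical_cols = ["type_of_meal_plan"]  # assumed categorical
count_cols = ["no_of_adults"]             # assumed whole-number counts

# Build one dtype mapping and apply it in a single astype call.
dtype_map = {col: "category" for col in categorical_cols}
dtype_map.update({col: "Int64" for col in count_cols})  # nullable integer
toy = toy.astype(dtype_map)

print(toy.dtypes)
```

The price column is left as float64, and the missing count stays missing instead of being forced to a number.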