Skip to content
How to deal with hotel booking cancellation?
How to deal with booking cancellation?
Without a doubt, one of the most challenging problems of hospitality industry is booking cancellation. An increase in the rate of hotel booking cancellation caused concerns in hospitality industry. If the customers cancel their bookings, this means that hotels lose their revnue and opportunity to sell their rooms. As a data science team, we are going to develop model to predict booking cancellation in order to minimize loss.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
| Column | Description |
|---|---|
Booking_ID | Unique identifier of the booking. |
no_of_adults | The number of adults. |
no_of_children | The number of children. |
no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
no_of_week_nights | Number of week nights (Monday to Friday). |
type_of_meal_plan | Type of meal plan included in the booking. |
required_car_parking_space | Whether a car parking space is required. |
room_type_reserved | The type of room reserved. |
lead_time | Number of days before the arrival date the booking was made. |
arrival_year | Year of arrival. |
arrival_month | Month of arrival. |
arrival_date | Date of the month for arrival. |
market_segment_type | How the booking was made. |
repeated_guest | Whether the guest has previously stayed at the hotel. |
no_of_previous_cancellations | Number of previous cancellations. |
no_of_previous_bookings_not_canceled | Number of previous bookings that were canceled. |
avg_price_per_room | Average price per day of the booking. |
no_of_special_requests | Count of special requests made as part of the booking. |
booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date, timedelta
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels.head(10)Key insights
- Online and Offline market segment contribute over 90% of bookings. However, 95% of canceled bookings result from Online and Offline market segment.
- Online, Offline and Other market segment have cancellation rate above overall cancellation rate.
- About 80% of customers made booking prior check-in less than 6 months. Surprisingly, customers who made booking prior check-in over 6 months have a higher chance of booking cancellation than those who made booking prior check-in less than 6 months
- There was a significant increase in booking cancellation in both July 2017 and July 2018. Cancellation rate reached the peak at 40%.
- Customers usually get a better booking price if they make booking prior check-in for a long time.
- It is no denying the fact that the longer lead time booking is made, the cheaper price they get. The customers who canceled booking tend to get higher price than who did not no matter how long they made booking prior check-in.
- Customers who did not make any special requests have 40% chance of booking cancellation which is greater than those who did not up to 20%.
- Customers who made booking for long night stay (over 3 weeks) have almost 80% chance of booking cancellation
- Cancellation are likely on Sunday with 37% of cancellation rate
Handling missing values
Drop data:
- Drop data which most of field values contain missing values (one record)
- Drop data which arrival_year, arrival_month or arrival_date is missing (4 percent of data) as I will explore the timeline of bookings
- Other columns, we will fill missing values data with different logic
Imputation:
- Fill market_segment_type missing value with Other
- Fill type_of_meal_plan missing value with mode
- Fill no_of_week_nights missing value with median
- Fill no_of_weekend_nights missing value with median
- Fill no_of_special_requests missing value with median
- Fill no_of_adults missing value with median
- Fill no_of_children missing value with median
- Fill required_car_parking_space missing value with mode
- Fill room_type_reserved missing value with mode
- Fill repeated_guest missing value with criteria as below:
- if no_of_previous_bookings_not_canceled equal to zero and no_of_previous_bookings_not_canceled is not null then repeated_guest equal to one else set repeated_guest to zero
- Fill no_of_previous_cancellations missing value with median
- Fill no_of_previous_bookings_not_canceled missing value with criteria as below:
- if repeated_guest equal to zero then zero else fill missing value with median
# Drop data where the record contain missing value
hotels = hotels[~(hotels.isnull().sum(axis = 1) == 17)]
# As we would like to explore the demand of hotel in each time period, we will drop the records are missing at least one of the following columns: arrival_year, arrival_month and arrival_date
hotels = hotels[~hotels[["arrival_year", "arrival_month", "arrival_date"]].isnull().any(axis = 1)]hotels["lead_time"].fillna(hotels["lead_time"].median(), inplace = True)
hotels["lead_time_in_month"] = hotels["lead_time"].apply(lambda x: round(x / 30) if ~np.isnan(x) else x)
hotels["market_segment_type"].fillna("Other", inplace = True)
hotels["type_of_meal_plan"].fillna("Not Selected", inplace = True)
hotels["no_of_weekend_nights"].fillna(hotels["no_of_weekend_nights"].mode()[0], inplace = True)
hotels["no_of_week_nights"].fillna(hotels["no_of_week_nights"].mode()[0], inplace = True)
hotels["no_of_special_requests"].fillna(hotels["no_of_special_requests"].median(), inplace = True)
hotels["no_of_adults"].fillna(hotels["no_of_adults"].mode()[0], inplace = True)
hotels["no_of_children"].fillna(hotels["no_of_children"].mode()[0], inplace = True)
hotels["required_car_parking_space"].fillna(hotels["required_car_parking_space"].mode()[0], inplace = True)
hotels["room_type_reserved"].fillna(hotels["room_type_reserved"].mode()[0], inplace = True)
hotels.head()def replace_value(group, col, value):
if group == 0 and np.isnan(value):
return 0
elif group == 1 and np.isnan(value):
return hotels[hotels["repeated_guest"] == group].loc[:, col].median()
else:
return value
hotels["repeated_guest"] = hotels.apply(
lambda row: 1 if row["no_of_previous_bookings_not_canceled"] != 0 and ~np.isnan(row["no_of_previous_bookings_not_canceled"]) else 0.0, axis = 1)
hotels["no_of_previous_cancellations"].fillna(hotels["no_of_previous_cancellations"].median(), inplace = True)
hotels["no_of_previous_bookings_not_canceled"] = hotels.apply(
lambda row: replace_value(row["repeated_guest"], "no_of_previous_bookings_not_canceled", row["no_of_previous_bookings_not_canceled"]), axis = 1
)
hotels["avg_price_per_room"] = hotels.apply(
lambda row: replace_value(row["repeated_guest"], "avg_price_per_room", row["avg_price_per_room"]), axis = 1
)def to_date(y, m, d):
if ~np.isnan(y) and ~np.isnan(m) and ~np.isnan(d):
try:
arrival_date = date(round(y), round(m), round(d))
except:
arrival_date = date(round(y), round(m), round(d - 1))
return arrival_date
else:
return np.nan
hotels["date"] = hotels.apply(lambda row:
to_date(row["arrival_year"], row["arrival_month"], row["arrival_date"]),
axis = 1
)
hotels["checkin_date"] = pd.to_datetime(hotels["date"])
hotels["checkin_month"] = hotels["checkin_date"].dt.strftime("%Y-%m")
hotels["checkin_weekday"] = hotels["checkin_date"].dt.weekday
hotels["checkin_dayname"] = hotels["checkin_date"].dt.day_name()Exploratory Data Analysis
How often do customers canceled the bookings?
First, we will see how often customers canceled bookings
Hidden code