Skip to content
0

How to deal with booking cancellation?

Without a doubt, one of the most challenging problems of hospitality industry is booking cancellation. An increase in the rate of hotel booking cancellation caused concerns in hospitality industry. If the customers cancel their bookings, this means that hotels lose their revnue and opportunity to sell their rooms. As a data science team, we are going to develop model to predict booking cancellation in order to minimize loss.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

ColumnDescription
Booking_IDUnique identifier of the booking.
no_of_adultsThe number of adults.
no_of_childrenThe number of children.
no_of_weekend_nightsNumber of weekend nights (Saturday or Sunday).
no_of_week_nightsNumber of week nights (Monday to Friday).
type_of_meal_planType of meal plan included in the booking.
required_car_parking_spaceWhether a car parking space is required.
room_type_reservedThe type of room reserved.
lead_timeNumber of days before the arrival date the booking was made.
arrival_yearYear of arrival.
arrival_monthMonth of arrival.
arrival_dateDate of the month for arrival.
market_segment_typeHow the booking was made.
repeated_guestWhether the guest has previously stayed at the hotel.
no_of_previous_cancellationsNumber of previous cancellations.
no_of_previous_bookings_not_canceledNumber of previous bookings that were canceled.
avg_price_per_roomAverage price per day of the booking.
no_of_special_requestsCount of special requests made as part of the booking.
booking_statusWhether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import date, timedelta

hotels = pd.read_csv("data/hotel_bookings.csv")
hotels.head(10)

Key insights

  • Online and Offline market segment contribute over 90% of bookings. However, 95% of canceled bookings result from Online and Offline market segment.
  • Online, Offline and Other market segment have cancellation rate above overall cancellation rate.
  • About 80% of customers made booking prior check-in less than 6 months. Surprisingly, customers who made booking prior check-in over 6 months have a higher chance of booking cancellation than those who made booking prior check-in less than 6 months
  • There was a significant increase in booking cancellation in both July 2017 and July 2018. Cancellation rate reached the peak at 40%.
  • Customers usually get a better booking price if they make booking prior check-in for a long time.
  • It is no denying the fact that the longer lead time booking is made, the cheaper price they get. The customers who canceled booking tend to get higher price than who did not no matter how long they made booking prior check-in.
  • Customers who did not make any special requests have 40% chance of booking cancellation which is greater than those who did not up to 20%.
  • Customers who made booking for long night stay (over 3 weeks) have almost 80% chance of booking cancellation
  • Cancellation are likely on Sunday with 37% of cancellation rate

Handling missing values

Drop data:

  • Drop data which most of field values contain missing values (one record)
  • Drop data which arrival_year, arrival_month or arrival_date is missing (4 percent of data) as I will explore the timeline of bookings
  • Other columns, we will fill missing values data with different logic

Imputation:

  • Fill market_segment_type missing value with Other
  • Fill type_of_meal_plan missing value with mode
  • Fill no_of_week_nights missing value with median
  • Fill no_of_weekend_nights missing value with median
  • Fill no_of_special_requests missing value with median
  • Fill no_of_adults missing value with median
  • Fill no_of_children missing value with median
  • Fill required_car_parking_space missing value with mode
  • Fill room_type_reserved missing value with mode
  • Fill repeated_guest missing value with criteria as below:
    • if no_of_previous_bookings_not_canceled equal to zero and no_of_previous_bookings_not_canceled is not null then repeated_guest equal to one else set repeated_guest to zero
  • Fill no_of_previous_cancellations missing value with median
  • Fill no_of_previous_bookings_not_canceled missing value with criteria as below:
    • if repeated_guest equal to zero then zero else fill missing value with median
# Drop data where the record contain missing value
hotels = hotels[~(hotels.isnull().sum(axis = 1) == 17)]
# As we would like to explore the demand of hotel in each time period, we will drop the records are missing at least one of the following columns: arrival_year, arrival_month and arrival_date
hotels = hotels[~hotels[["arrival_year", "arrival_month", "arrival_date"]].isnull().any(axis = 1)]
hotels["lead_time"].fillna(hotels["lead_time"].median(), inplace = True)
hotels["lead_time_in_month"] = hotels["lead_time"].apply(lambda x: round(x / 30) if ~np.isnan(x) else x)

hotels["market_segment_type"].fillna("Other", inplace = True)
hotels["type_of_meal_plan"].fillna("Not Selected", inplace = True)

hotels["no_of_weekend_nights"].fillna(hotels["no_of_weekend_nights"].mode()[0], inplace = True)
hotels["no_of_week_nights"].fillna(hotels["no_of_week_nights"].mode()[0], inplace = True)

hotels["no_of_special_requests"].fillna(hotels["no_of_special_requests"].median(), inplace = True)

hotels["no_of_adults"].fillna(hotels["no_of_adults"].mode()[0], inplace = True)
hotels["no_of_children"].fillna(hotels["no_of_children"].mode()[0], inplace = True)
hotels["required_car_parking_space"].fillna(hotels["required_car_parking_space"].mode()[0], inplace = True)
hotels["room_type_reserved"].fillna(hotels["room_type_reserved"].mode()[0], inplace = True)

hotels.head()
def replace_value(group, col, value):
    if group == 0 and np.isnan(value):
        return 0
    elif group == 1 and np.isnan(value):
        return hotels[hotels["repeated_guest"] == group].loc[:, col].median()
    else:
        return value
        
hotels["repeated_guest"] = hotels.apply(
    lambda row: 1 if row["no_of_previous_bookings_not_canceled"] != 0 and ~np.isnan(row["no_of_previous_bookings_not_canceled"]) else 0.0, axis = 1)

hotels["no_of_previous_cancellations"].fillna(hotels["no_of_previous_cancellations"].median(), inplace = True)

hotels["no_of_previous_bookings_not_canceled"] = hotels.apply(
    lambda row: replace_value(row["repeated_guest"], "no_of_previous_bookings_not_canceled", row["no_of_previous_bookings_not_canceled"]), axis = 1
)

hotels["avg_price_per_room"] = hotels.apply(
    lambda row: replace_value(row["repeated_guest"], "avg_price_per_room", row["avg_price_per_room"]), axis = 1
)
def to_date(y, m, d):
    if ~np.isnan(y) and ~np.isnan(m) and ~np.isnan(d):
        try:
            arrival_date = date(round(y), round(m), round(d))
        except:
            arrival_date = date(round(y), round(m), round(d - 1))
        return arrival_date
    else:
        return np.nan

hotels["date"] = hotels.apply(lambda row: 
                  to_date(row["arrival_year"], row["arrival_month"], row["arrival_date"]), 
                  axis = 1
              )
hotels["checkin_date"] = pd.to_datetime(hotels["date"])
hotels["checkin_month"] = hotels["checkin_date"].dt.strftime("%Y-%m")
hotels["checkin_weekday"] = hotels["checkin_date"].dt.weekday
hotels["checkin_dayname"] = hotels["checkin_date"].dt.day_name()

Exploratory Data Analysis

How often do customers canceled the bookings?

First, we will see how often customers canceled bookings

Hidden code