Skip to content

What factors lead to guests cancelling their hotel bookings?

Predicting Hotel Cancellations

🏨 Background of the analysis

You are supporting a hotel with a project aimed to increase revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

ColumnDescription
Booking_IDUnique identifier of the booking.
no_of_adultsThe number of adults.
no_of_childrenThe number of children.
no_of_weekend_nightsNumber of weekend nights (Saturday or Sunday).
no_of_week_nightsNumber of week nights (Monday to Friday).
type_of_meal_planType of meal plan included in the booking.
required_car_parking_spaceWhether a car parking space is required.
room_type_reservedThe type of room reserved.
lead_timeNumber of days before the arrival date the booking was made.
arrival_yearYear of arrival.
arrival_monthMonth of arrival.
arrival_dateDate of the month for arrival.
market_segment_typeHow the booking was made.
repeated_guestWhether the guest has previously stayed at the hotel.
no_of_previous_cancellationsNumber of previous cancellations.
no_of_previous_bookings_not_canceledNumber of previous bookings that were canceled.
avg_price_per_roomAverage price per day of the booking.
no_of_special_requestsCount of special requests made as part of the booking.
booking_statusWhether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
# Let's look at the total number or rows and columns for this dataset

hotels.shape

Looking for missing values...

# Looking for missing values...

hotels.isna().sum()

4 hidden cells

What to do with the missing values?

A quick look at total missing values reveal the following as Top 3 columns with missing values:

required_car_parking_space

market_segment_type

room_type_reserved

The codes above explored all possible unique values we can find in those columns.

As the values are not dependant upon other variables (e.g. a guest who requires parking space does not indicate he/she prefers Room Type 1 over the others, etc), it is not suitable to replace the missing values with the median, mean, or extrapolated value with the given information.

The following operation will drop rows with missing values.

This takes the total number of rows from 36,275 to 27511 rows, 75.8% of the original amount.

hotels.dropna(inplace=True)

print(hotels)

hotels.shape

1 hidden cell

Next, we look at the number of cancelled vs not cancelled reservations

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='booking_status', data=hotels)
plt.title("Total number of Cancelled vs Not Cancelled Hotel bookings")

booking_count = hotels['booking_status'].value_counts()
print(booking_count)

About one-third of our data contains rows with cancelled reservations.

For context, we use the describe() function to have a statistical view of each column in our dataset.

For example, the average lead time (or no. of days ahead of check-in day) shows 85 days, but the standard deviation is large which indicates a very wide distribution of values. True enough, the median (value at 50%) is 57 days while the maximum recorded is 443 days.

# statistical view of the values

hotels.describe()

We will now focus on the rows with cancelled reservations.

The findings suggest that lead time numbers for cancelled bookings are higher overall, when contrasted against lead time of reservations resulting in no cancellations. We shall see this below.

The median lead time for cancelled reservations is 57 days, while the median for non-cancelled reservations is 38 days.