What factors lead to guests cancelling their hotel bookings?
Predicting Hotel Cancellations
🏨 Background of the analysis
You are supporting a hotel with a project aimed to increase revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv
, which contains the following:
Column | Description |
---|---|
Booking_ID | Unique identifier of the booking. |
no_of_adults | The number of adults. |
no_of_children | The number of children. |
no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
no_of_week_nights | Number of week nights (Monday to Friday). |
type_of_meal_plan | Type of meal plan included in the booking. |
required_car_parking_space | Whether a car parking space is required. |
room_type_reserved | The type of room reserved. |
lead_time | Number of days before the arrival date the booking was made. |
arrival_year | Year of arrival. |
arrival_month | Month of arrival. |
arrival_date | Date of the month for arrival. |
market_segment_type | How the booking was made. |
repeated_guest | Whether the guest has previously stayed at the hotel. |
no_of_previous_cancellations | Number of previous cancellations. |
no_of_previous_bookings_not_canceled | Number of previous bookings that were canceled. |
avg_price_per_room | Average price per day of the booking. |
no_of_special_requests | Count of special requests made as part of the booking. |
booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
# Let's look at the total number or rows and columns for this dataset
hotels.shape
Looking for missing values...
# Looking for missing values...
hotels.isna().sum()
4 hidden cells
What to do with the missing values?
A quick look at total missing values reveal the following as Top 3 columns with missing values:
required_car_parking_space
market_segment_type
room_type_reserved
The codes above explored all possible unique values we can find in those columns.
As the values are not dependant upon other variables (e.g. a guest who requires parking space does not indicate he/she prefers Room Type 1 over the others, etc), it is not suitable to replace the missing values with the median, mean, or extrapolated value with the given information.
The following operation will drop rows with missing values.
This takes the total number of rows from 36,275 to 27511 rows, 75.8% of the original amount.
hotels.dropna(inplace=True)
print(hotels)
hotels.shape
1 hidden cell
Next, we look at the number of cancelled vs not cancelled reservations
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='booking_status', data=hotels)
plt.title("Total number of Cancelled vs Not Cancelled Hotel bookings")
booking_count = hotels['booking_status'].value_counts()
print(booking_count)
About one-third of our data contains rows with cancelled reservations.
For context, we use the describe() function to have a statistical view of each column in our dataset.
For example, the average lead time (or no. of days ahead of check-in day) shows 85 days, but the standard deviation is large which indicates a very wide distribution of values. True enough, the median (value at 50%) is 57 days while the maximum recorded is 443 days.
# statistical view of the values
hotels.describe()
We will now focus on the rows with cancelled reservations.
The findings suggest that lead time numbers for cancelled bookings are higher overall, when contrasted against lead time of reservations resulting in no cancellations. We shall see this below.
The median lead time for cancelled reservations is 57 days, while the median for non-cancelled reservations is 38 days.