Skip to content

Hotel Booking Cancellations

This dataset consists of booking data from a city hotel and a resort hotel. It includes many details about the bookings, including room specifications, the length of stay, the time between the booking and the stay, whether the booking was canceled, and how the booking was made. The data was gathered between July 2015 and August 2017.

suppressPackageStartupMessages(library(tidyverse))

read_csv('data/hotel_bookings_clean.csv', show_col_types = FALSE)

Data Dictionary

For binary variables: 1 = true and 0 = false.

ColumnExplanation
is_canceledBinary variable indicating whether a booking was canceled
lead timeNumber of days between booking date and arrival date
arrival_date_week_number, arrival_date_day_of_month, arrival_date_monthWeek number, day date, and month number of arrival date
stays_in_weekend_nights, stays_in_week_nightsNumber of weekend nights (Saturday and Sunday) and weeknights (Monday to Friday) the customer booked
adults, children, babiesNumber of adults, children, babies booked for the stay
is_repeated_guestBinary variable indicating whether the customer was a repeat guest
previous_cancellationsNumber of prior bookings that were canceled by the customer
previous_bookings_not_canceledNumber of prior bookings that were not canceled by the customer
required_car_parking_spacesNumber of parking spaces requested by the customer
total_of_special_requestsNumber of special requests made by the customer
avg_daily_rateAverage daily rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights
booked_by_companyBinary variable indicating whether a company booked the booking
booked_by_agentBinary variable indicating whether an agent booked the booking
hotel_CityBinary variable indicating whether the booked hotel is a "City Hotel"
hotel_ResortBinary variable indicating whether the booked hotel is a "Resort Hotel"
meal_BBBinary variable indicating whether a bed & breakfast meal was booked
meal_HBBinary variable indicating whether a half board meal was booked
meal_FBBinary variable indicating whether a full board meal was booked
meal_No_mealBinary variable indicating whether there was no meal package booked
market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_UndefinedIndicates market segment designation with a value of 1. "TA"= travel agent, "TO"= tour operators
distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_UndefinedIndicates booking distribution channel with a value of 1. "TA"= travel agent, "TO"= tour operators, "GDS" = Global Distribution System
reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_LIndicates code of room type reserved with a value of 1. Code is presented instead of designation for anonymity reasons
deposit_type_No_DepositBinary variable indicating whether a deposit was made
deposit_type_Non_RefundBinary variable indicating whether a deposit was made in the value of the total stay cost
deposit_type_RefundableBinary variable indicating whether a deposit was made with a value under the total stay cost
customer_type_ContractBinary variable indicating whether the booking has an allotment or other type of contract associated to it
customer_type_GroupBinary variable indicating whether the booking is associated to a group
customer_type_TransientBinary variable indicating whether the booking is not part of a group or contract, and is not associated to other transient booking
customer_type_Transient-PartyBinary variable indicating whether the booking is transient, but is associated to at least another transient booking

Source and license of data. The data is originally from an article called Hotel booking demand datasets by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

Introduction

A chain of hotels has noticed that the cancellation rate has risen in the past few years. This often leads to rooms that are left unrented for multiple days at a time.

Hotel management is interested in developing a model to predict the likelihood that a customer will cancel their reservation. If successful, this could be used to optimize their booking service and anticipate when cancellations will occur.

Which family sizes are associated with the highest cancellations?

First looking at the relationship between current bookings and family size, we see that a family size of 2 makes up the overwhelming majority of cancellations and that cancellations are clustered between 1 and 5 family sizes.

Spinner
Data frameas
df1
variable
SELECT 
	(adults + children + babies) AS family_size,
	count(*) as num_cancellations,
FROM 'data/hotel_bookings_clean.csv'
WHERE is_canceled = 1
	AND (adults + children + babies) IS NOT NULL
GROUP BY family_size
ORDER BY num_cancellations DESC;

1 hidden cell
hist(fam_size$family_size, main = 'Cancellations by Family Size', xlab = 'Family size',
	 breaks = 100)

When looking at previous cancellations and family size, we also see that the highest numbers of previous cancellations are associated with family sizes of 2.

Spinner
Data frameas
df2
variable
SELECT
	(adults + children + babies) AS family_size,
	sum(previous_cancellations) AS sum_previous_cancellations
FROM 'data/hotel_bookings_clean.csv'
WHERE is_canceled = 1
GROUP BY family_size
ORDER BY sum_previous_cancellations DESC
LIMIT 5;

Are longer lead times more likely to result in a cancellation?

Lets categorize the lead times in the following fashion: More than 1 year, Between 6 months and 1 year, Between 3 and 6 months, Between 1 and 3 months, Less than 1 month

Grouping the data in this way shows that the majority of the cancellations are approximately evenly clustered between 1 month to 1 year lead times and that bookings with really long lead times (greater than 1 year) only make up a small percentage of cancellations.

Spinner
Data frameas
df4
variable
SELECT 
	CASE WHEN lead_time > 365 THEN 'More than 1 year'
	WHEN lead_time >= 180 AND lead_time <= 365 THEN 'Between 6 months and 1 year'
	WHEN lead_time >= 90 AND lead_time <= 180 THEN 'Between 3 and 6 months'
	WHEN lead_time >= 30 AND lead_time <= 90 THEN 'Between 1 and 3 months'
	ELSE 'Less than 1 month' END AS lead_times,
	count(*) AS num_cancellations
FROM 'data/hotel_bookings_clean.csv'
WHERE is_canceled = 1
GROUP BY lead_times, is_canceled
ORDER BY lead_times DESC
;
barplot(height = df4$num_cancellations, 
		names.arg = c('> 1yr', '< 1mo', '6mo-1yr', '3-6mo', '1-3mo'), xlab = 'lead_times', 
		ylab = 'num_cancellations', main = 'Categorizing Lead Times for Cancelled Bookings')
Spinner
Data frameas
lead_timing
variable
SELECT 
	lead_time 
FROM 'data/hotel_bookings_clean.csv'
WHERE is_canceled = 1
Hidden output