Predicting Hotel Cancellations
🏨 Background
You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
Column | Description |
---|---|
Booking_ID | Unique identifier of the booking. |
no_of_adults | The number of adults. |
no_of_children | The number of children. |
no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
no_of_week_nights | Number of week nights (Monday to Friday). |
type_of_meal_plan | Type of meal plan included in the booking. |
required_car_parking_space | Whether a car parking space is required. |
room_type_reserved | The type of room reserved. |
lead_time | Number of days before the arrival date the booking was made. |
arrival_year | Year of arrival. |
arrival_month | Month of arrival. |
arrival_date | Date of the month for arrival. |
market_segment_type | How the booking was made. |
repeated_guest | Whether the guest has previously stayed at the hotel. |
no_of_previous_cancellations | Number of previous cancellations. |
no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
avg_price_per_room | Average price per day of the booking. |
no_of_special_requests | Count of special requests made as part of the booking. |
booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
The Challenge
- Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.
Note:
To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.
Judging Criteria
CATEGORY | WEIGHTING | DETAILS |
---|---|---|
Recommendations | 35% | |
Storytelling | 35% | |
Visualizations | 20% | |
Votes | 10% | |
Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your work.
- Check that all the cells run without error.
Time is ticking. Good luck!
To recommend which factors affect whether customers cancel their booking, we will have to investigate the data on multiple levels. The first phase is to compile the columns that appear likely to affect cancellations and examine their descriptive statistics.
import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels.shape
hotels.columns
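The descriptive statistics mentioned above could be compared between canceled and fulfilled bookings along the following lines. This is a minimal sketch, not the analysis itself: the helper `summarize_by_status` is a hypothetical name, the choice of columns is an assumption, and the tiny `sample` frame is made up purely for illustration (it is not taken from hotel_bookings.csv).

```python
import pandas as pd


def summarize_by_status(df, columns, status_col="booking_status"):
    """Return the mean and median of each column, split by booking status."""
    return df.groupby(status_col)[columns].agg(["mean", "median"])


# Made-up rows for illustration only; replace `sample` with the real
# `hotels` DataFrame to run this on the actual bookings data.
sample = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Canceled", "Not_Canceled"],
    "lead_time": [120, 10, 200, 30],
    "avg_price_per_room": [110.0, 90.0, 130.0, 70.0],
    "no_of_special_requests": [0, 2, 0, 1],
})

# Columns chosen here are an assumption about what might matter.
summary = summarize_by_status(
    sample, ["lead_time", "avg_price_per_room", "no_of_special_requests"]
)
print(summary)
```

Calling `summarize_by_status(hotels, [...])` on the loaded data would give one row per booking status, which makes large gaps between the two groups easy to spot before any modelling.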
#Calculate the % of Hotel reservations that have been canceled
booking_count = hotels["booking_status"].value_counts()
print(booking_count)
pct_canceled = booking_count["Canceled"] / \
(booking_count["Not_Canceled"] + booking_count["Canceled"])
# Scale to a percentage before rounding to avoid floating-point artifacts
pct_canceled = round(pct_canceled * 100, 2)
print(pct_canceled)
# Group by year to see which years had the highest share of cancellations
year_canceled = hotels.groupby("arrival_year")["booking_status"].value_counts(normalize=True)
print(round(year_canceled*100, 2))
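# Binary booking status
# Create a dictionary to map categories to binary values.
# Note: 1 means the booking was kept, so a positive correlation with this
# column indicates fewer cancellations.
booking_status_map = {'Canceled': 0, 'Not_Canceled': 1}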
# apply the mapping to the booking_status column
hotels['booking_status_binary'] = hotels['booking_status'].map(booking_status_map)
print(hotels)
# Correlation matrix
# 'number' selects every numeric dtype regardless of platform integer width
numeric_cols = hotels.select_dtypes(include='number').columns
correlation_matrix = hotels[numeric_cols].corr()
print(correlation_matrix)
import seaborn as sns
#Make concise Correlation table for visualisation
c_drop = ["no_of_adults", "no_of_children","no_of_weekend_nights", "no_of_week_nights", "required_car_parking_space","arrival_month","arrival_date", "no_of_previous_cancellations","no_of_previous_bookings_not_canceled"]
c_correlation_matrix = correlation_matrix.drop(c_drop, axis=0)
c_correlation_matrix = c_correlation_matrix.drop(c_drop, axis=1)
ax = sns.heatmap(c_correlation_matrix, annot=True, cmap='coolwarm', linewidths=1, robust=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
The chi-square test of independence is commonly used to test for an association between two categorical variables. Its output includes the test statistic, the p-value, the degrees of freedom and the expected frequencies. Code for testing the association between two categorical variables is readily available.
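As a rough illustration of the chi-square test described above, the sketch below cross-tabulates one categorical column against `booking_status` and passes the table to `scipy.stats.chi2_contingency`. The helper `chi_square_vs_status` is a hypothetical name, and the `sample` rows (including the segment labels) are made up to show the mechanics only; with counts this small the chi-square approximation is not reliable.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_vs_status(df, feature, status_col="booking_status"):
    """Chi-square test of independence between `feature` and booking status."""
    # Contingency table of observed counts (rows: feature, columns: status)
    table = pd.crosstab(df[feature], df[status_col])
    # For 2x2 tables SciPy applies Yates' continuity correction by default
    chi2, p_value, dof, expected = chi2_contingency(table)
    expected = pd.DataFrame(expected, index=table.index, columns=table.columns)
    return {"chi2": chi2, "p_value": p_value, "dof": dof, "expected": expected}


# Made-up rows for illustration only; use the real `hotels` DataFrame instead.
sample = pd.DataFrame({
    "market_segment_type": ["Online"] * 6 + ["Offline"] * 6,
    "booking_status": (["Canceled"] * 4 + ["Not_Canceled"] * 2
                       + ["Canceled"] * 1 + ["Not_Canceled"] * 5),
})

result = chi_square_vs_status(sample, "market_segment_type")
print("chi2:", result["chi2"])
print("p-value:", result["p_value"])
print("degrees of freedom:", result["dof"])
print(result["expected"])
```

On the real data, a small p-value for a column such as `market_segment_type` or `type_of_meal_plan` would suggest that cancellation rates differ across its categories, though it says nothing about the size or direction of the effect.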