Predicting Hotel Cancellations
🏨 Background
You are supporting a hotel on a project that aims to increase revenue from room bookings. The hotel believes that data science can help reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
| Column | Description |
|---|---|
| Booking_ID | Unique identifier of the booking. |
| no_of_adults | The number of adults. |
| no_of_children | The number of children. |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
| no_of_week_nights | Number of week nights (Monday to Friday). |
| type_of_meal_plan | Type of meal plan included in the booking. |
| required_car_parking_space | Whether a car parking space is required. |
| room_type_reserved | The type of room reserved. |
| lead_time | Number of days before the arrival date the booking was made. |
| arrival_year | Year of arrival. |
| arrival_month | Month of arrival. |
| arrival_date | Date of the month for arrival. |
| market_segment_type | How the booking was made. |
| repeated_guest | Whether the guest has previously stayed at the hotel. |
| no_of_previous_cancellations | Number of previous cancellations. |
| no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
| avg_price_per_room | Average price per day of the booking. |
| no_of_special_requests | Count of special requests made as part of the booking. |
| booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import pandas as pd
data = pd.read_csv("data/hotel_bookings.csv")
data
Next Steps
- Data Cleaning: Handle missing values appropriately.
- Exploratory Data Analysis (EDA): Understand the distribution and relationships between different variables.
- Feature Engineering: Create any new features that might help in predicting cancellations.
- Modeling: Build a predictive model to identify factors affecting cancellations.
- Recommendations: Based on the model and analysis, provide actionable recommendations to reduce cancellations.
Let's start with data cleaning and handling missing values.
# Checking the number of missing values in each column
missing_values = data.isnull().sum()
# Displaying columns with missing values
missing_values = missing_values[missing_values > 0]
missing_values
The dataset has missing values in several columns. Here is a summary of the number of missing values for each affected column:
- no_of_adults: 413 missing values
- no_of_children: 324 missing values
- no_of_weekend_nights: 367 missing values
- no_of_week_nights: 807 missing values
- type_of_meal_plan: 526 missing values
- required_car_parking_space: 2592 missing values
- room_type_reserved: 1171 missing values
- lead_time: 472 missing values
- arrival_year: 378 missing values
- arrival_month: 504 missing values
- arrival_date: 981 missing values
- market_segment_type: 1512 missing values
- repeated_guest: 586 missing values
- no_of_previous_cancellations: 497 missing values
- no_of_previous_bookings_not_canceled: 550 missing values
- avg_price_per_room: 460 missing values
- no_of_special_requests: 789 missing values
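Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of how the summary above could be expressed as percentages; the helper name `missing_summary` and the toy data are assumptions made for illustration, not part of the hotel's data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for affected columns."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for hotel_bookings.csv (values are made up)
toy = pd.DataFrame({
    "lead_time": [10, None, 30, 45],
    "type_of_meal_plan": ["Meal Plan 1", None, None, "Not Selected"],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Canceled"],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(data)`.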
Data Cleaning Strategy:
- Identify and remove rows with critical missing values that cannot be imputed reasonably.
- Impute missing values for columns where imputation makes sense, using mean/median for numerical values and mode for categorical values.
- Check for any inconsistencies in the data after handling missing values.
Let's start by analyzing which rows should be removed and which can be imputed.
# Rows missing critical information (such as Booking_ID) should be dropped.
# Booking_ID has no missing values according to the summary above, so nothing is dropped on that basis.
# Dropping rows in which every column is missing; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
data_cleaned = data.dropna(how='all').copy()
# Imputing missing numerical values with the column median
numerical_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
                  'lead_time', 'arrival_year', 'arrival_month', 'arrival_date',
                  'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
                  'avg_price_per_room', 'no_of_special_requests']
for col in numerical_cols:
    data_cleaned[col] = data_cleaned[col].fillna(data_cleaned[col].median())
# Imputing missing categorical values with the column mode
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved',
                    'market_segment_type', 'repeated_guest']
for col in categorical_cols:
    data_cleaned[col] = data_cleaned[col].fillna(data_cleaned[col].mode()[0])
# Checking if there are any remaining missing values
remaining_missing_values = data_cleaned.isnull().sum().sum()
remaining_missing_values
All missing values have been successfully handled. The next steps involve exploratory data analysis (EDA) to understand the distribution of the data and the relationships between variables.
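The cleaning strategy also lists a check for inconsistencies after imputation. A minimal sketch of what such a check could look like is given here, assuming the column names from the data dictionary. The rules themselves (zero nights, zero guests, impossible calendar dates, negative prices) are illustrative assumptions, not checks requested by the hotel. Because month and day are imputed independently, a combination that does not exist on the calendar can in principle appear.

```python
from datetime import date

import pandas as pd


def is_valid_date(year, month, day) -> bool:
    """True if (year, month, day) is a real calendar date."""
    try:
        date(int(year), int(month), int(day))
        return True
    except ValueError:
        return False


def consistency_report(df: pd.DataFrame) -> dict:
    """Count rows that look implausible (hypothetical rules, see lead-in)."""
    total_nights = df["no_of_weekend_nights"] + df["no_of_week_nights"]
    total_guests = df["no_of_adults"] + df["no_of_children"]
    valid_dates = [
        is_valid_date(y, m, d)
        for y, m, d in zip(df["arrival_year"], df["arrival_month"], df["arrival_date"])
    ]
    return {
        "zero_nights": int((total_nights == 0).sum()),
        "zero_guests": int((total_guests == 0).sum()),
        "invalid_arrival_date": len(valid_dates) - sum(valid_dates),
        "negative_price": int((df["avg_price_per_room"] < 0).sum()),
    }


# Toy frame (made-up values); the second row has no guests, no nights,
# and an arrival date of 29 February 2018, which does not exist.
toy = pd.DataFrame({
    "no_of_adults": [2, 0, 1],
    "no_of_children": [0, 0, 1],
    "no_of_weekend_nights": [1, 0, 2],
    "no_of_week_nights": [2, 0, 3],
    "arrival_year": [2018, 2018, 2018.0],
    "arrival_month": [2, 2, 10],
    "arrival_date": [28, 29, 15.0],
    "avg_price_per_room": [100.0, 80.0, 120.5],
})
print(consistency_report(toy))
```

On the real data this would be called as `consistency_report(data_cleaned)`; what to do with flagged rows (drop, correct, or keep) is a judgment call that depends on how the hotel records bookings.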
Exploratory Data Analysis (EDA):
- Descriptive Statistics: Summary statistics for numerical and categorical variables.
- Visualizations: Histograms, box plots, and bar charts to understand distributions and relationships.
- Correlation Analysis: To see which numerical features are correlated with cancellations.
Let's start with descriptive statistics and some basic visualizations.
import matplotlib.pyplot as plt
import seaborn as sns
# Descriptive statistics for numerical variables
numerical_stats = data_cleaned.describe()
# Descriptive statistics for categorical variables
categorical_stats = data_cleaned.describe(include=['object'])
# Displaying numerical and categorical stats
numerical_stats, categorical_stats
The descriptive statistics provide a good overview of the numerical and categorical variables in the dataset:
Numerical Variables
- no_of_adults: Mostly 2 adults per booking.
- no_of_children: Mostly 0 children per booking.
- no_of_weekend_nights: Usually 1 or 2 weekend nights per booking.
- no_of_week_nights: Usually around 2 week nights per booking.
- required_car_parking_space: Very few bookings require a car parking space.
- lead_time: Varies widely, with an average of 85 days.
- arrival_year: Mostly bookings for 2018.
- arrival_month: Fairly evenly distributed across months.
- arrival_date: Fairly evenly distributed across days.
- repeated_guest: Very few repeated guests.
- no_of_previous_cancellations: Very few previous cancellations.
- no_of_previous_bookings_not_canceled: Very few previous bookings not canceled.
- avg_price_per_room: Average price around 103.
- no_of_special_requests: Usually 0 or 1 special request.
Categorical Variables
- type_of_meal_plan: Most common is "Meal Plan 1".
- room_type_reserved: Most common is "Room_Type 1".
- market_segment_type: Most bookings are made online.
- booking_status: Majority of bookings are not canceled.
Next Steps
- Visualizations:
  - Histograms for numerical variables to understand their distributions.
  - Bar charts for categorical variables to see the frequency of categories.
  - Box plots to visualize the relationship between numerical variables and booking_status.
- Correlation Analysis:
  - Check correlations between numerical variables and booking_status.
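The correlation step in the plan above could be sketched as follows. Since booking_status is categorical, it has to be encoded as 0/1 first; the Pearson correlation with a binary variable is then the point-biserial correlation. The label "Canceled" is an assumption taken from the original Kaggle dataset and may differ in the modified file, and the function name and toy values are hypothetical.

```python
import pandas as pd


def correlation_with_cancellation(df, numeric_cols,
                                  status_col="booking_status",
                                  canceled_label="Canceled"):
    """Correlate each numeric column with a 0/1 cancellation flag."""
    # Assumption: canceled bookings are marked with `canceled_label`
    is_canceled = (df[status_col] == canceled_label).astype(int)
    corr = df[numeric_cols].corrwith(is_canceled)
    # Order by strength of association, regardless of sign
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Toy frame (made-up values)
toy = pd.DataFrame({
    "lead_time": [10, 200, 30, 250, 5, 180],
    "avg_price_per_room": [100, 90, 120, 95, 110, 105],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled",
                       "Canceled", "Not_Canceled", "Canceled"],
})
print(correlation_with_cancellation(toy, ["lead_time", "avg_price_per_room"]))
```

A correlation of this kind only captures linear association, so a weak value does not rule out a feature being useful to a model later on.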
Let's start with the visualizations.
# Histograms for numerical variables
numerical_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
                  'lead_time', 'avg_price_per_room', 'no_of_special_requests']
data_cleaned[numerical_cols].hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()
The histograms provide a visual understanding of the distributions of the numerical variables:
- No. of Adults: Most bookings have 2 adults.
- No. of Children: Most bookings have no children.
- No. of Weekend Nights: Most bookings include 1 or 2 weekend nights.
- No. of Week Nights: Most bookings include 1 to 3 week nights.
- Lead Time: There is a wide range, with many bookings made around 0-50 days before arrival, and a notable peak around 200-300 days.
- Avg Price per Room: Prices vary widely, with a concentration around 50-150.
- No. of Special Requests: Most bookings have no special requests, and fewer have 1 or 2 special requests.
Next, let's visualize the categorical variables.
# Bar charts for categorical variables
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'market_segment_type', 'booking_status']
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
for i, col in enumerate(categorical_cols):
    ax = axes.flatten()[i]
    sns.countplot(data=data_cleaned, x=col, ax=ax)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # Set X axis labels to 45 degree angle
# Remove the empty sixth subplot (only five variables are plotted)
fig.delaxes(axes[2][1])
plt.tight_layout()
plt.show()
The bar charts for the categorical variables reveal the following:
- Type of Meal Plan: Majority of bookings have "Meal Plan 1" or "Not Selected".
- Required Car Parking Space: Very few bookings require a car parking space.
- Room Type Reserved: "Room_Type 1" is the most commonly reserved room type.
- Market Segment Type: Most bookings are made online.
- Booking Status: Majority of bookings are not canceled.
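The bar charts show how often each category occurs, but not how often bookings in each category are cancelled. A minimal sketch of that breakdown is given here; the helper name, the "Canceled" label (taken from the original Kaggle dataset, so it may differ in the modified file), and the toy values are assumptions.

```python
import pandas as pd


def cancellation_rate_by(df, category_col,
                         status_col="booking_status",
                         canceled_label="Canceled"):
    """Number of bookings and share cancelled for each category value."""
    # Assumption: canceled bookings are marked with `canceled_label`
    flagged = df.assign(is_canceled=(df[status_col] == canceled_label))
    return (
        flagged.groupby(category_col)["is_canceled"]
        .agg(bookings="size", cancel_rate="mean")
        .sort_values("cancel_rate", ascending=False)
    )


# Toy frame (made-up values)
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline",
                            "Online", "Offline", "Corporate"],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled",
                       "Canceled", "Not_Canceled", "Not_Canceled"],
})
print(cancellation_rate_by(toy, "market_segment_type"))
```

Keeping the booking count next to the rate helps avoid over-reading a high rate in a category with very few bookings.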
Next, let's visualize the relationship between numerical variables and booking_status using box plots.
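As a numeric companion to the box plots, the quartiles that each box draws can also be tabulated per booking status. This is only a sketch: the helper name and toy values are hypothetical, and the status labels follow the original Kaggle dataset.

```python
import pandas as pd


def quartiles_by_status(df, numeric_col, status_col="booking_status"):
    """Min, quartiles and max of one numeric column, split by booking status."""
    stats = df.groupby(status_col)[numeric_col].describe()
    return stats[["min", "25%", "50%", "75%", "max"]]


# Toy frame (made-up values)
toy = pd.DataFrame({
    "lead_time": [10, 200, 30, 250, 5, 180],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled",
                       "Canceled", "Not_Canceled", "Canceled"],
})
print(quartiles_by_status(toy, "lead_time"))
```

On the real data, `quartiles_by_status(data_cleaned, "lead_time")` would give the values behind the lead_time box plot.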