
Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that data science can help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

Column | Description
Booking_ID | Unique identifier of the booking.
no_of_adults | The number of adults.
no_of_children | The number of children.
no_of_weekend_nights | Number of weekend nights (Saturday or Sunday).
no_of_week_nights | Number of week nights (Monday to Friday).
type_of_meal_plan | Type of meal plan included in the booking.
required_car_parking_space | Whether a car parking space is required.
room_type_reserved | The type of room reserved.
lead_time | Number of days before the arrival date the booking was made.
arrival_year | Year of arrival.
arrival_month | Month of arrival.
arrival_date | Date of the month for arrival.
market_segment_type | How the booking was made.
repeated_guest | Whether the guest has previously stayed at the hotel.
no_of_previous_cancellations | Number of previous cancellations.
no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled.
avg_price_per_room | Average price per day of the booking.
no_of_special_requests | Count of special requests made as part of the booking.
booking_status | Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import pandas as pd
data = pd.read_csv("data/hotel_bookings.csv")
data

Next Steps

  1. Data Cleaning: Handle missing values appropriately.
  2. Exploratory Data Analysis (EDA): Understand the distribution and relationships between different variables.
  3. Feature Engineering: Create any new features that might help in predicting cancellations.
  4. Modeling: Build a predictive model to identify factors affecting cancellations.
  5. Recommendations: Based on the model and analysis, provide actionable recommendations to reduce cancellations.

Let's start with data cleaning and handling missing values.

# Checking the number of missing values in each column
missing_values = data.isnull().sum()

# Displaying columns with missing values
missing_values = missing_values[missing_values > 0]
missing_values
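Raw counts are easier to judge next to the share of rows they represent. The sketch below is one way to report both. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook; in the notebook, `data` would be passed in instead.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only (hypothetical values)
toy = pd.DataFrame({
    "lead_time": [10, None, 30, 40],
    "type_of_meal_plan": ["Meal Plan 1", None, None, "Not Selected"],
    "Booking_ID": ["A1", "A2", "A3", "A4"],
})

summary = missing_summary(toy)
print(summary)
```

A column that is missing in only a few percent of rows is usually a reasonable candidate for imputation, while a much higher share may call for a closer look before filling it in.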

Seventeen of the 19 columns contain missing values; only Booking_ID and booking_status are complete. The number of missing values for each affected column:

  • no_of_adults: 413 missing values
  • no_of_children: 324 missing values
  • no_of_weekend_nights: 367 missing values
  • no_of_week_nights: 807 missing values
  • type_of_meal_plan: 526 missing values
  • required_car_parking_space: 2592 missing values
  • room_type_reserved: 1171 missing values
  • lead_time: 472 missing values
  • arrival_year: 378 missing values
  • arrival_month: 504 missing values
  • arrival_date: 981 missing values
  • market_segment_type: 1512 missing values
  • repeated_guest: 586 missing values
  • no_of_previous_cancellations: 497 missing values
  • no_of_previous_bookings_not_canceled: 550 missing values
  • avg_price_per_room: 460 missing values
  • no_of_special_requests: 789 missing values

Data Cleaning Strategy:

  1. Identify and remove rows with critical missing values that cannot be imputed reasonably.
  2. Impute missing values for columns where imputation makes sense, using mean/median for numerical values and mode for categorical values.
  3. Check for any inconsistencies in the data after handling missing values.

Let's start by analyzing which rows should be removed and which can be imputed.

# Booking_ID is the row identifier, so rows missing it would be dropped.
# The summary above shows it has no missing values, so nothing is removed on that basis.

# Drop rows in which every column is missing.
# .copy() makes data_cleaned an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning.
data_cleaned = data.dropna(how='all').copy()

# Imputing missing numerical values with median
numerical_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
                  'lead_time', 'arrival_year', 'arrival_month', 'arrival_date',
                  'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 
                  'avg_price_per_room', 'no_of_special_requests']

for col in numerical_cols:
    data_cleaned[col] = data_cleaned[col].fillna(data_cleaned[col].median())

# Imputing missing categorical values with mode
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 
                    'market_segment_type', 'repeated_guest']

for col in categorical_cols:
    data_cleaned[col] = data_cleaned[col].fillna(data_cleaned[col].mode()[0])

# Checking if there are any remaining missing values
remaining_missing_values = data_cleaned.isnull().sum().sum()

remaining_missing_values

All missing values have been successfully handled. The next steps involve exploratory data analysis (EDA) to understand the distribution of the data and the relationships between variables.
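Step 3 of the cleaning strategy (checking for inconsistencies) is not shown in code. A minimal sketch of what such a check could look like follows. The function name `flag_inconsistencies`, the specific rules, and the toy rows are assumptions chosen for illustration; the original analysis may have used different checks.

```python
import pandas as pd


def flag_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per rule; True marks a suspicious row."""
    flags = pd.DataFrame(index=df.index)
    # A booking with no guests at all
    flags["no_guests"] = (df["no_of_adults"] + df["no_of_children"]) == 0
    # A booking with zero nights in total
    flags["no_nights"] = (df["no_of_weekend_nights"] + df["no_of_week_nights"]) == 0
    # A negative room price
    flags["negative_price"] = df["avg_price_per_room"] < 0
    # Year/month/day combinations that are not real calendar dates.
    # Median imputation can leave these columns as floats, so cast to int first.
    parts = df[["arrival_year", "arrival_month", "arrival_date"]].astype("int64")
    parts.columns = ["year", "month", "day"]
    flags["invalid_date"] = pd.to_datetime(parts, errors="coerce").isna()
    return flags


# Toy rows for illustration only (hypothetical values)
toy = pd.DataFrame({
    "no_of_adults": [2, 0, 1],
    "no_of_children": [0, 0, 1],
    "no_of_weekend_nights": [1, 1, 0],
    "no_of_week_nights": [2, 1, 0],
    "avg_price_per_room": [100.0, 80.0, 95.5],
    "arrival_year": [2018.0, 2018.0, 2017.0],
    "arrival_month": [5.0, 2.0, 12.0],
    "arrival_date": [14.0, 29.0, 31.0],
})

flags = flag_inconsistencies(toy)
print(flags.sum())  # number of rows caught by each rule
```

Because year, month, and day were imputed independently, the date rule is worth running after imputation: separately filled components can combine into a date that does not exist.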

Exploratory Data Analysis (EDA):

  1. Descriptive Statistics: Summary statistics for numerical and categorical variables.
  2. Visualizations: Histograms, box plots, and bar charts to understand distributions and relationships.
  3. Correlation Analysis: To see which numerical features are correlated with cancellations.

Let's start with descriptive statistics and some basic visualizations.

import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics for numerical variables
numerical_stats = data_cleaned.describe()

# Descriptive statistics for categorical variables
categorical_stats = data_cleaned.describe(include=['object'])

# Displaying numerical and categorical stats
numerical_stats, categorical_stats

The descriptive statistics provide a good overview of the numerical and categorical variables in the dataset:

Numerical Variables

  • no_of_adults: Mostly 2 adults per booking.
  • no_of_children: Mostly 0 children per booking.
  • no_of_weekend_nights: Usually 1 or 2 weekend nights per booking.
  • no_of_week_nights: Usually around 2 week nights per booking.
  • required_car_parking_space: Very few bookings require a car parking space.
  • lead_time: Varies widely, with an average of 85 days.
  • arrival_year: Mostly bookings for 2018.
  • arrival_month: Fairly evenly distributed across months.
  • arrival_date: Fairly evenly distributed across days.
  • repeated_guest: Very few repeated guests.
  • no_of_previous_cancellations: Very few previous cancellations.
  • no_of_previous_bookings_not_canceled: Very few previous bookings not canceled.
  • avg_price_per_room: Average price around 103.
  • no_of_special_requests: Usually 0 or 1 special request.

Categorical Variables

  • type_of_meal_plan: Most common is "Meal Plan 1".
  • room_type_reserved: Most common is "Room_Type 1".
  • market_segment_type: Most bookings are made online.
  • booking_status: Majority of bookings are not canceled.
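The statement that most bookings are not canceled can be quantified directly, which also gives a baseline for any later model. The sketch below is illustrative: the helper `class_balance` is hypothetical, and the labels "Canceled" and "Not_Canceled" are assumed from the Kaggle source dataset rather than confirmed for this modified file.

```python
import pandas as pd


def class_balance(status: pd.Series) -> pd.DataFrame:
    """Bookings per status and their share of all bookings."""
    counts = status.value_counts()
    return pd.DataFrame({
        "bookings": counts,
        "share": (counts / counts.sum()).round(3),
    })


# Toy labels for illustration only; in the notebook, pass data_cleaned["booking_status"]
toy_status = pd.Series(
    ["Not_Canceled", "Not_Canceled", "Canceled", "Not_Canceled"],
    name="booking_status",
)

balance = class_balance(toy_status)
print(balance)
```

If one class dominates, a model that always predicts that class already reaches an accuracy equal to its share, so later results should be compared against that figure.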

Next Steps

  1. Visualizations:
    • Histograms for numerical variables to understand their distributions.
    • Bar charts for categorical variables to see the frequency of categories.
    • Box plots to visualize the relationship between numerical variables and booking_status.
  2. Correlation Analysis:
    • Check correlations between numerical variables and booking_status.
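As a preview of the correlation step, booking_status has to be turned into a number before it can be correlated with anything. One possible sketch encodes it as 0/1 and uses pandas' `corrwith`. The helper name, the assumed label "Canceled", and the toy values are illustrative assumptions.

```python
import pandas as pd


def correlation_with_cancellation(df, numeric_cols, canceled_label="Canceled"):
    """Pearson correlation of each numeric column with a 0/1 cancellation flag."""
    # Assumption: canceled bookings carry the label given in `canceled_label`
    is_canceled = (df["booking_status"] == canceled_label).astype(int)
    return df[numeric_cols].corrwith(is_canceled).sort_values(ascending=False)


# Toy frame for illustration only (hypothetical values)
toy = pd.DataFrame({
    "lead_time": [10, 200, 30, 250],
    "no_of_special_requests": [2, 0, 1, 0],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled", "Canceled"],
})

corr = correlation_with_cancellation(toy, ["lead_time", "no_of_special_requests"])
print(corr)
```

Pearson correlation only captures linear association, so a value near zero does not rule out a relationship; the box plots are a useful cross-check.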

Let's start with the visualizations.

# Histograms for numerical variables
numerical_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
                  'lead_time', 'avg_price_per_room', 'no_of_special_requests']

data_cleaned[numerical_cols].hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()

The histograms provide a visual understanding of the distributions of the numerical variables:

  • No. of Adults: Most bookings have 2 adults.
  • No. of Children: Most bookings have no children.
  • No. of Weekend Nights: Most bookings include 1 or 2 weekend nights.
  • No. of Week Nights: Most bookings include 1 to 3 week nights.
  • Lead Time: There is a wide range, with many bookings made around 0-50 days before arrival, and a notable peak around 200-300 days.
  • Avg Price per Room: Prices vary widely, with a concentration around 50-150.
  • No. of Special Requests: Most bookings have no special requests, and fewer have 1 or 2 special requests.

Next, let's visualize the categorical variables.

# Bar charts for categorical variables
categorical_cols = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'market_segment_type', 'booking_status']

fig, axes = plt.subplots(3, 2, figsize=(15, 15))

for i, col in enumerate(categorical_cols):
    ax = axes.flatten()[i]
    sns.countplot(data=data_cleaned, x=col, ax=ax)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # Set X axis labels to 45 degree angle

# Remove the empty subplot
fig.delaxes(axes[2][1])

plt.tight_layout()
plt.show()

The bar charts for the categorical variables reveal the following:

  • Type of Meal Plan: Majority of bookings have "Meal Plan 1" or "Not Selected".
  • Required Car Parking Space: Very few bookings require a car parking space.
  • Room Type Reserved: "Room_Type 1" is the most commonly reserved room type.
  • Market Segment Type: Most bookings are made online.
  • Booking Status: Majority of bookings are not canceled.
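The bar charts show how often each category occurs, but not how often bookings in that category are canceled. A minimal sketch of that comparison follows. The helper `cancellation_rate_by`, the assumed label "Canceled", and the toy rows are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd


def cancellation_rate_by(df, col, canceled_label="Canceled"):
    """Cancellation rate and booking count for each category of `col`."""
    # Assumption: canceled bookings carry the label given in `canceled_label`
    is_canceled = df["booking_status"].eq(canceled_label)
    rates = is_canceled.groupby(df[col]).agg(["mean", "count"])
    rates.columns = ["cancel_rate", "bookings"]
    return rates.sort_values("cancel_rate", ascending=False)


# Toy frame for illustration only (hypothetical values)
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Offline", "Online"],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Not_Canceled"],
})

rates = cancellation_rate_by(toy, "market_segment_type")
print(rates)
```

Keeping the booking count next to the rate helps avoid over-reading categories that contain only a handful of bookings.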

Next, let's visualize the relationship between numerical variables and booking_status using box plots.