Cancellation Countdown

Imagine that you invite a dear friend to visit your home.

You spend countless hours preparing the house, cleaning every corner, cooking and baking delicious food, and making sure everything is perfect for their arrival...

But just as your friend is about to arrive, you receive an unexpected message. Due to unforeseen circumstances, they won't be able to make it after all...

The disappointment and frustration you would feel are shared by many in the hospitality industry.

Cancellations can have a negative impact on a hotel's:

revenue,
occupancy rates,
staffing and inventory planning,
reputation,
and overall success.

But have you ever wondered which factors influence a guest's decision to cancel their reservation?

Today, we will dive into the topic of hotel cancellations and examine the data and trends behind them. By understanding the causes of cancellations, we can work to minimize their impact on our hotel and improve the guest experience.

Executive Summary:

The study identifies key features influencing guest cancellations in hotel bookings. Lead time, average room price, special requests, and weeknight stays emerge as crucial predictors. A Random Forest Classifier built on these variables achieves 87% accuracy in predicting cancellations.

The analysis reveals significant trends regarding cancellation likelihoods based on various factors:

Seasonality: Summer sees higher cancellations compared to winter.
Booking Timing: Guests booking on Sundays are more likely to cancel than those booking on weekdays.
Booking Types: Online and corporate bookings exhibit differing cancellation patterns.
Reservation Duration: Longer stays (> 2 weeks) correlate with fewer cancellations.
Guest Status: First-time (non-repeat) guests tend to cancel bookings more frequently.

The majority of guests are last-minute corporate online bookers, mostly booking for one or two adults in Room Type 1. Repeat guests predominantly originate from the corporate segment, indicating successful last-minute room fill-up strategies.

General Recommendations:

Enhance online user experience and credibility of information.
Extend booking options and offer flexible cancellation policies.
Implement targeted promotions and loyalty programs.
Focus incentives and discounts for specific target segments.
Strengthen online presence and foster partnerships with local businesses and corporate entities.
Engage in local event hosting and offer curated experiences.
Collect feedback for continual improvement of facilities and services.
Optimize room allocation strategies for improved efficiency.

By implementing these recommendations, hotels can enhance guest satisfaction, minimize cancellations, and optimize revenue generation strategies.

Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

Column	Description
`Booking_ID`	Unique identifier of the booking.
`no_of_adults`	The number of adults.
`no_of_children`	The number of children.
`no_of_weekend_nights`	Number of weekend nights (Saturday or Sunday).
`no_of_week_nights`	Number of week nights (Monday to Friday).
`type_of_meal_plan`	Type of meal plan included in the booking.
`required_car_parking_space`	Whether a car parking space is required.
`room_type_reserved`	The type of room reserved.
`lead_time`	Number of days before the arrival date the booking was made.
`arrival_year`	Year of arrival.
`arrival_month`	Month of arrival.
`arrival_date`	Date of the month for arrival.
`market_segment_type`	How the booking was made.
`repeated_guest`	Whether the guest has previously stayed at the hotel.
`no_of_previous_cancellations`	Number of previous cancellations.
`no_of_previous_bookings_not_canceled`	Number of previous bookings that were not canceled.
`avg_price_per_room`	Average price per day of the booking.
`no_of_special_requests`	Count of special requests made as part of the booking.
`booking_status`	Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
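The arrival date is split across three columns (`arrival_year`, `arrival_month`, `arrival_date`), and the stay length across two. As a minimal sketch of how these could be combined, the snippet below builds a single datetime column and a total-nights column. The toy rows, the `arrival_dt` and `total_nights` names, and the choice of `errors="coerce"` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy rows standing in for hotel_bookings.csv (values are made up for illustration)
sample = pd.DataFrame({
    "arrival_year": [2018, 2018, 2018],
    "arrival_month": [2, 7, 12],
    "arrival_date": [29, 15, 31],   # 2018-02-29 is not a real calendar date
    "no_of_weekend_nights": [1, 2, 0],
    "no_of_week_nights": [2, 5, 1],
})

# pd.to_datetime can assemble dates from columns named year/month/day;
# errors="coerce" turns impossible dates into NaT instead of raising
parts = sample[["arrival_year", "arrival_month", "arrival_date"]].rename(
    columns={"arrival_year": "year", "arrival_month": "month", "arrival_date": "day"}
)
sample["arrival_dt"] = pd.to_datetime(parts, errors="coerce")

# Hypothetical derived feature: total length of stay in nights
sample["total_nights"] = sample["no_of_weekend_nights"] + sample["no_of_week_nights"]

print(sample[["arrival_dt", "total_nights"]])
```

Coercing rather than raising keeps the remaining rows usable; any rows that end up as `NaT` can then be inspected alongside other missing values.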

The Challenge

Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.

Time is ticking. Good luck!

Importing necessary libraries and modules for data analysis and machine learning.

#import necessary libraries and modules for data exploration and analysis

import numpy as np
import pandas as pd
from datetime import datetime

#import necessary libraries for data visualization
from matplotlib import pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline
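# matplotlib 3.6 renamed the seaborn styles; fall back to the old name on earlier versions
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')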

#import necessary libraries and modules for statistical analysis
from scipy import stats
from statsmodels.formula.api import logit
import pingouin

#import necessary libraries and modules for machine learning
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as MSE
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
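
The executive summary refers to a Random Forest Classifier built on lead time, average room price, special requests, and weeknight stays. As a rough sketch of how such a model can be fitted and scored with the scikit-learn pieces imported above, the example below uses synthetic data with an invented cancellation rule. It is not the model from this analysis, and its accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in data: column names mirror the dataset, values are invented
X = pd.DataFrame({
    "lead_time": rng.integers(0, 300, n),
    "avg_price_per_room": rng.uniform(50, 250, n),
    "no_of_special_requests": rng.integers(0, 4, n),
    "no_of_week_nights": rng.integers(0, 8, n),
})

# Invented rule so the model has a pattern to learn:
# long lead time combined with no special requests means "cancelled" (1)
y = ((X["lead_time"] > 150) & (X["no_of_special_requests"] == 0)).astype(int)

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on the held-out rows and rank the features by importance
accuracy = accuracy_score(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

print(f"Test accuracy on synthetic data: {accuracy:.2f}")
print(importances)
```

The `feature_importances_` ranking is one way to see which variables the forest relies on; on real data it is worth checking against the per-feature tests as well.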

Creating helper functions:

def plot_and_stats(df, col_name):
    
    """
    Plot booking status counts and cancellation ratios for different groups
    in a given column of a dataframe, and return statistics for a chi-squared
    independence test.

    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame.
    col_name : str
        The name of the column to plot.

    Returns
    -------
    tuple
        (stats, groupby_df, groupby_df_r_cancel): the chi-squared test
        statistics, the booking status counts per group, and the
        cancellation ratio per group.
    """
    
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pingouin
    # reformat column name with underscores to title case with spaces to use it for the title and labels of the chart.
    col_name_formatted = col_name.replace("_", " ").title()
    
    # drop rows with missing data in specified column
    hotel_clean = df.dropna(subset=[col_name])
    
    # group booking status counts by specified column and reset index
    groupby_df = hotel_clean.groupby(col_name)['booking_status'].value_counts().rename('counts').reset_index()
    
    # group booking status cancellation ratios by specified column and reset index
    groupby_df_r = hotel_clean.groupby(col_name)['booking_status'].value_counts(normalize=True).rename('counts').reset_index()
    
    # subset booking status cancellation ratios for cancelled bookings only
    groupby_df_r_cancel = groupby_df_r[groupby_df_r['booking_status'] == 'Canceled']
    
    # compute statistics for chi-squared independence test between booking status and specified column
    stats = pingouin.chi2_independence(data=hotel_clean, x='booking_status', y=col_name)[2]
    
    # create subplots for countplot and barplot
    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot booking status counts by specified column
    sns.countplot(x=col_name, hue='booking_status', data=hotel_clean, ax=axs[0])
    
    #create a bar plot on the second subplot, showing the percentage of canceled bookings for each value in the specified column. The order of the values is sorted by the percentage of canceled bookings in descending order.
    sns.barplot(x=col_name, y="counts", data=groupby_df_r_cancel, order=groupby_df_r_cancel.sort_values('counts', ascending=False)[col_name], ax=axs[1])
    
    #adjusts the spacing between the subplots.
    plt.subplots_adjust(wspace=0.5)
    
    #set the labels and titles of the subplots, and rotate the x-axis tick labels by 70 degrees.
    axs[0].tick_params(axis='x', rotation=70)
    axs[0].set_xlabel(col_name_formatted)
    axs[0].set_ylabel('Number of Bookings')
    axs[0].set_title('Booking Status vs. {}'.format(col_name_formatted), fontsize=18, color='black')

    axs[1].tick_params(axis='x', rotation=70)
    axs[1].set_xlabel(col_name_formatted)
    axs[1].set_ylabel('Cancellation Ratio')
    axs[1].set_title('Cancellation Ratio vs. {}'.format(col_name_formatted), fontsize=18, color='black')
    
    #displays the plot.
    plt.show()
    
    # return the chi-squared test statistics, the counts per group, and the cancellation ratios per group
    return stats, groupby_df, groupby_df_r_cancel

def plot_and_logreg(df, col_name):
    """
    This function creates a histogram of the specified column split by booking status, a logistic regression model
    predicting booking status based on the specified column, and a scatter plot with trend lines showing the cancellation
    ratio for each value of the specified column. It also computes the correlation coefficient between the cancellation ratio
    and the specified column.


    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame.
    col_name : str
        The name of the column to plot.

    Returns
    -------
    pandas.Series
        Descriptive statistics (the output of describe()) for the specified column.
    """

    # Drop rows with missing data in specified column
    df_c = df.dropna(subset=[col_name])

    # Create the histograms of specified column split by booking status
    plot = sns.displot(data=df_c, x=col_name, col="booking_status",  hue="booking_status", legend=False)
    
    # reformat column name with underscores to title case with spaces to use it for the title and labels of the chart.
    col_name_formatted2 = col_name.replace("_", " ").title()
    
    plot.fig.suptitle(col_name_formatted2, y=1.1)


    # Hide the legend
    if plot._legend is not None:
        plot._legend.remove()

    # Show the plot
    plt.show()
    
    # Encode booking status as 0/1 for the regression (note: this adds a column to the caller's DataFrame)
    df['booking_status_num'] = df['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1})
    
    # Fit a logistic regression of booking status vs. the specified column
    mdl_bookingstatus_vs_specificcolumn = logit(f"booking_status_num ~ {col_name}", data=df).fit()

    # Print the parameters of the fitted model
    print(mdl_bookingstatus_vs_specificcolumn.params)

    # Create a subplot with two columns
    fig, axs = plt.subplots(ncols=2, figsize=(12, 4))

    # Plot the logistic regression trend line and a scatter plot of specific column vs. booking_status in the first column
    sns.regplot(x=col_name,
                y="booking_status_num",
                data=df, 
                ci=None,
                logistic=True,
                scatter_kws={'color': 'orange'},
                line_kws={"color": "red"},
                ax=axs[0])
    axs[0].set_title("Logistic Regression")

    # Compute the cancellation ratio for each value of the specified column
    cancel_ratio = df.groupby(col_name)['booking_status'].value_counts(normalize=True).loc[:, 'Canceled'].reset_index(name='cancellation_ratio')

    # Plot the cancellation ratio by the specified column using a scatter plot with a trendline in the second column
    sns.regplot(x=col_name, y='cancellation_ratio', data=cancel_ratio, scatter_kws={'color': 'orange'},
               line_kws={'color': 'red'}, ax=axs[1])
    axs[1].set_title("Cancellation Ratio")

    # Calculate the correlation coefficient
    corr_coef = cancel_ratio[col_name].corr(cancel_ratio['cancellation_ratio'])

    # Print the correlation coefficient
    print('Correlation Coefficient: {:.2f}'.format(corr_coef))

    # Show the plot
    plt.show()

    # Return the descriptive statistics
    return df_c[col_name].describe()
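
To make the statistics behind `plot_and_stats` more concrete, here is a small sketch of the same two ideas, the per-group cancellation ratio and the chi-squared test of independence, on a made-up table. It uses `scipy.stats.chi2_contingency` on a `pd.crosstab` rather than pingouin, so the numbers and the output layout are assumptions for illustration and will not match the helper's return value exactly.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up bookings for illustration only
toy = pd.DataFrame({
    "market_segment_type": ["Online"] * 6 + ["Corporate"] * 6,
    "booking_status": (["Canceled"] * 4 + ["Not_Canceled"] * 2
                       + ["Canceled"] * 1 + ["Not_Canceled"] * 5),
})

# Observed counts: one row per group, one column per booking status
observed = pd.crosstab(toy["market_segment_type"], toy["booking_status"])

# Share of cancelled bookings within each group
cancel_ratio = observed["Canceled"] / observed.sum(axis=1)

# Chi-squared test of independence (continuity correction switched off for simplicity)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(observed)
print(cancel_ratio.round(2))
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
```

A small p-value would suggest that booking status is not independent of the grouping column, though with only a handful of rows, as here, the test has very little power.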

# print the first 5 rows of the DataFrame
hotels.head()

# get information about the DataFrame
hotels.info()

# get the summary statistics of numerical and categorical columns
print(hotels.describe(include='all'))

Exploring missing values
