Analysis of factors affecting hotel cancellations

    1. Summary

    Most important factors determining booking status outcome

    In this report, the variables that affect whether a hotel reservation is cancelled are explored, based on the provided data. Two different approaches are used to ensure reliable results: the correlation of each variable with the target variable (the booking status), and an analysis of the coefficients of a logistic regression model, which estimates the odds of a booking being cancelled based on the provided data.

    Fig. 1 The 10 most important variables affecting the booking status (the 5 variables with the largest positive and the 5 with the largest negative correlation), according to correlation with booking status. A positive correlation in this context means that the variable is associated with bookings ending in cancellation, whereas a negative correlation means that the variable is associated with bookings that do not end in cancellation.

    Spearman's rho ranges from -1 to 1, with 1 meaning perfect positive correlation between the variables and -1 meaning perfect inverse correlation. Spearman's rho is a non-parametric statistic, so it does not assume that the variables are normally distributed. It does, however, assume that the data can be meaningfully ranked, an assumption that is violated here, so we should not rely on the correlations shown above alone.
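
    As a minimal sketch of how these correlations can be computed (using the dummy-encoded frame ass_df built in the appendix code), pandas can correlate every feature with the booking status directly:

    #Spearman correlation of each feature with the binary booking status
    corr=ass_df.corrwith(ass_df['booking_status'], method='spearman').drop('booking_status').sort_values()
    print(corr.head(5))  #largest negative correlations
    print(corr.tail(5))  #largest positive correlations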

    Fig. 2 The 10 most important variables affecting booking status (the 5 variables with the largest positive and the 5 with the largest negative coefficients) according to the logistic regression coefficients. Values above the blue line indicate that a variable contributes to higher odds of a booking being cancelled; values below the orange dotted line indicate that a variable contributes to lower odds of a booking being cancelled. The further a value is from 1, the stronger the variable's contribution in either direction.

    Logistic regression estimates the log odds of a booking being cancelled versus not being cancelled (which can be exponentiated into an odds ratio, i.e. cancelled:not_cancelled) and assigns the label according to the resulting probability. Analysing the coefficients shows which variables push a booking towards cancellation or non-cancellation.
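
    As a sketch of how the coefficients are read (model and feature_names stand in for the fitted logistic regression and its feature list, which are not shown in the appendix excerpt), exponentiating a log-odds coefficient yields the odds ratio plotted in Fig. 2:

    #Exponentiate log-odds coefficients into odds ratios; values > 1 raise the odds of cancellation, values < 1 lower them
    odds_ratios=pd.Series(np.exp(model.coef_[0]), index=feature_names).sort_values()
    print(odds_ratios.tail(5))  #strongest drivers of cancellation
    print(odds_ratios.head(5))  #strongest drivers of non-cancellation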

    The figures show that both methods agree on several factors: lead time, average room price and online booking are associated with higher odds of cancellation in both analyses (i.e. they increase the probability that a booking will be cancelled). The number of special requests, the number of previous bookings not cancelled, and whether a car parking space is required all decrease the odds of a booking being cancelled. The full figures with all correlations/coefficients can be found in the appendix. The agreement between the two methods suggests that the results are robust to violations of some of the underlying assumptions, indicating that the effects of these variables on booking status are strong.

    Other variables, such as arrival month or meal plan, appear to have an effect in one of the analyses but not the other. This indicates that their influence on booking status is weaker and less robust to violations of the statistical assumptions of the respective method.

    In summary, I recommend focusing on lead time, average room price, online booking, number of special requests, number of previous bookings not cancelled, and required car parking space when trying to reduce cancellations.

    2. Methods

    Missing data were imputed using K-nearest neighbors (KNN) imputation. The optimal number of neighbors k was selected by assessing the performance of a random forest classifier trained and tested iteratively on data imputed with different values of k. Relevant features were selected using recursive feature elimination (RFE) with a logistic regression model. The coefficients described above were obtained from a logistic regression model trained on the post-imputation data, using the features selected by RFE (see code in appendix).
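
    A minimal sketch of the RFE step (x_train and y_train are the imputed training data from the appendix pipeline; the number of features to keep is illustrative, not the value used in the analysis):

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    #Recursively drop the weakest features according to a logistic regression estimator
    rfe=RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
    rfe.fit(x_train, y_train.ravel())
    selected_mask=rfe.support_  #boolean mask of the retained features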

    3. Appendix

    3.1 Code

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    
    df = pd.read_csv("data/hotel_bookings.csv")
    df.head(10)
    #Check number of missing values
    df.isna().sum()
    #There are lots of records with missing values. Later we could compare a model trained on rows with NaNs removed against one trained on imputed values to see which performs better; for now, we impute the values.
    
    #The 'Booking_ID' column holds no information we can use at the moment, so we can drop it
    df1=df.drop('Booking_ID', axis=1)
    
    #Lets check in which categorical columns the values can just be replaced with 1 and 0
    for col in ['booking_status','type_of_meal_plan', 'room_type_reserved', 'market_segment_type']:
        print(df[col].value_counts())
    #Now we prepare the data for KNN imputation, where missing values are filled in using the values of the k nearest neighbouring records
    
    #As KNN works on the distances between data points, we need to normalize the data onto a 0-1 scale for the algorithm to work properly. First we visualize the distribution of the non-binary data
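    #Min-max scaling maps each feature onto [0, 1]: x_scaled = (x - x_min)/(x_max - x_min)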
    nobin=df.select_dtypes(exclude='object')
    
    #Now we plot the distributions of the data - normalization cannot deal with outliers, so we want to make sure there are none
    fig, axes = plt.subplots(4, 4, figsize=[10, 10])
    fig.subplots_adjust(wspace=0.7, hspace=0.7)
    fig.suptitle('Numeric variable distributions')
    
    for ax, c in zip(axes.flatten(), nobin.columns):
        sns.histplot(nobin[c], ax=ax)
        
    fig.delaxes(axes[3][3])
    fig.delaxes(axes[3][2])    
        
    #Create dataframe to show potential associations between variables
    ass_df=df1.copy()
    ass_df.dropna(inplace=True)
    ass_df['booking_status']=ass_df['booking_status'].apply(lambda x: 1 if x=='Canceled' else 0 )
    ass_df=pd.get_dummies(ass_df)
    
    sns.clustermap(ass_df.corr(method='spearman'))
    #Replace booking status with 1 or 0
    df1['booking_status']=df1['booking_status'].replace({'Canceled':1, 'Not_Canceled':0})
    
    #Replace arrival month with quarters to reduce the number of dummy variables created during one-hot encoding
    quarters={1:'Q1', 2:'Q1', 3:'Q1', 4:'Q2', 5:'Q2', 6:'Q2', 7:'Q3', 8:'Q3', 9:'Q3', 10:'Q4', 11:'Q4', 12:'Q4'}
    
    df1['arrival_month']=df1['arrival_month'].apply(lambda x: quarters[x] if x in quarters else x)
    
    #We also drop the repeated guest column, and the number of previous cancellations column, as these are correlated
    #with the number of previous uncancelled bookings
    df1.drop(['repeated_guest', 'no_of_previous_cancellations'], axis=1, inplace=True)
    
    #Split data into target and features, transform to np.array (also get dummy variables for the features)
    target=df1['booking_status'].to_numpy().reshape(-1,1)
    features=pd.get_dummies(df1.drop('booking_status', axis=1)).to_numpy()
    
    #Complete dataframe with dummy variables
    ddf=pd.get_dummies(df1)
    
    print(target.shape, features.shape)
    df1.head(10)
    #To use KNN, we scale all values between 0 and 1. In the evaluation loop below, we fit the scaler and imputer on the training split only in order to avoid data leakage. For the final imputation we transform all of our variables, so we use the ddf data frame
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import KNNImputer
    
    #Now impute values using KNN imputation for different values of k. Afterwards, use a random forest classifier to calculate an accuracy score
    score_dict={}
    
    for k in [1, 2, 3, 4, 5, 6]:
        
        #Split the dataset into train/test sets first, so the scaler and imputer are fit on the training data only
        x_train, x_test, y_train, y_test=train_test_split(features, target, test_size=0.3)
        
        #Scale features (fit on train, apply to test)
        scaler=MinMaxScaler()
        x_train=scaler.fit_transform(x_train)
        x_test=scaler.transform(x_test)
        
        #Impute the missing feature values (fit on train, apply to test)
        imputer=KNNImputer(n_neighbors=k)
        x_train=imputer.fit_transform(x_train)
        x_test=imputer.transform(x_test)
        
        #Run RFC to assess accuracy
        rfc=RandomForestClassifier(n_estimators=100, n_jobs=10)
        rfc.fit(x_train, y_train.ravel())
        accuracy=rfc.score(x_test, y_test)
        
        score_dict[k]=accuracy
    #Check accuracy scores
    print(score_dict)
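    
    #Programmatic alternative to reading the printed dict (small helper, not part of the original analysis)
    best_k=max(score_dict, key=score_dict.get)
    print(best_k)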
    
    #k=5 seems to produce the best score, so we use that for our final imputations
    
    #Do imputation
    scaler=MinMaxScaler()
    scaled_feats=scaler.fit_transform(ddf.to_numpy())
    
    imputer=KNNImputer(n_neighbors=5)
    imp_feats=imputer.fit_transform(scaled_feats)
    imp_feats_unscaled=scaler.inverse_transform(imp_feats)
    
    imp_df=pd.DataFrame(imp_feats_unscaled, columns=ddf.columns)
    #Check nan values after imputation
    imp_df.info()
    #The imputer partially returns floats for binary/integer columns. We check which columns (originally) contained which type of values
    #Only 'avg_price_per_room' is truly continuous. Based on that, we will round all other columns.
    #Round half-up: if the first decimal is >=5 round up, else round down
    columns=[c for c in imp_df if not c=='avg_price_per_room']
    for c in columns:
        imp_df[c]=np.floor(imp_df[c]+0.5).astype(int)
    #Now we want to see whether the different variables are correlated. For that, we will use different correlation measures and compare them: Pearson's r (continuous vs continuous, -1 to 1), Cramér's V (categorical vs categorical, 0 to 1), the point-biserial correlation coefficient (continuous vs binary, -1 to 1), and Kendall's tau-b (non-parametric, -1 to 1)
    
    sns.clustermap(imp_df.corr(method='pearson'))
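    
    #A minimal sketch of the Cramér's V measure mentioned above (not part of the original pipeline); assumes scipy is available
    from scipy.stats import chi2_contingency
    
    def cramers_v(x, y):
        #Derive V from the chi-squared statistic of the contingency table
        confusion=pd.crosstab(x, y)
        chi2=chi2_contingency(confusion)[0]
        n=confusion.to_numpy().sum()
        r, k=confusion.shape
        return np.sqrt(chi2/(n*(min(r, k)-1)))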