Identifying important features that predicts hotel cancelation

Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed to increase revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

Summary

Business Objective: The goal of the project is to increase revenue from room bookings. It was tasked to identify factors that contributes when a booking will be fulfilled or not.

Methodology:

Data cleaning and validation - The values in the columns conforms their respective column properties. However, there is a lot of null values in the dataset.
Exploratory analysis - Arrival year, arrival month, and arrival date is seem to have a weak or no effect on reservation cancels. No columns are dropped due to very strong correlation.
Feature selection and null imputation - The columns that seem to have weak or no effect on reservation cancels are dropped. Observations that are 80 percent nulls are also dropped. Nulls were simply imputated with the most logical default value for the column. Another dataset was saved in which the nulls were dropped.
Machine learning - Three basic models are used which are KNN, logistic regression, and random forest classifier and both imputated and dropped nulls dataset were used. Models using the dropped nulls dataset performs slightly better and the random forest classifier has the highest accuracy score for both datasets. Hence, this was used to investigate feature importance.
Model feature analysis - based on the random forest classifier. The top 5 features are lead time, average price per room, number of special requests, number of week nights, and number of weekend nights.

Recommendations:

Technical
- Extract features from the arrival year, arrival month, and arrival date. Combining these columns will give a specific date in which it can be identified if it was a holiday or if there was a storm. Information about the location of the hotel is needed for this extraction.
- Apply multiple imputation by chained equations on the dataset and test if it would yield a higher accuracy score. This was the initial plan but fancy impute was not supported in the current workspace.
- Models used are baseline models only. Tune each model and see which performs best and check if the top features remains the same.
Business
- Limit the lead time for hotel reservation since reservations with higher lead time tend to be canceled more. A lead time of 3 to 6 months would be reasonable.
- If the average price per room could be changed, an average price between 100 and 200 is a good range. However, this would also depend on the hotel's services and reputation. If the average price is too high, bookings have a higher tendency to be canceled.
- Accomodate special requests. Bookings with higher special requests tend to cancel less.
- Since the business cannot control the length of stay, use marketing strategies that would encourage people to not cancel their booking such as discounts or bonuses if they have a longer length of stay.

Dataset

The stakeholders provided their bookings data in a file called hotel_bookings.csv, which contains the following:

Column	Description
`Booking_ID`	Unique identifier of the booking.
`no_of_adults`	The number of adults.
`no_of_children`	The number of children.
`no_of_weekend_nights`	Number of weekend nights (Saturday or Sunday).
`no_of_week_nights`	Number of week nights (Monday to Friday).
`type_of_meal_plan`	Type of meal plan included in the booking.
`required_car_parking_space`	Whether a car parking space is required.
`room_type_reserved`	The type of room reserved.
`lead_time`	Number of days before the arrival date the booking was made.
`arrival_year`	Year of arrival.
`arrival_month`	Month of arrival.
`arrival_date`	Date of the month for arrival.
`market_segment_type`	How the booking was made.
`repeated_guest`	Whether the guest has previously stayed at the hotel.
`no_of_previous_cancellations`	Number of previous cancellations.
`no_of_previous_bookings_not_canceled`	Number of previous bookings that were canceled.
`avg_price_per_room`	Average price per day of the booking.
`no_of_special_requests`	Count of special requests made as part of the booking.
`booking_status`	Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

Methodology

1. Data Cleaning and Validation

The dataset has 36275 rows and 19 columns. The following columns are validated:

Booking_ID: All values are unique, 0 null values.
no_of_adults: Numeric values without values below 0, 413 null values.
no_of_children: Numeric values without values below 0, 324 null values.
no_of_weekend_nights: Numeric values without values below 0, 367 null values.
no_of_week_nights: Numeric values without values below 0, 807 null values.
type_of_meal_plan: 4 unique categorical values, 526 null values.
required_car_parking_space: Boolean column as numeric values, 2592 null values.
room_type_reserved: 7 unique categorical values, 1171 null values.
lead_time: Numerical values without values below 0, 472 null values.
arrival_year: 2 numerical values (2017 and 2018), 378 null values.
arrival_month: 12 numerical values (1 to 12), 504 null values.
arrival_date: 31 numerical values (1 to 31), 981 null values.
market_segment_type: 5 unique categorical values, 1512 null values.
repeated_guest: Boolean column as numeric values, 586 missing values.
no_of_previous_cancellations: Numeric values without values below 0, 497 null values.
no_of_previous_bookings_not_canceled: Numeric values without values below 0, 550 null values.
avg_price_per_room: Numeric values without values below 0, 460 null values.
no_of_special_requests: Numeric values without values below 0, 789 null values.
booking_status: 2 unique categorical values, 0 null values.

6 hidden cells

Since the main focus is to identify whether the guest will cancel the reservation or not, the last column will be converted into a boolean column titled canceled with 1 as true and 0 as false. The booking ID will also be dropped since this is not a feature for the model.

1 hidden cell

2. Exploratory Analysis

The effects of each feature if a reservation will be canceled or not is plotted depending on the data type of the column. A regression plot is used for numeric columns while a bar plot is used for the means of cancelation for categorical columns.

sns.set_palette('mako')

for col in hotels.columns:
    if hotels[col].dtype.name == 'category':
        grouped = hotels.groupby(col, as_index=False)['canceled'].agg({'means':np.mean})
        sns.barplot(x=col, y='means', data=grouped)
        plt.title('canceled means per {}'.format(col))
        plt.xticks(fontsize=7)
        plt.show()
        
    elif hotels[col].dtype.name == 'float64':
        sns.regplot(x=col, y='canceled', data=hotels)
        plt.title('{} vs canceled'.format(col))
        plt.show()

Observations:

no_of_adults: higher number of adults tend to cancel more.
no_of_children: higher number of children tend to cancel more.
no_of_weekend_nights: higher number of weekend nights slightly cancel more.
no_of_weeknights: higher number of weeknights cancel more.
type_of_meal_plan: meal plan 2 has the highest cancel rate while meal plan 3 has the lowest.
required_parking_space: those that require a parking space tend to cancel less.
room_type_reserved: room type 6 has the highest cancel rate while room type 7 has the lowest.
lead_time: higher lead time tend to cancel more.
arrival_year: 2018 has a higher cancel rate compared to year 2017
arrival_month: no significant differences.
arrival_date: no significant differences.
market_segment_type: online segment has the highest cancel rate while there are no cancels in complimentary.
repeated_guest: repeated guests tend to cancel less.
no_of_previous_cancelation: those with higher number of previous cancelation has a significant chance to not cancel.
no_of_previous_bookings_not_canceled: higher number of previous bookings not canceled tend to not cancel
avg_price_per_room: higher average price tend to have more cancels.
no_of_special_requests: higher special requests tend to cancel less.

The columns arrival month, and arrival date will be dropped since there is no significant differences between their values. Since these are date values, it is also decided to drop arrival year since the dates seem to have weak trends. However, it is recommended to extract features by combining these columns to produce a date and identify if there is a factor on that date which could influence booking cancelation such as holidays or storms. Unfortunately, this is not explored because there is no information about the location of the hotel.

Since using two features that are highly correlated in a model is redundant, a correlation heatmap would help in deciding if there are droppable features.

Hidden code

Since there are no very strong correlation between featrues, no features will be dropped due to correlation.

3. Feature Selection and Null Imputation

As mentioned in the last chapter, the columns arrival year, arrival month and arrival date will be dropped. Observations that are mostly null values will also be dropped using a threshold of 80 percent since there will be too many assumptions for these observations.

Since there are a lot of null values, different visualizations are used to identify if there are trends in the nulls. A matrix, heatmap, and dendrogram is used for the visualization.

‌
‌
‌