EXECUTIVE SUMMARY
Our analysis of the data indicates that there are several factors, such as lead_time, repeated_visits, no_of_special_requests, and room_type, that the group may be able to influence to help mitigate the risk of cancellation. It does appear that increasing the number of weekend nights increases the likelihood of cancellation, but when we consider that we do not have information about whether the cancellations disproportionately involve weekend nights (since the data do not specify, and any stay longer than two days must include weekdays as well), we will reserve recommending any actions until additional studies are performed.
All other recommendations, available in the Recommendations section at the end of this report, include specific suggestions for each variable of interest. Additionally, we have taken care to acknowledge weaknesses of the data, and to make recommendations for future research directions.
We have constructed a model that predicts the likelihood of a cancellation to more than
The following report organizes our findings and recommendations by section, according to the needs of the reader. For high-level summary, please direct your attention to the Introduction and Conclusion sections. For details on methodology and data treatment, please see Data Cleaning and Model Selection, followed by a look through the Analysis. For additional information, or to inspect the code, please see the Appendix.
INTRODUCTION
We have been retained to support a hotel with a project aimed to increase revenue from their room bookings. The hotel group has asked us to use our skills to investigate which factors affect whether a guest will cancel their booking(s) and to produce recommendations on how to reduce the likelihood of a cancellation. In order to fulfill this request, we will use a combination of SQL and Python tools to compile, clean, describe, and analyze the data provided.
All code related to the analysis is included in the appendix, along with plots and visualizations referenced in, but not included in, the main body of the report. Snapshots of the data as it progressed from ingestion through the pipeline to analysis are also available for inspection in the appendix.
Our report will provide detailed information on the analysis process, as well as a set of recommendations based on the data to assist the hotel group in determining how best to optimize their bookings' outcomes.
DATA CLEANING AND MODEL SELECTION
Initial Inspection and Findings
We were provided a raw data set by the hotel group, which we inspected using pandas (see attached code) in Python in conjunction with SQL. The original data set consists of
Plots of categorical distributions are made for initial inspection (see the Appendix, or Plot 1: Histograms for Inspection). Each quantitative variable is plotted against every other to inspect for significant interactions (see the Appendix, or Plot 2: Scatterplots for Inspection).
Table 1: Data Description and Summary
| Column (Variable) | Description | Type | Null/Missing |
|---|---|---|---|
| Booking_ID | unique identifier of each booking | String | 0 |
| no_of_adults | Number of adults | Integer | 413 |
| no_of_children | Number of Children | Integer | 324 |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel | Integer | 367 |
| no_of_week_nights | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel | Integer | 807 |
| type_of_meal_plan | Type of meal plan booked by the customer | String | 526 |
| required_car_parking_space | Does the customer require a car parking space? | Boolean | 2592 |
| room_type_reserved | Type of room reserved by the customer | String | 1171 |
| lead_time | Number of days between the date of booking and the arrival date | Integer | 472 |
| arrival_year | Year of arrival date | Integer | 378 |
| arrival_month | Month of arrival date | Integer | 504 |
| arrival_date | Day of arrival date | Integer | 981 |
| market_segment_type | Market segment designation | String | 1512 |
| repeated_guest | Is the customer a repeated guest? | Boolean | 586 |
| no_of_previous_cancellations | Number of previous bookings that were canceled by the customer prior to the current booking | Integer | 497 |
| no_of_previous_bookings_not_canceled | Number of previous bookings not canceled by the customer prior to the current booking | Integer | 550 |
| avg_price_per_room | Average price per day of the reservation; prices of the rooms are dynamic | Float | 460 |
| no_of_special_requests | Total number of special requests made by the customer | Integer | 789 |
| booking_status | Flag indicating if the booking was canceled or not | Boolean | 0 |
Dealing with Interaction Terms
To measure significance of possible interactions between variables, we compute the Pearson Moment Correlation Coefficient, Variance inflation Factor, and p-Value for each pair. Weak correlations appear to exist between no_previous_cancellations and no_previous_bookings_not_cancelled as well as between no_of_children and room_type_reserved.
The relationships between these variables is not strong enough to exclude them from initial analysis (
Model and Final Variable Selection
Given that the response column, booking_status, is binary, we choose logistic regression for primary mode of analysis. We apply a binary value mapping to booking_status and year so that Cancellation and
Regression analysis was iteratively performed using the statsmodels.api package, with the worst-performing (in terms of p-value) independent variable removed at each successive iteration. In this way we remove those variables that do not significantly influence the outcome without overlooking possible hidden effects (see Table 3: Final Variable Selections (Iterative Regressions)).
To deal with the problem of the null values, we removed first the required_car_parking_space column, and then all remaining rows with nulls. Given that preliminary regression demonstrated that required_car_parking_space was insignificant to the overall model outcomes, and that after removing it, only
Summary of Cleansed and Imputed Data
To convert categorical variables to quantitative for regression, the Weight of Evidence method was chosen, since it produces intermediate values (log-odds) that are approximately normally distributed (See Plot 6 below) and are meaningful in terms of representing the effect each subcategory has on outcomes.
Plot 6: Distribution of Log-Odds from Final Model
(NOTE: The extreme low outliers that you notice above are those bookings in the Complimentary category of the market_segment variable, which while they carry a large weight, are so rare that they do not make us question our assessment that log-odds are a normally distributed variable.)
The resulting cleansed and imputed data set has
The resulting data values used to build the final model are listed here, in Table 3: Final Variable Selections:
Table 3: Final Variable Selections (Iterative Regressions)
Since the coefficients do not significantly change as poorly performing variables are removed, we can be confident that we are not overlooking significant interactions or changing the model by applying the successive elimination method.
- After first run, we removed no_of_children (
) - After second run, we removed no_of_previous_bookings_not_canceled (
) - After third run, we removed no_of_previous_cancellations (
)
Explanation of Intermediate Model Values and their Effects
From the original set of variables we select the final variables based on their significance and model effects (see Table 2: Model Weights and Effects (Summary and Range)). Our resulting predictive model, which has no meaningful intercept, is given as
where the random variable
(See Plot 6 above for the distribution of
If the probability of cancellation is less than
Using these parameter weights, we are able to construct the following table to illustrate the extreme and midpoint values for each parameter (see the table below). This enables us to compare the effect that each level of change on a given parameter has on the model outcome. In other words, this is our basis for determining which model parameters are the most significant to determining whether a guest will or will not cancel their reservation.
NOTE: While some effects have much greater (in magnitude) possible input values, whether this translates to greater effect depends in part on the coefficient weight. For example, market_segment_type_woe (the quantitative transformation that assigns a meaningful numeric value to market segment category) has a minimum input (lower extreme) of
Table 2: Model Weights and Effects (Summary and Range)
| Factor | Weight | Min. Input | Max. Input | Median Input | Min. Output | Max. Output | Median Output |
|---|---|---|---|---|---|---|---|
| no_of_adults | -0.4752 | 0 | 4 | 2 | -1.9008 | 0.0000 | -0.9504 |
| no_of_weekend_nights | 0.0636 | 0 | 6 | 1 | 0.0000 | 0.3816 | 0.0636 |
| no_of_week_nights | -0.0362 | 0 | 17 | 2 | -0.6154 | 0.0000 | -0.0724 |
| lead_time | 0.0115 | 0 | 443 | 57 | 0.0000 | 5.0945 | 0.6555 |
| arrival_year | -0.9236 | 0 | 1 | 0 | -0.9236 | 0.0000 | 0.0000 |
| arrival_month | -0.0763 | 1 | 12 | 8 | -0.9156 | -0.0763 | -0.6104 |
| arrival_date | -0.0166 | 1 | 31 | 16 | -0.5146 | -0.0166 | -0.2656 |
| repeated_guest | -1.7944 | 0 | 1 | 0 | -1.7944 | 0.0000 | 0.0000 |
| avg_price_per_room | 0.0067 | 0 | 540 | 99.45 | 0.0000 | 3.6180 | 0.6663 |
| no_of_special_requests | -1.2029 | 0 | 5 | 0 | -6.0145 | 0.0000 | 0.0000 |
| type_of_meal_plan_woe | 0.2615 | -0.6671 | 0.5503 | -0.0721 | -0.1744 | 0.1439 | -0.0189 |
| room_type_reserved_woe | 1.2358 | -0.5187 | 0.4219 | -0.0240 | -0.6410 | 0.5214 | -0.0297 |
| market_segment_type_woe | 2.1180 | -21.1703 | 0.1677 | 0.1677 | -44.8387 | 0.3552 | 0.3552 |
Validating and Testing the Model
By applying the model to the existing data and comparing the data results with the predicted model outcomes (see Plot 4), we find that we have an overall true-positive rate of
Plot 4: Observed vs. Predicted Outcomes by Frequency
which is consistent with the regression model obtained by sklearn.linear_model to the data set (see Plot 3).
Plot 3: Receiver Operating Characteristic for Trained Model
We also plot the regression probabilities by linear sums of the model outcomes (log-odds) to inspect the distribution of outcomes (see Plot 5).
Plot 5: Logistic Comparison of Model Predictions vs. Observed Results
Our threshold is set to
ANALYSIS
Now that we have confirmed the accuracy of our model, we can confidently use its parameter values to investigate the effects that the various observed factors exert on actual outcomes. Based on our table of weights and coefficients (Table 2: Model Weights and Effects (Summary and Range)), we can model multiple scenarios to determine resulting effects on the likelihood of cancellation. The magnitude of each effect depends on both the size of the coefficient and the size of the input parameter. For our binary variables, this is either equal to zero or the size of the coefficient, but for our imputed and continuous variables, we will compare (more or less) continuous changes, based on differing combinations of inputs and coefficients.
By applying this approach to all variables, we find that the expected value for the log-odds
where
That is, we are estimating the expected probability of cancellation under our model, using the median coefficient values for each parameter, at about
Table of Proportions and Counts, pre- and post- Cleaning
Discussion and Recommendations by Variable
The following is an enumeration and discussion of the variables of interest that were identified by observing their likely values, magnitude of effect, and potential for improvement through targeted intervention.
Number of Adults (no_of_adults)
The number of adults on the reservation is important to the outcome, with the risk of cancellation for singles
We recommend the hotel group take actions to market their hotel experience to individuals by using targeted community advertising, special pricing and promotions, and direct marketing to help individuals reimagine their hotel stay experience.
Number of Weekend Nights in the Booking (no_of_weekend_nights)
The number of weekend nights included in each booking ranges from
Plot 7: Distribution of Weekend Nights Booked
It is unclear whether this effect is impacted by seasonality, or whether it is autocorrelated, since we only know whether a night was a weekend night, and do not know when that night occurred. We recommend a second look at this feature, using a data set wherein the date-time stamps are preserved, in order to better assess for time-series effects.
Given that weekends seem to increase the likelihood of cancellation, we propose that the hotel group consider promoting special offers and price incentives for weekend nights included in the booking. By offering special discounts for just weekend nights, it may be possible to offset the revenue impacts of increased cancellation.
That said, this effect is certainly not the weightiest in terms of overall model outcomes, so that it may not be particularly beneficial to invest too much into mitigation strategies centered on it.
Number of Weeknights in the Booking (no_of_week_nights)
Similarly to the number of weekend nights, the effect on model outcomes exerted by number of weeknights is limited, and since there is a maximum of
You can see in the distribution of weeknights below that the majority of bookings include
Plot 8: Distribution of Weeknights Booked
We do however recommend the inclusion of time-stamp data in future studies, so that we can inspect for patterns that may be missed in our current disaggregated data.
Lead Time in Days (lead_time)
As lead_time increases, so does the probability of cancellation. Holding all other factors constant (at median levels), we see that a change of
Plot 9: Distribution of Lead Times
Since the chance of a conflict emerging or change in plans is likely to increase with time, this is a reasonable finding. We recommend that the reminder system be structured to send reminders to each guest automatically when they are within
The reminders should emphasize the cancellation policy so that the guest is able to cancel as early as possible if they determine the need to cancel at all. The reasoning here is that by getting the guest to cancel early, we can return the room to the available pool and try to secure another reservation in a reasonable window. Since the median lead time is
Additionally, if the reservation is cancelled with fewer than
Arrival Month (arrival_month)
It is not surprising that cancellation rates are lower in the winter months, since these are times during which holiday celebrations increase the demand for hotel rooms, and reservations seem generally 'stickier' from October through December. According to our model, each change of one month corresponds to a change in cancellation likelihood of
Given that this association is likely due to factors entirely outside the control of the hotelier, it may be more useful to use this information as an invitation to study seasonal effects on 'stickiness', volume, and capacity. It may be beneficial to use the seasonal ebb and flow as a basis for scheduling large-scale maintenance and upkeep procedures.
In other words, with respect to the arrival month, it may be most beneficial to use this information to predict when the best time to accomplish necessary tasks that put strain on staff and limit the availability of some spaces within the hotel, rather than attempting to use this information to directly reduce cancellation rates.
Repeated Guest (repeated_guest)
Whether a guest has previously stayed at the hotel seems to be a particularly weighty factor in estimating their likelihood of cancellation. Since this is a binary variable, we do not know how often they stayed in the past, but the model prediction is that if a guest has stayed at the hotel previously, the likelihood that the current booking will be cancelled is reduced by as much as
There are a number of possible reasons for this finding, such as geolocation, market segment, professional association, and even personality. That is,
- if the hotel is located near an attraction or location to which the guest will travel periodically, such as a family home, conference, or event, it is plausible that the fact that they are repeat guests is coincidental to location;
- if the hotel is booked by a sponsor or employer on behalf of the guest, in the case of a professional conference, or as part of a travel cycle, it is possible that the repeat behavior is due to factors other than personal experience;
- if the same set of guests are responsible for this repeated behavior, it may not be inferentially sound to assume that simply having stayed at the hotel before is a predictor for non-cancellation in the future.
We recommend a more detailed look at a data set that includes individual demographics (anonymized), in order to probe more deeply into this effect. Given the magnitude, it seems worthwhile to determine whether there is a causal relationship.
In particular, it will be very useful to analyze the number of times they have been a guest, what their market segment has been, during what time(s) of year they arrive, and whether they typically travel alone or as part of a group.
Average Room Price per Night (avg_price_per_room)
The average price per room overall is
We recommend that where possible, sales and discounts be offered to guests who are already at higher risk of cancelling, to help mitigate the overall risk. Additionally, we recommend a more detailed study of the financial impacts of cancellations versus lowering prices, in order to determine how best to act on this finding.
Number of Special Requests (special_requests)
The number of special requests exerts a strong effect on the outcome, with the model predicting that even a single special request could reduce the likelihood of a cancellation by up to
It may be prudent for hotel management to implement policies that make it easier for individuals to include special requests in their bookings. While there are questions remaining around the specific nature of the requests, the effect is uncorrelated to the other independent variables under consideration. Implementing a system of customized recommendations to improve the stickiness of the reservation seems to be a relatively low-cost strategy, and the effects of such an intervention would be easily testable, since the booking population could be sampled at random and tested using a simple A/B framework.
Type of Meal Plan (meal_plan)
Meal plan is less influential to outcomes, with the likelihood varying
Room Type (room_type)
Room type has a significant effect on the likelihood of cancellation, but the room types are encoded, so that we do not know what particular characteristics may account for different levels of effect.
Room_Type 1 makes up
Market Segment (market_segment_type)
Given that the weight of evidence (
Changing market_segment_type form Complimentary to Online (
Given that complimentary bookings make up less than
It is our recommmendation that interventions to this effect focus on developing incentives and reminder systems to improve the stickiness around online bookings, rather than directly changing current booking systems.
Conclusion and Recommendations
Our final model, when tested against the data, demonstrated a high degree of predictive accuracy on the outcomes. Based on the parameter values, we analyzed the effects that each feature has on whether or not a guest may cancel their booking (
Recommendations
Because it is difficult to ascertain whether certain effects are indicators of increased cancellation risk or whether they are influencers, we will restrict our recommendations to identifying those reservations that are at higher risk of cancellation, so that the hotelier(s) have the best chance at either converting a cancellation to a rebooking, or preventing a cancellation through systematic direct marketing approaches.
We will first look at those factors where the hotel group may be able to directly influence outcomes, followed by a discussion of how to optimize outcomes for those bookings that cannot have their risk profile improved.
Repeat Visits
Given that repeated guests are
Special Requests
We cannot be fully certain whether special_requests is an indicator or and influencer, but it would be relatively simple to implement a recommender system that would convert basic demographic information into personal recommendations that might encourage special requests. This may include particular amenities such as views, access to restaurants and shops, or even local taxi and shuttle services. It would be simple to collect data to analyze whether special requests cause more stickiness with the booking, or whether they simply indicate it.
Room Type
Where possible, it may be useful to allocate rooms types associated with lower risk to those guests who carry inherent risk in their profile, such as single guests or new guests.
Lead Time
While we cannot change when guests make their reservations, we can implement automated systems that remind guests of their reservations at fixed intervals, to try to mitigate the risk of booking conflicts. We recommend sending reminders to the guest at
- reducing the probability that a guest commits to a conflicting engagement, or
- increasing the amount of time between a necessary cancellation and the booking date, and thereby providing more opportunity to recover the lost revenue by opening the room up for rebooking.
Because our mean lead_time is
With respect to those factors that are most likely to only be risk indicators, our recommendations are as follows:
Number of Adults
While this variable may be an influencer as well, since there is a chance we could entice some cases to invite additional people through targeted marketing outreach, it seems unlikely that this would improve the majority of cases, so we will treat it as an indicator.
We recommend using ad campaigns and targeted marketing to highlight local attractions for single travelers, in order to reduce the likelihood of cancellation, but we do not recommend overcommitting to this strategy without some additional analysis.
At minimum, knowing that the size of the party has a significant relationship to the risk of cancellation will help us to provide the hoteliers with information to fuel their utilization projections.
Arrival Month
There is no escaping the holiday season (at least in the USA), but we can dynamically change our overbooking allowance based on prior period booking activity. In other words, we recommend using this information, in combination with the remaining indicators, to structure the overbooking scheme for each hotel in the group. We recommend being initially conservative in this, pending future analyses, but this is another example of a strategy that could be easily testable.
With respect to the remaining variables of interest, we make the following recommendations:
Market Segment
It is unsurprising that complimentary bookings are cancelled with much less frequency than other types, but it is not practical to significantly increase the number of available complimentary bookings. We recommend that where possible, the hoteliers should be given latitude to offer complimentary bookings when the rooms would be otherwise unutilized, since there is a chance that this may lead to future bookings (the repeated_guest effect).
Average Room Price
There is a small positive effect on likelihood of cancellation by changes in price, with each change of
Limitations of the Data
As noted in the Analysis, the data did not include timestamps, only disaggregated dates (year, month, and day, not associated with one another). This means that we cannot analyze the data for any time trends other than seasonality.
The data are also missing sufficient detail to know whether for repeated_guest, how many prior bookings they have had. It would be useful to know this information, since it could affect the risk profile assessment directly.
We also do not know what particular features are implied by each room_type_reserved or type_of_meal_plan, so it is difficult to make specific recommendations.
If future analysis is performed in order to delve more deeply into some of the more influential factors, we recommend including the omitted information in the data.
Appendix
Plot 1: Histograms for Inspection
Plot 2: Scatterplots for Inspection
Weight of Evidence Transformations (Mappings)
Room Type
-- ROOM TYPE TRANSFORMATION MAP --
SELECT room_type_reserved, room_type_reserved_woe, COUNT(room_type_reserved) AS count, ROUND(COUNT(room_type_reserved) / ROUND(
(
SELECT COUNT(*)
FROM 'data/hotel_bookings_final.csv'
),4),4)
AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY room_type_reserved, room_type_reserved_woe
ORDER BY count DESC;Meal Plan
-- MEAL PLAN TRANSFORMATION MAP --
SELECT DISTINCT type_of_meal_plan, type_of_meal_plan_woe, COUNT(type_of_meal_plan) AS count, ROUND(COUNT(type_of_meal_plan) / ROUND(
(
SELECT COUNT(*)
FROM 'data/hotel_bookings_final.csv'
),4),4)
AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY type_of_meal_plan, type_of_meal_plan_woe
ORDER BY count DESC;Market Segment
-- MARKET SEGMENT TRANSFORMATION MAP --
SELECT DISTINCT market_segment_type, market_segment_type_woe, COUNT(market_segment_type) AS count, ROUND(
COUNT(market_segment_type) / ROUND(
(
SELECT COUNT(*)
FROM 'data/hotel_bookings_final.csv'
),4),4)
AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY market_segment_type, market_segment_type_woe
ORDER BY count DESC;Pre-Post Transformed Datasets
Original Dataset
-- SAMPLE VIEW OF ORIGINAL DATASET --
SELECT * FROM 'data/hotel_bookings.csv'
LIMIT 10;