Predicting Hotel Cancellations - Ben Mathewson

EXECUTIVE SUMMARY

Our analysis of the data indicates that there are several factors, such as lead_time, repeated_visits, no_of_special_requests, and room_type, that the group may be able to influence to help mitigate the risk of cancellation. It does appear that increasing the number of weekend nights increases the likelihood of cancellation, but when we consider that we do not have information about whether the cancellations disproportionately involve weekend nights (since the data do not specify, and any stay longer than two days must include weekdays as well), we will reserve recommending any actions until additional studies are performed.

All other recommendations, available in the Recommendations section at the end of this report, include specific suggestions for each variable of interest. Additionally, we have taken care to acknowledge weaknesses of the data, and to make recommendations for future research directions.

We have constructed a model that predicts the likelihood of a cancellation to more than accuracy (based on this data), and which is very straightforward in both its implementation and interpretation. The model would depend in part on future data, so that it could be continually trained and improved upon as conditions evolve. Implementing a risk assessment tool to help hoteliers identify high-risk bookings and make better-informed decisions about resource allocation would be a simple integration, depending on the current bookings management system(s) in use.

The following report organizes our findings and recommendations by section, according to the needs of the reader. For high-level summary, please direct your attention to the Introduction and Conclusion sections. For details on methodology and data treatment, please see Data Cleaning and Model Selection, followed by a look through the Analysis. For additional information, or to inspect the code, please see the Appendix.

INTRODUCTION

We have been retained to support a hotel with a project aimed to increase revenue from their room bookings. The hotel group has asked us to use our skills to investigate which factors affect whether a guest will cancel their booking(s) and to produce recommendations on how to reduce the likelihood of a cancellation. In order to fulfill this request, we will use a combination of SQL and Python tools to compile, clean, describe, and analyze the data provided.

All code related to the analysis is included in the appendix, along with plots and visualizations referenced in, but not included in, the main body of the report. Snapshots of the data as it progressed from ingestion through the pipeline to analysis are also available for inspection in the appendix.

Our report will provide detailed information on the analysis process, as well as a set of recommendations based on the data to assist the hotel group in determining how best to optimize their bookings' outcomes.

DATA CLEANING AND MODEL SELECTION

Initial Inspection and Findings

We were provided a raw data set by the hotel group, which we inspected using pandas (see attached code) in Python in conjunction with SQL. The original data set consists of entries across mixed-type variables (see Table 1: Data Description and Summary). The original dataset included rows containing null or missing values.

Plots of categorical distributions are made for initial inspection (see the Appendix, or Plot 1: Histograms for Inspection). Each quantitative variable is plotted against every other to inspect for significant interactions (see the Appendix, or Plot 2: Scatterplots for Inspection).

Table 1: Data Description and Summary

Column (Variable)	Description	Type	Null/Missing
Booking_ID	unique identifier of each booking	String	0
no_of_adults	Number of adults	Integer	413
no_of_children	Number of Children	Integer	324
no_of_weekend_nights	Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel	Integer	367
no_of_week_nights	Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel	Integer	807
type_of_meal_plan	Type of meal plan booked by the customer	String	526
required_car_parking_space	Does the customer require a car parking space?	Boolean	2592
room_type_reserved	Type of room reserved by the customer	String	1171
lead_time	Number of days between the date of booking and the arrival date	Integer	472
arrival_year	Year of arrival date	Integer	378
arrival_month	Month of arrival date	Integer	504
arrival_date	Day of arrival date	Integer	981
market_segment_type	Market segment designation	String	1512
repeated_guest	Is the customer a repeated guest?	Boolean	586
no_of_previous_cancellations	Number of previous bookings that were canceled by the customer prior to the current booking	Integer	497
no_of_previous_bookings_not_canceled	Number of previous bookings not canceled by the customer prior to the current booking	Integer	550
avg_price_per_room	Average price per day of the reservation; prices of the rooms are dynamic	Float	460
no_of_special_requests	Total number of special requests made by the customer	Integer	789
booking_status	Flag indicating if the booking was canceled or not	Boolean	0

Dealing with Interaction Terms

To measure significance of possible interactions between variables, we compute the Pearson Moment Correlation Coefficient, Variance inflation Factor, and p-Value for each pair. Weak correlations appear to exist between no_previous_cancellations and no_previous_bookings_not_cancelled as well as between no_of_children and room_type_reserved.

The relationships between these variables is not strong enough to exclude them from initial analysis (). There are also reasonable explanatory hypotheses for each relationship, since families tend to book rooms based on capacity and number of beds, and it is plausible that both the number of cancellations and the number of fulfilled visits (non-cancellations) depend on total number of bookings. Without information on the total number of prior bookings by client, we cannot make any authoritative conclusions, so it is reasonable to proceed to regression before we eliminate further columns.

Model and Final Variable Selection

Given that the response column, booking_status, is binary, we choose logistic regression for primary mode of analysis. We apply a binary value mapping to booking_status and year so that Cancellation and correspond to while Non-Cancellation and correspond to . Next we imputed the values of type_of_meal_plan, room_type_reserved, and market_segment_type to continuous variables using the Weight of Evidence method (Source).

Regression analysis was iteratively performed using the statsmodels.api package, with the worst-performing (in terms of p-value) independent variable removed at each successive iteration. In this way we remove those variables that do not significantly influence the outcome without overlooking possible hidden effects (see Table 3: Final Variable Selections (Iterative Regressions)).

To deal with the problem of the null values, we removed first the required_car_parking_space column, and then all remaining rows with nulls. Given that preliminary regression demonstrated that required_car_parking_space was insignificant to the overall model outcomes, and that after removing it, only rows containing null or missing values remained ( of the original ), so that we preserved as much complete data ( rows or ) as possible while still removing all rows containing null values.

Summary of Cleansed and Imputed Data

To convert categorical variables to quantitative for regression, the Weight of Evidence method was chosen, since it produces intermediate values (log-odds) that are approximately normally distributed (See Plot 6 below) and are meaningful in terms of representing the effect each subcategory has on outcomes.

Plot 6: Distribution of Log-Odds from Final Model

(NOTE: The extreme low outliers that you notice above are those bookings in the Complimentary category of the market_segment variable, which while they carry a large weight, are so rare that they do not make us question our assessment that log-odds are a normally distributed variable.)

The resulting cleansed and imputed data set has entries across variables (imputation on type_of_meal_plan, room_type_reserved, market_segment_type produced three new columns, while removal of required_car_parking_space and Booking_ID reduced the original dimension by two).

The resulting data values used to build the final model are listed here, in Table 3: Final Variable Selections:

Table 3: Final Variable Selections (Iterative Regressions)

Since the coefficients do not significantly change as poorly performing variables are removed, we can be confident that we are not overlooking significant interactions or changing the model by applying the successive elimination method.

After first run, we removed no_of_children ()
After second run, we removed no_of_previous_bookings_not_canceled ()
After third run, we removed no_of_previous_cancellations ()

Explanation of Intermediate Model Values and their Effects

From the original set of variables we select the final variables based on their significance and model effects (see Table 2: Model Weights and Effects (Summary and Range)). Our resulting predictive model, which has no meaningful intercept, is given as

where the random variable is the linear sum

(See Plot 6 above for the distribution of )

If the probability of cancellation is less than (i.e., ), the outcome is interpreted as a predicted cancellation, and if it is greater than or equal to it is interpreted as a predicted non-cancellation (fulfillment). You may see a complete portrait of the comparison between the observed and predicted results in Plot 5 below.

Using these parameter weights, we are able to construct the following table to illustrate the extreme and midpoint values for each parameter (see the table below). This enables us to compare the effect that each level of change on a given parameter has on the model outcome. In other words, this is our basis for determining which model parameters are the most significant to determining whether a guest will or will not cancel their reservation.

NOTE: While some effects have much greater (in magnitude) possible input values, whether this translates to greater effect depends in part on the coefficient weight. For example, market_segment_type_woe (the quantitative transformation that assigns a meaningful numeric value to market segment category) has a minimum input (lower extreme) of , and a parameter weight of , giving it a strong effect on the outcome. By comparison, while the ranges of avg_price_per_room and lead_time are greater, their parameter weights are much lower, so that the range of output values for those terms is actually smaller than for market_segment_type_woe. For this reason, it is important to focus on the summary of possible output values when comparing effects.

Table 2: Model Weights and Effects (Summary and Range)

Factor	Weight	Min. Input	Max. Input	Median Input	Min. Output	Max. Output	Median Output
no_of_adults	-0.4752	0	4	2	-1.9008	0.0000	-0.9504
no_of_weekend_nights	0.0636	0	6	1	0.0000	0.3816	0.0636
no_of_week_nights	-0.0362	0	17	2	-0.6154	0.0000	-0.0724
lead_time	0.0115	0	443	57	0.0000	5.0945	0.6555
arrival_year	-0.9236	0	1	0	-0.9236	0.0000	0.0000
arrival_month	-0.0763	1	12	8	-0.9156	-0.0763	-0.6104
arrival_date	-0.0166	1	31	16	-0.5146	-0.0166	-0.2656
repeated_guest	-1.7944	0	1	0	-1.7944	0.0000	0.0000
avg_price_per_room	0.0067	0	540	99.45	0.0000	3.6180	0.6663
no_of_special_requests	-1.2029	0	5	0	-6.0145	0.0000	0.0000
type_of_meal_plan_woe	0.2615	-0.6671	0.5503	-0.0721	-0.1744	0.1439	-0.0189
room_type_reserved_woe	1.2358	-0.5187	0.4219	-0.0240	-0.6410	0.5214	-0.0297
market_segment_type_woe	2.1180	-21.1703	0.1677	0.1677	-44.8387	0.3552	0.3552

Validating and Testing the Model

By applying the model to the existing data and comparing the data results with the predicted model outcomes (see Plot 4), we find that we have an overall true-positive rate of , measured as

Plot 4: Observed vs. Predicted Outcomes by Frequency

which is consistent with the regression model obtained by sklearn.linear_model to the data set (see Plot 3).

Plot 3: Receiver Operating Characteristic for Trained Model

We also plot the regression probabilities by linear sums of the model outcomes (log-odds) to inspect the distribution of outcomes (see Plot 5).

Plot 5: Logistic Comparison of Model Predictions vs. Observed Results

Our threshold is set to so that probabilities less than correspond to ‘Not_Canceled’ predictions and greater than or equal to corresponds to ‘Canceled’ predictions. Since produced the highest true positive rate, so we accept this value as the optimal threshold for our classification model.

ANALYSIS

Now that we have confirmed the accuracy of our model, we can confidently use its parameter values to investigate the effects that the various observed factors exert on actual outcomes. Based on our table of weights and coefficients (Table 2: Model Weights and Effects (Summary and Range)), we can model multiple scenarios to determine resulting effects on the likelihood of cancellation. The magnitude of each effect depends on both the size of the coefficient and the size of the input parameter. For our binary variables, this is either equal to zero or the size of the coefficient, but for our imputed and continuous variables, we will compare (more or less) continuous changes, based on differing combinations of inputs and coefficients.

By applying this approach to all variables, we find that the expected value for the log-odds

where represents the median input value for each parameter (), gives us a baseline probability of cancellation () of .

That is, we are estimating the expected probability of cancellation under our model, using the median coefficient values for each parameter, at about . This is skewed slightly more in favor of cancellation than our observations, since the actual proportion of cancellations is . Despite that fact, we can use this as a reasonable benchmark to compare the effects exerted by each parameter, since we are more interested in relative effects than overall outcomes. The table below summarizes the observed proportions and counts.

Table of Proportions and Counts, pre- and post- Cleaning

Discussion and Recommendations by Variable

The following is an enumeration and discussion of the variables of interest that were identified by observing their likely values, magnitude of effect, and potential for improvement through targeted intervention.

Number of Adults (no_of_adults)

The number of adults on the reservation is important to the outcome, with the risk of cancellation for singles greater than for doubles. Given that we have no information to determine whether this feature is defined as the number of adults per reservation or per room, we will refrain from making detailed recommendations. It is still useful to know that single reservations are more likely to cancel than doubles.

We recommend the hotel group take actions to market their hotel experience to individuals by using targeted community advertising, special pricing and promotions, and direct marketing to help individuals reimagine their hotel stay experience.

Number of Weekend Nights in the Booking (no_of_weekend_nights)

The number of weekend nights included in each booking ranges from in our data, with the probability of cancellation increasing by for each increase of weekend night. The most common bookings in this category include no weekend nights, with and nights sharing about equal likelihood (see Plot 7 below).

Plot 7: Distribution of Weekend Nights Booked

It is unclear whether this effect is impacted by seasonality, or whether it is autocorrelated, since we only know whether a night was a weekend night, and do not know when that night occurred. We recommend a second look at this feature, using a data set wherein the date-time stamps are preserved, in order to better assess for time-series effects.

Given that weekends seem to increase the likelihood of cancellation, we propose that the hotel group consider promoting special offers and price incentives for weekend nights included in the booking. By offering special discounts for just weekend nights, it may be possible to offset the revenue impacts of increased cancellation.

That said, this effect is certainly not the weightiest in terms of overall model outcomes, so that it may not be particularly beneficial to invest too much into mitigation strategies centered on it.

Number of Weeknights in the Booking (no_of_week_nights)

Similarly to the number of weekend nights, the effect on model outcomes exerted by number of weeknights is limited, and since there is a maximum of weeknights for every weekend nights, these two features effectively balance one another out.

You can see in the distribution of weeknights below that the majority of bookings include weeknights, which is not unexpected. Within this range, the expected change in likelihood of cancellation is , so that we can safely avoid spending too much effort optimizing for this effect.

Plot 8: Distribution of Weeknights Booked

We do however recommend the inclusion of time-stamp data in future studies, so that we can inspect for patterns that may be missed in our current disaggregated data.

Lead Time in Days (lead_time)

As lead_time increases, so does the probability of cancellation. Holding all other factors constant (at median levels), we see that a change of days to lead time is associated with just over change in likelihood of cancellation. Given that the median overall lead time is days, we recommend that the hoteliers implement a flagging system (automatic) that will begin to send reservation reminders to each guest when they are within days of their reservation.

Plot 9: Distribution of Lead Times

Since the chance of a conflict emerging or change in plans is likely to increase with time, this is a reasonable finding. We recommend that the reminder system be structured to send reminders to each guest automatically when they are within , , and days of their reservation.

The reminders should emphasize the cancellation policy so that the guest is able to cancel as early as possible if they determine the need to cancel at all. The reasoning here is that by getting the guest to cancel early, we can return the room to the available pool and try to secure another reservation in a reasonable window. Since the median lead time is days, it is reasonable to assume that our best chance of rebooking happens around that time.

Additionally, if the reservation is cancelled with fewer than days notice, it may be possible to convert the room to a complimentary room to entice future business from frequent guests. This is a more complex recommendation, since it implies the existence of a rewards program, and it entails some present cost that would have to be offset by future revenue.

Arrival Month (arrival_month)

It is not surprising that cancellation rates are lower in the winter months, since these are times during which holiday celebrations increase the demand for hotel rooms, and reservations seem generally 'stickier' from October through December. According to our model, each change of one month corresponds to a change in cancellation likelihood of . Holding all other factors at their median values, if the reservation is in January, there is a model probability of cancellation, but that probability drops to by December.

Given that this association is likely due to factors entirely outside the control of the hotelier, it may be more useful to use this information as an invitation to study seasonal effects on 'stickiness', volume, and capacity. It may be beneficial to use the seasonal ebb and flow as a basis for scheduling large-scale maintenance and upkeep procedures.

In other words, with respect to the arrival month, it may be most beneficial to use this information to predict when the best time to accomplish necessary tasks that put strain on staff and limit the availability of some spaces within the hotel, rather than attempting to use this information to directly reduce cancellation rates.

Repeated Guest (repeated_guest)

Whether a guest has previously stayed at the hotel seems to be a particularly weighty factor in estimating their likelihood of cancellation. Since this is a binary variable, we do not know how often they stayed in the past, but the model prediction is that if a guest has stayed at the hotel previously, the likelihood that the current booking will be cancelled is reduced by as much as .

There are a number of possible reasons for this finding, such as geolocation, market segment, professional association, and even personality. That is,

if the hotel is located near an attraction or location to which the guest will travel periodically, such as a family home, conference, or event, it is plausible that the fact that they are repeat guests is coincidental to location;
if the hotel is booked by a sponsor or employer on behalf of the guest, in the case of a professional conference, or as part of a travel cycle, it is possible that the repeat behavior is due to factors other than personal experience;
if the same set of guests are responsible for this repeated behavior, it may not be inferentially sound to assume that simply having stayed at the hotel before is a predictor for non-cancellation in the future.

We recommend a more detailed look at a data set that includes individual demographics (anonymized), in order to probe more deeply into this effect. Given the magnitude, it seems worthwhile to determine whether there is a causal relationship.

In particular, it will be very useful to analyze the number of times they have been a guest, what their market segment has been, during what time(s) of year they arrive, and whether they typically travel alone or as part of a group.

Average Room Price per Night (avg_price_per_room)

The average price per room overall is dollars, from the data. Each change of dollars is associated with a change in likelihood of cancellation. That is, if the price per room for a given guest was dollars, our model would have predicted a greater likelihood of cancellation.

We recommend that where possible, sales and discounts be offered to guests who are already at higher risk of cancelling, to help mitigate the overall risk. Additionally, we recommend a more detailed study of the financial impacts of cancellations versus lowering prices, in order to determine how best to act on this finding.

Number of Special Requests (special_requests)

The number of special requests exerts a strong effect on the outcome, with the model predicting that even a single special request could reduce the likelihood of a cancellation by up to , while two or more appear to reduce the likelihood by more than . As in the case of repeated_guest, it will be important to perform more detailed analysis to understand underlying relationships, such as what types of requests have the greatest effect. We recommend a deeper dive into this effect, but it does seem clear that adding any special request to the booking can greatly reduce the likelihood of cancellation.

It may be prudent for hotel management to implement policies that make it easier for individuals to include special requests in their bookings. While there are questions remaining around the specific nature of the requests, the effect is uncorrelated to the other independent variables under consideration. Implementing a system of customized recommendations to improve the stickiness of the reservation seems to be a relatively low-cost strategy, and the effects of such an intervention would be easily testable, since the booking population could be sampled at random and tested using a simple A/B framework.

Type of Meal Plan (meal_plan)

Meal plan is less influential to outcomes, with the likelihood varying between the least common choices (plans 2 and 3, which make up and , respectively), and negligibly between the remaining choices. Where possible, we recommend promoting Meal Plan 3, since that plan is associated with a lower probability of cancellation. Whether this recommendation is practical or actionable can't be effectively determined without knowing more about the available choices, so we recommend at least a cursory follow-on examination of meal plan types to determine whether it is worthwhile to manipulate this variable.

Room Type (room_type)

Room type has a significant effect on the likelihood of cancellation, but the room types are encoded, so that we do not know what particular characteristics may account for different levels of effect.

Room_Type 1 makes up of all bookings, and is the median type, with Room_Type 7 ( of bookings) decreasing the likelihood of cancellation by and Room_Type 6 ( of bookings) increasing it by the same amount. Since both of these types are relatively rare, it seems necessary to delve into the expanded descriptions before making recommendations.

Market Segment (market_segment_type)

Given that the weight of evidence () of the Complimentary subcategory of the market_segment_type is so large () and it has the strongest positive coefficient (), we can expect a large effect exerted by changes in this variable. It is unsurprising that the Complimentary category is associated with the lowest risk of cancellation (near ).

Changing market_segment_type form Complimentary to Online () increases the model probability of a cancellation by . If the bookings are Corporate the likelihood drops by , and if they belong to either the Offline or Aviation categories, the likelihood drops by .

Given that complimentary bookings make up less than of bookings, while online bookings make up about (See 'Weight of Evidence Transformations' in the Appendix), it is unrealistic to simply offer more complimentary bookings in order to lower the rate of cancellations. Also, given that reducing the volume of online bookings is likely to have unpredictable and possibly negative impacts on revenue, any proposal should be based on the assumption that online bookings are the dominant mode of consumer engagement for the foreseeable future.

It is our recommmendation that interventions to this effect focus on developing incentives and reminder systems to improve the stickiness around online bookings, rather than directly changing current booking systems.

Conclusion and Recommendations

Our final model, when tested against the data, demonstrated a high degree of predictive accuracy on the outcomes. Based on the parameter values, we analyzed the effects that each feature has on whether or not a guest may cancel their booking ( ). We found that several factors influence the likelihood of a guest cancelling their booking. The most influential effects are summarized, along with their corresponding values, in the table below.

Recommendations

Because it is difficult to ascertain whether certain effects are indicators of increased cancellation risk or whether they are influencers, we will restrict our recommendations to identifying those reservations that are at higher risk of cancellation, so that the hotelier(s) have the best chance at either converting a cancellation to a rebooking, or preventing a cancellation through systematic direct marketing approaches.

We will first look at those factors where the hotel group may be able to directly influence outcomes, followed by a discussion of how to optimize outcomes for those bookings that cannot have their risk profile improved.

Repeat Visits

Given that repeated guests are less likely to cancel than new guests, we recommend lowering the barrier to entry for first-time customers. Where possible, when a first-time guest books a room, it may be worthwhile to solicit enough information to decide whether they are likely to be a repeat customer (i.e., is this a business trip stopover, family or friend location, favorite vacation destination, etc.?). If they are likely to be frequent visitors to the region, it may be worthwhile to offer them generous incentives to ensure that the first visit anchors them.

Special Requests

We cannot be fully certain whether special_requests is an indicator or and influencer, but it would be relatively simple to implement a recommender system that would convert basic demographic information into personal recommendations that might encourage special requests. This may include particular amenities such as views, access to restaurants and shops, or even local taxi and shuttle services. It would be simple to collect data to analyze whether special requests cause more stickiness with the booking, or whether they simply indicate it.

Room Type

Where possible, it may be useful to allocate rooms types associated with lower risk to those guests who carry inherent risk in their profile, such as single guests or new guests.

Lead Time

While we cannot change when guests make their reservations, we can implement automated systems that remind guests of their reservations at fixed intervals, to try to mitigate the risk of booking conflicts. We recommend sending reminders to the guest at , , and day intervals from the date of stay. This may have the effect of

reducing the probability that a guest commits to a conflicting engagement, or
increasing the amount of time between a necessary cancellation and the booking date, and thereby providing more opportunity to recover the lost revenue by opening the room up for rebooking.

Because our mean lead_time is days, knowing whether a guest will cancel prior to that time may optimize our chances of rebooking the room.

With respect to those factors that are most likely to only be risk indicators, our recommendations are as follows:

Number of Adults

While this variable may be an influencer as well, since there is a chance we could entice some cases to invite additional people through targeted marketing outreach, it seems unlikely that this would improve the majority of cases, so we will treat it as an indicator.

We recommend using ad campaigns and targeted marketing to highlight local attractions for single travelers, in order to reduce the likelihood of cancellation, but we do not recommend overcommitting to this strategy without some additional analysis.

At minimum, knowing that the size of the party has a significant relationship to the risk of cancellation will help us to provide the hoteliers with information to fuel their utilization projections.

Arrival Month

There is no escaping the holiday season (at least in the USA), but we can dynamically change our overbooking allowance based on prior period booking activity. In other words, we recommend using this information, in combination with the remaining indicators, to structure the overbooking scheme for each hotel in the group. We recommend being initially conservative in this, pending future analyses, but this is another example of a strategy that could be easily testable.

With respect to the remaining variables of interest, we make the following recommendations:

Market Segment

It is unsurprising that complimentary bookings are cancelled with much less frequency than other types, but it is not practical to significantly increase the number of available complimentary bookings. We recommend that where possible, the hoteliers should be given latitude to offer complimentary bookings when the rooms would be otherwise unutilized, since there is a chance that this may lead to future bookings (the repeated_guest effect).

Average Room Price

There is a small positive effect on likelihood of cancellation by changes in price, with each change of dollars in price associated with a change in likelihood. However, since it is impractical to use price as the enticement in all cases, we recommend focusing on offering promotions and deals where possible, and sometimes as targeted enticements to very high-risk bookings, to help influence their decision not to cancel. We also recommend a focused cost-benefit analysis of this factor. Knowing the potential revenue tradeoff would enhance the group's ability to implement targeted strategies around this factor.

Limitations of the Data

As noted in the Analysis, the data did not include timestamps, only disaggregated dates (year, month, and day, not associated with one another). This means that we cannot analyze the data for any time trends other than seasonality.

The data are also missing sufficient detail to know whether for repeated_guest, how many prior bookings they have had. It would be useful to know this information, since it could affect the risk profile assessment directly.

We also do not know what particular features are implied by each room_type_reserved or type_of_meal_plan, so it is difficult to make specific recommendations.

If future analysis is performed in order to delve more deeply into some of the more influential factors, we recommend including the omitted information in the data.

Appendix

Plot 1: Histograms for Inspection

Plot 2: Scatterplots for Inspection

Weight of Evidence Transformations (Mappings)

Room Type

DataFrameas

room_type_map

variable

-- ROOM TYPE TRANSFORMATION MAP --

SELECT room_type_reserved, room_type_reserved_woe, COUNT(room_type_reserved) AS count, ROUND(COUNT(room_type_reserved) / ROUND(
			(
			SELECT COUNT(*) 
			FROM 'data/hotel_bookings_final.csv'
			),4),4) 
			AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY room_type_reserved, room_type_reserved_woe
ORDER BY count DESC;

Meal Plan

DataFrameas

meal_plan_map

variable

-- MEAL PLAN TRANSFORMATION MAP --

SELECT DISTINCT type_of_meal_plan, type_of_meal_plan_woe, COUNT(type_of_meal_plan) AS count, ROUND(COUNT(type_of_meal_plan) / ROUND(
			(
			SELECT COUNT(*) 
			FROM 'data/hotel_bookings_final.csv'
			),4),4) 
			AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY type_of_meal_plan, type_of_meal_plan_woe
ORDER BY count DESC;

Market Segment

DataFrameas

mkt_seg_map

variable

-- MARKET SEGMENT TRANSFORMATION MAP --

SELECT DISTINCT market_segment_type, market_segment_type_woe, COUNT(market_segment_type) AS count, ROUND(
		COUNT(market_segment_type) / ROUND(
			(
			SELECT COUNT(*) 
			FROM 'data/hotel_bookings_final.csv'
			),4),4) 
			AS proportion
FROM 'data/hotel_bookings_final.csv'
GROUP BY market_segment_type, market_segment_type_woe
ORDER BY count DESC;

Pre-Post Transformed Datasets

Original Dataset

DataFrameas

sql_original

variable

-- SAMPLE VIEW OF ORIGINAL DATASET --

SELECT * FROM 'data/hotel_bookings.csv'
LIMIT 10;

‌
‌
‌

Predicting Hotel Cancellations - Ben Mathewson

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}EXECUTIVE SUMMARY

INTRODUCTION

DATA CLEANING AND MODEL SELECTION

Initial Inspection and Findings

Table 1: Data Description and Summary

Dealing with Interaction Terms

Model and Final Variable Selection

Summary of Cleansed and Imputed Data

Plot 6: Distribution of Log-Odds from Final Model

Table 3: Final Variable Selections (Iterative Regressions)

Explanation of Intermediate Model Values and their Effects

Table 2: Model Weights and Effects (Summary and Range)

Validating and Testing the Model

Plot 4: Observed vs. Predicted Outcomes by Frequency

Plot 3: Receiver Operating Characteristic for Trained Model

Plot 5: Logistic Comparison of Model Predictions vs. Observed Results

ANALYSIS

Discussion and Recommendations by Variable

Number of Adults (no_of_adults)

Number of Weekend Nights in the Booking (no_of_weekend_nights)

Number of Weeknights in the Booking (no_of_week_nights)

Lead Time in Days (lead_time)

Arrival Month (arrival_month)

Repeated Guest (repeated_guest)

Average Room Price per Night (avg_price_per_room)

Number of Special Requests (special_requests)

Type of Meal Plan (meal_plan)

Room Type (room_type)

Market Segment (market_segment_type)

Conclusion and Recommendations

Recommendations

Repeat Visits

Special Requests

Room Type

Lead Time

Number of Adults

Arrival Month

Market Segment

Average Room Price

Limitations of the Data

Appendix

Plot 1: Histograms for Inspection

Plot 2: Scatterplots for Inspection

Weight of Evidence Transformations (Mappings)

Room Type

Meal Plan

Market Segment

Pre-Post Transformed Datasets

Original Dataset

EXECUTIVE SUMMARY