Competition - predict hotel cancellation

    Predicting Hotel Cancellations

    Booking cancellations can significantly affect demand management strategies in the hospitality sector. Over 140 million bookings are made online each year, and a significant proportion of hotel bookings come through popular travel websites.

    To limit the problems caused by booking cancellations, hotels rely on rigid cancellation policies, inventory management, and overbooking strategies, which can themselves hurt revenue and reputation.

    Once a reservation has been canceled, there is almost nothing to be done, which causes problems for many hotels and hotel technology companies. Predicting which reservations are likely to be canceled, and preventing those cancellations, can therefore generate additional revenue for both.

    Motivation

    Imagine if there was a way to predict which guests are likely to cancel their hotel bookings. Using Machine Learning with Python, this is possible. By predicting cancellations, hotels can generate additional revenue, improve forecasting accuracy, and reduce uncertainty in business management decisions.

    For those who want to follow a structured approach while working on a machine learning project, this analysis provides a comprehensive guide. It covers the entire process of solving a real-world machine learning project, from understanding the business problem to deploying the model on the cloud.

    1. Description of the project

    • Understanding the Business Problem
    • Data Collection and Understanding
    • Data Exploration
    • Data Preparation
    • Modeling
    • Model Deployment

    1.1 Understanding the Business Problem

    The goal of this project is to predict which guests are likely to cancel their hotel booking, using machine learning with Python. Identifying reservations at risk of cancellation, and acting to prevent those cancellations, can increase revenue, improve forecasts, and reduce uncertainty in business management decisions.

    1.2 Data Collection and Understanding

    The business has provided us with their bookings data in a file called hotel_bookings.csv, which contains the following:

    Column                                  Description
    Booking_ID                              Unique identifier of the booking.
    no_of_adults                            The number of adults.
    no_of_children                          The number of children.
    no_of_weekend_nights                    Number of weekend nights (Saturday or Sunday).
    no_of_week_nights                       Number of week nights (Monday to Friday).
    type_of_meal_plan                       Type of meal plan included in the booking.
    required_car_parking_space              Whether a car parking space is required.
    room_type_reserved                      The type of room reserved.
    lead_time                               Number of days before the arrival date the booking was made.
    arrival_year                            Year of arrival.
    arrival_month                           Month of arrival.
    arrival_date                            Date of the month for arrival.
    market_segment_type                     How the booking was made.
    repeated_guest                          Whether the guest has previously stayed at the hotel.
    no_of_previous_cancellations            Number of previous cancellations.
    no_of_previous_bookings_not_canceled    Number of previous bookings that were not canceled.
    avg_price_per_room                      Average price per day of the booking.
    no_of_special_requests                  Count of special requests made as part of the booking.
    booking_status                          Whether the booking was canceled or not.

    Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
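
Before exploring the data, it can help to confirm that the file actually contains the columns listed above. The following is a minimal sketch and not part of the original analysis: the helper name `check_columns` and the tiny in-memory frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Column names taken from the data description above.
EXPECTED_COLUMNS = [
    "Booking_ID", "no_of_adults", "no_of_children",
    "no_of_weekend_nights", "no_of_week_nights", "type_of_meal_plan",
    "required_car_parking_space", "room_type_reserved", "lead_time",
    "arrival_year", "arrival_month", "arrival_date",
    "market_segment_type", "repeated_guest",
    "no_of_previous_cancellations",
    "no_of_previous_bookings_not_canceled", "avg_price_per_room",
    "no_of_special_requests", "booking_status",
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, each as a sorted list."""
    actual = set(df.columns)
    missing = sorted(set(expected) - actual)
    unexpected = sorted(actual - set(expected))
    return missing, unexpected


# Hypothetical two-column frame, used only to demonstrate the helper.
demo = pd.DataFrame({"Booking_ID": ["X1"], "extra_col": [0]})
missing, unexpected = check_columns(demo)
print(len(missing), unexpected)
```

With the real file, the same call would be `check_columns(df)` once `df` has been loaded; two empty lists would indicate that the file matches the description.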

    1.3 Data Exploration

    In this step, we apply Exploratory Data Analysis (EDA) to find out which features contribute most to predicting cancellations, using Pandas for data analysis and Matplotlib and Seaborn for visualization. It is good practice to understand the data first and gather as many insights from it as possible.

    # Data handling and persistence
    import numpy as np
    import pandas as pd
    import pickle
    # Visualization
    import matplotlib.pyplot as plt
    from matplotlib import pyplot
    import seaborn as sns
    import plotly.express as px
    # Statistics
    from scipy import stats
    import statsmodels.api as sm
    # Modeling and evaluation
    from sklearn import datasets, linear_model
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import roc_curve, roc_auc_score
    
    pd.options.display.max_columns = 999
    df = pd.read_csv("data/hotel_bookings.csv")
    df.sample(10)
    df.shape
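
A common next step after loading is to check for missing values and to see how balanced the target is. The following is a minimal sketch, assuming `booking_status` is the target column as described in the data dictionary. The helper name `quick_overview` is hypothetical, and the toy frame and its label values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def quick_overview(df, target="booking_status"):
    """Return missing-value counts per column and the share of each target class."""
    missing = df.isna().sum()  # number of NaN values in each column
    class_share = df[target].value_counts(normalize=True, dropna=False)
    return missing, class_share


# Toy data with made-up values, used only to demonstrate the helper.
toy = pd.DataFrame({
    "lead_time": [10, None, 45, 3],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})
missing, class_share = quick_overview(toy)
print(missing)
print(class_share)
```

On the real data the call would be `quick_overview(df)`. A strongly imbalanced target would be worth keeping in mind when choosing evaluation metrics later.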

    Descriptive statistics. Univariate analysis shows how the data in each feature is distributed and describes central tendencies such as the mean, median, and mode.

    df.describe()
    df.info()
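
Note that `df.describe()` reports the mean and the median (the 50% row) for numeric columns, but not the mode. The following is a minimal sketch of one way to collect all three in a single table. The helper name `central_tendency` is hypothetical, and the toy values are made up for illustration.

```python
import pandas as pd


def central_tendency(df):
    """Return mean, median, and first mode for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        # mode() can return several rows when values tie; keep the first row.
        "mode": numeric.mode().iloc[0],
    })


# Toy data with made-up values, used only to demonstrate the helper.
toy = pd.DataFrame({"lead_time": [1, 2, 2, 7], "no_of_adults": [2, 2, 1, 2]})
summary = central_tendency(toy)
print(summary)
```

On the real data the call would be `central_tendency(df)`. When a column has several equally frequent values, this sketch keeps only the smallest of them, which is a simplification.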