Identifying factors that predict hotel cancellation
  • Predicting Hotel Cancellations

    🏨 Background

    You are supporting a hotel with a project aimed at increasing revenue from its room bookings. The hotel believes that data science can help reduce the number of cancellations. This is where you come in!

    They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

    The Data

    They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

    Column                                  Description
    Booking_ID                              Unique identifier of the booking.
    no_of_adults                            The number of adults.
    no_of_children                          The number of children.
    no_of_weekend_nights                    Number of weekend nights (Saturday or Sunday).
    no_of_week_nights                       Number of week nights (Monday to Friday).
    type_of_meal_plan                       Type of meal plan included in the booking.
    required_car_parking_space              Whether a car parking space is required.
    room_type_reserved                      The type of room reserved.
    lead_time                               Number of days before the arrival date the booking was made.
    arrival_year                            Year of arrival.
    arrival_month                           Month of arrival.
    arrival_date                            Date of the month for arrival.
    market_segment_type                     How the booking was made.
    repeated_guest                          Whether the guest has previously stayed at the hotel.
    no_of_previous_cancellations            Number of previous cancellations.
    no_of_previous_bookings_not_canceled    Number of previous bookings that were not canceled.
    avg_price_per_room                      Average price per day of the booking.
    no_of_special_requests                  Count of special requests made as part of the booking.
    booking_status                          Whether the booking was cancelled or not.

    Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
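The column list above can double as a quick schema check once the file is loaded. The sketch below is illustrative only: `EXPECTED_COLUMNS` is transcribed from the table, `missing_columns` is a hypothetical helper that is not part of the original notebook, and a toy frame stands in for the real file.

```python
import pandas as pd

# Column names transcribed from the table above.
EXPECTED_COLUMNS = [
    "Booking_ID", "no_of_adults", "no_of_children", "no_of_weekend_nights",
    "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space",
    "room_type_reserved", "lead_time", "arrival_year", "arrival_month",
    "arrival_date", "market_segment_type", "repeated_guest",
    "no_of_previous_cancellations", "no_of_previous_bookings_not_canceled",
    "avg_price_per_room", "no_of_special_requests", "booking_status",
]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Toy frame standing in for the real file: every column except the target.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(toy))
```

On the real data the same call would be `missing_columns(hotels)`, where an empty list means every documented column is present.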

    The Challenge

    • Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.

    Imports

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    
    sns.set_style('whitegrid')
    
    import missingno as msno
    from datetime import datetime
    
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV
    from sklearn.compose import make_column_transformer
    from sklearn.pipeline import make_pipeline
    
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
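    # mutual_info_classif is the variant meant for a categorical target such as booking_status
    from sklearn.feature_selection import SelectPercentile, mutual_info_regression, mutual_info_classif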
    
    # models
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier
    
    # load the bookings data
    hotels = pd.read_csv("data/hotel_bookings.csv")
    print(hotels.shape)
    hotels.head()
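After loading, a first look usually covers column types, the share of missing values, and how the target is split. The following is a minimal sketch on a made-up frame: `quick_overview` is a hypothetical helper, and the `Canceled` / `Not_Canceled` labels are an assumption about how `booking_status` is encoded in the file.

```python
import numpy as np
import pandas as pd


def quick_overview(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return dtype and null share per column, and print the target split."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_share": frame.isnull().mean(),
    })
    # normalize=True gives proportions; dropna=False keeps missing targets visible
    print(frame[target].value_counts(normalize=True, dropna=False))
    return summary


# Invented rows for illustration; the label values are assumed, not taken from the data.
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 85.0, 3.0],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})
overview = quick_overview(toy, target="booking_status")
print(overview)
```

With the real data this would be called as `quick_overview(hotels, target="booking_status")`.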

    Preprocessing

    df = hotels.copy()
    
    # drop unique identifier column
    df = df.drop('Booking_ID', axis = 1)

    Null values

    # visualise where values are missing
    msno.matrix(df);
    
    # count nulls per column and keep only the columns that have any
    nulldf = df.isnull().sum().sort_values(ascending = False).reset_index()
    nulldf.columns = ['feature', 'null_count']
    nulldf = nulldf[nulldf['null_count'] > 0]
    
    # drop rows that are null in every one of those columns; summing the raw values
    # and testing for 0 fails on text columns and also catches rows of genuine zeros
    df = df.dropna(subset = nulldf['feature'].tolist(), how = 'all')
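To make the row filter concrete, here is a small sketch on invented data: it finds the columns that contain nulls and drops only the rows that are missing in all of them. The toy frame and its values are made up for illustration, and `dropna(how="all", subset=...)` is standard pandas, shown here as one way to express the step rather than as the notebook's own code.

```python
import numpy as np
import pandas as pd

# Invented rows: row 1 is missing everywhere it can be, row 2 holds genuine zeros.
toy = pd.DataFrame({
    "lead_time": [12.0, np.nan, 0.0, np.nan],
    "avg_price_per_room": [95.5, np.nan, 0.0, 80.0],
    "arrival_year": [2018, 2018, 2017, 2018],  # complete column, ignored by the filter
})

# Columns with at least one missing value.
null_cols = toy.columns[toy.isnull().any()].tolist()

# Keep a row unless every one of those columns is missing.
cleaned = toy.dropna(subset=null_cols, how="all")

print(null_cols)
print(cleaned.index.tolist())
```

Row 2 is kept because its zeros are real values, whereas a filter that tests whether the row sum equals 0 would drop it together with row 1.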

    Create Date Column