Machine Learning with XGBoost in Python

    Welcome to this code-along, where we will use XGBoost to predict booking cancellations with gradient boosting, a powerful machine learning technique! Through this, you'll learn how to create, evaluate, and tune XGBoost models efficiently. There will be time to answer any questions, so please add them!

    The Dataset

    The session's dataset is a CSV file named hotel_bookings_clean.csv, which contains data on hotel bookings.

    Acknowledgements

    The dataset was downloaded from Kaggle. The data originally comes from the article Hotel booking demand datasets by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was then cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020. For the purposes of this code-along, it was further pre-processed into clean, ready-to-use features (e.g., by dropping irrelevant columns and one-hot-encoding categorical variables). The dataset has the following license.
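
    For reference, the one-hot-encoding step can be reproduced with pandas. Below is a minimal sketch, assuming the original Kaggle file (hotel_bookings.csv) with categorical columns such as hotel, meal, market_segment, distribution_channel, reserved_room_type, deposit_type, and customer_type; it is illustrative only, not the exact pre-processing pipeline used to produce hotel_bookings_clean.csv.

    import pandas as pd

    # Illustrative sketch: one-hot encode the categorical columns of the
    # original Kaggle file to produce indicator columns like hotel_City,
    # meal_BB, and deposit_type_No_Deposit (1 = true, 0 = false).
    raw = pd.read_csv('hotel_bookings.csv')  # hypothetical path to the raw file
    encoded = pd.get_dummies(
        raw,
        columns=['hotel', 'meal', 'market_segment', 'distribution_channel',
                 'reserved_room_type', 'deposit_type', 'customer_type'],
        dtype=int,
    )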

    Data Dictionary

    The dataset contains the following 53 columns:

    For binary variables: 1 = true and 0 = false.

    Target
    • is_canceled: Binary variable indicating whether a booking was canceled
    Features
    • lead_time: Number of days between the booking date and the arrival date
    • arrival_date_week_number, arrival_date_day_of_month, arrival_date_month: Week number, day of the month, and month number of the arrival date
    • stays_in_weekend_nights, stays_in_week_nights: Number of weekend nights (Saturday and Sunday) and weeknights (Monday to Friday) the customer booked
    • adults, children, babies: Number of adults, children, and babies booked for the stay
    • is_repeated_guest: Binary variable indicating whether the customer was a repeat guest
    • previous_cancellations: Number of prior bookings that were canceled by the customer
    • previous_bookings_not_canceled: Number of prior bookings that were not canceled by the customer
    • required_car_parking_spaces: Number of parking spaces requested by the customer
    • total_of_special_requests: Number of special requests made by the customer
    • avg_daily_rate: Average daily rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights
    • booked_by_company: Binary variable indicating whether a company booked the booking
    • booked_by_agent: Binary variable indicating whether an agent booked the booking
    • hotel_City: Binary variable indicating whether the booked hotel is a "City Hotel"
    • hotel_Resort: Binary variable indicating whether the booked hotel is a "Resort Hotel"
    • meal_BB: Binary variable indicating whether a bed & breakfast meal was booked
    • meal_HB: Binary variable indicating whether a half board meal was booked
    • meal_FB: Binary variable indicating whether a full board meal was booked
    • meal_No_meal: Binary variable indicating whether there was no meal package booked
    • market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_Undefined: Indicates the market segment designation with a value of 1. "TA" = travel agent, "TO" = tour operator
    • distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_Undefined: Indicates the booking distribution channel with a value of 1. "TA" = travel agent, "TO" = tour operator, "GDS" = Global Distribution System
    • reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_L: Indicates the code of the reserved room type with a value of 1. Codes are presented instead of designations for anonymity reasons
    • deposit_type_No_Deposit: Binary variable indicating whether no deposit was made
    • deposit_type_Non_Refund: Binary variable indicating whether a non-refundable deposit equal to the total stay cost was made
    • deposit_type_Refundable: Binary variable indicating whether a refundable deposit with a value under the total stay cost was made
    • customer_type_Contract: Binary variable indicating whether the booking has an allotment or other type of contract associated with it
    • customer_type_Group: Binary variable indicating whether the booking is associated with a group
    • customer_type_Transient: Binary variable indicating whether the booking is not part of a group or contract, and is not associated with another transient booking
    • customer_type_Transient-Party: Binary variable indicating whether the booking is transient, but is associated with at least one other transient booking

    1. Getting to know our data

    Let's get to know our columns and split our data into features and labels!

    # Import libraries
    import pandas as pd
    import xgboost as xgb # XGBoost typically uses the alias "xgb"
    import numpy as np
    # Read in the dataset
    bookings = pd.read_csv('hotel_bookings_clean.csv')
    
    # List out our columns
    bookings.info()

    It looks like we have 53 columns with 119,210 rows. All the datatypes are numeric and ready for use.
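
    If you want to double-check that claim programmatically, a quick (optional) sanity check on the dtypes works:

    # Confirm every column is numeric before modeling
    print(bookings.dtypes.value_counts())
    assert bookings.select_dtypes(exclude='number').empty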

    # Take a closer look at column distributions
    bookings.describe()
    # Plot cancellation counts to visualize the proportion of not canceled vs. canceled
    bookings['is_canceled'].value_counts().plot(kind='bar')

    Remember that for our binary variables, like is_canceled, 1 = true and 0 = false.

    # Get the exact percentages of not canceled and canceled
    bookings['is_canceled'].value_counts()/bookings['is_canceled'].count()*100
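
    Equivalently, value_counts can do the normalization for us:

    # Same percentages using value_counts' built-in normalization
    bookings['is_canceled'].value_counts(normalize=True) * 100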

    Splitting data

    Let's split our label and features so we can get to building models! The first column is our target label, is_canceled; the rest are features.

    # Define X and y
    X, y = bookings.iloc[:,1:], bookings.iloc[:,0]
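
    An equivalent split by column name, which is more robust if the column order ever changes, would be:

    # Same split, selecting by name rather than position
    X = bookings.drop(columns='is_canceled')
    y = bookings['is_canceled']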

    2. Your First XGBoost Classifier

    XGBoost has a scikit-learn API, which is useful if you want to use scikit-learn classes and methods with an XGBoost model (e.g., fit(), predict()). In this section, we'll try the API out with the xgboost.XGBClassifier() class and get a baseline accuracy for the rest of our work. So that our results are reproducible, we'll set random_state=123.
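
    To see that interoperability in action, here is a small optional sketch that passes an XGBClassifier straight into scikit-learn's cross_val_score; the name demo_clf and the cv/scoring choices are arbitrary and just for illustration.

    from sklearn.model_selection import cross_val_score

    # Because XGBClassifier implements the scikit-learn estimator API,
    # it plugs directly into scikit-learn utilities like cross-validation.
    demo_clf = xgb.XGBClassifier(n_estimators=10, random_state=123)
    scores = cross_val_score(demo_clf, X, y, cv=3, scoring='accuracy')
    print(scores.mean())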

    As a reminder, gradient boosting sequentially trains weak learners where each weak learner tries to correct its predecessor's mistakes. First, we'll instantiate a simple XGBoost classifier without changing any of the other parameters, and we'll inspect the parameters that we haven't touched.
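
    Before we do, here is an optional toy sketch to make "correcting its predecessor's mistakes" concrete: gradient boosting for regression with squared error, where each new tree is fit to the residuals of the ensemble so far. The data (X_demo, y_demo) is synthetic, and this is not XGBoost's exact algorithm (which also uses regularization and second-order gradients).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(123)
    X_demo = rng.uniform(0, 10, size=(200, 1))
    y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.1, size=200)

    learning_rate = 0.3
    pred = np.zeros_like(y_demo)    # start from a constant (zero) prediction
    for _ in range(10):             # 10 boosting rounds
        residuals = y_demo - pred   # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
        pred += learning_rate * tree.predict(X_demo)  # nudge toward the target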

    from sklearn.model_selection import train_test_split
    
    # Train and test split using sklearn
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
    
    # Instantiate an XGBClassifier
    xgb_clf = xgb.XGBClassifier(random_state=123)
    
    # Inspect the parameters
    xgb_clf.get_params()

    There are a few things to note:

    • The booster parameter is gbtree. This means the weak learners are decision trees in this model. gbtree is the default, and we will keep it that way.
    • The objective function, or loss function, is defined as binary:logistic. The objective function quantifies how far off a prediction is from the actual result, and we want to minimize it to achieve the smallest possible loss. binary:logistic, the default for classifiers, outputs the predicted probability of the positive class (in our case, that a booking is canceled).
    • n_estimators is the number of gradient-boosted trees we want in our model. It's equivalent to the number of boosting rounds. For our purposes, we don't want too many boosting rounds, or training will take too long, so let's lower it from the default of 100 to 10.
    • max_depth is the maximum tree depth allowed. Tree depth is the length of the longest path from the root node to a leaf node. Setting this too high will give our model more variance, or more potential to overfit, and, as with n_estimators, increasing it lengthens training. Let's keep this at 3.
    • For our eval_metric (the evaluation metric for validation data), we will use error, as defined by the XGBoost documentation (we'll reproduce this calculation by hand after fitting the model below):

    Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.

    # Set n_estimators to 10
    xgb_clf.set_params(n_estimators=10)
    
    # Set max_depth to 3
    xgb_clf.set_params(max_depth=3)
    
    # Set the evaluation metric to error
    xgb_clf.set_params(eval_metric='error')
    
    # Fit it to the training set
    xgb_clf.fit(X_train, y_train)
    
    # Predict the labels of the test set
    preds = xgb_clf.predict(X_test)

    # Inspect the final parameters
    xgb_clf.get_params()
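
    With predictions in hand, we can compute our baseline accuracy and, as promised in the eval_metric bullet above, reproduce XGBoost's error metric by hand: threshold the predicted probability of the positive class at 0.5 and count the share of wrong cases.

    from sklearn.metrics import accuracy_score

    # Baseline accuracy of the classifier on the test set
    print(accuracy_score(y_test, preds))

    # Reproduce the 'error' metric: #(wrong cases)/#(all cases)
    proba = xgb_clf.predict_proba(X_test)[:, 1]   # P(booking is canceled)
    hard_preds = (proba > 0.5).astype(int)
    print((hard_preds != y_test).mean())          # equals 1 - accuracy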