Machine Learning with XGBoost in Python

    Welcome to this code-along, where we will use XGBoost to predict booking cancellations with gradient boosting, a powerful machine learning technique! Through this, you'll learn how to create, evaluate, and tune XGBoost models efficiently. There will be time to answer any questions, so please add them!

    The Dataset

    The session's dataset is a CSV file named hotel_bookings_clean.csv, which contains data on hotel bookings.

    Acknowledgements

    The dataset was downloaded from Kaggle. The data originally comes from the article Hotel booking demand datasets by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was then cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020. For the purposes of this code-along, it was further pre-processed into clean, ready-to-use features (e.g., by dropping irrelevant columns and one-hot-encoding categorical variables). The dataset has the following license.
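
    For reference, the one-hot-encoding step can be reproduced with pandas. Below is a minimal sketch, assuming the original Kaggle file (hotel_bookings.csv) with categorical columns such as hotel, meal, market_segment, distribution_channel, reserved_room_type, deposit_type, and customer_type; it is illustrative only, not the exact pre-processing pipeline used to produce hotel_bookings_clean.csv.

    import pandas as pd

    # Illustrative sketch: one-hot encode the categorical columns of the
    # original Kaggle file to produce indicator columns like hotel_City,
    # meal_BB, and deposit_type_No_Deposit (1 = true, 0 = false).
    raw = pd.read_csv('hotel_bookings.csv')  # hypothetical path to the raw file
    encoded = pd.get_dummies(
        raw,
        columns=['hotel', 'meal', 'market_segment', 'distribution_channel',
                 'reserved_room_type', 'deposit_type', 'customer_type'],
        dtype=int,
    )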

    Data Dictionary

    The dataset contains the following 53 columns:

    For binary variables: 1 = true and 0 = false.

    Target
    • is_canceled: Binary variable indicating whether a booking was canceled
    Features
    • lead_time: Number of days between the booking date and the arrival date
    • arrival_date_week_number, arrival_date_day_of_month, arrival_date_month: Week number, day of the month, and month number of the arrival date
    • stays_in_weekend_nights, stays_in_week_nights: Number of weekend nights (Saturday and Sunday) and weeknights (Monday to Friday) the customer booked
    • adults, children, babies: Number of adults, children, and babies booked for the stay
    • is_repeated_guest: Binary variable indicating whether the customer was a repeat guest
    • previous_cancellations: Number of prior bookings that were canceled by the customer
    • previous_bookings_not_canceled: Number of prior bookings that were not canceled by the customer
    • required_car_parking_spaces: Number of parking spaces requested by the customer
    • total_of_special_requests: Number of special requests made by the customer
    • avg_daily_rate: Average daily rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights
    • booked_by_company: Binary variable indicating whether a company booked the booking
    • booked_by_agent: Binary variable indicating whether an agent booked the booking
    • hotel_City: Binary variable indicating whether the booked hotel is a "City Hotel"
    • hotel_Resort: Binary variable indicating whether the booked hotel is a "Resort Hotel"
    • meal_BB: Binary variable indicating whether a bed & breakfast meal was booked
    • meal_HB: Binary variable indicating whether a half board meal was booked
    • meal_FB: Binary variable indicating whether a full board meal was booked
    • meal_No_meal: Binary variable indicating whether there was no meal package booked
    • market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_Undefined: Indicates the market segment designation with a value of 1. "TA" = travel agent, "TO" = tour operator
    • distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_Undefined: Indicates the booking distribution channel with a value of 1. "TA" = travel agent, "TO" = tour operator, "GDS" = Global Distribution System
    • reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_L: Indicates the code of the reserved room type with a value of 1. Codes are presented instead of designations for anonymity reasons
    • deposit_type_No_Deposit: Binary variable indicating whether no deposit was made
    • deposit_type_Non_Refund: Binary variable indicating whether a non-refundable deposit equal to the total stay cost was made
    • deposit_type_Refundable: Binary variable indicating whether a refundable deposit with a value under the total stay cost was made
    • customer_type_Contract: Binary variable indicating whether the booking has an allotment or other type of contract associated with it
    • customer_type_Group: Binary variable indicating whether the booking is associated with a group
    • customer_type_Transient: Binary variable indicating whether the booking is not part of a group or contract, and is not associated with another transient booking
    • customer_type_Transient-Party: Binary variable indicating whether the booking is transient, but is associated with at least one other transient booking

    1. Getting to know our data

    Let's get to know our columns and split our data into features and labels!

    # Import libraries
    import pandas as pd
    import xgboost as xgb # XGBoost typically uses the alias "xgb"
    import numpy as np
    # Read in the dataset
    bookings = pd.read_csv('hotel_bookings_clean.csv')
    
    # List out our columns
    bookings.info()

    It looks like we have 53 columns with 119,210 rows. All the datatypes are numeric and ready for use.
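
    If you want to double-check that claim programmatically, a quick (optional) sanity check on the dtypes works:

    # Confirm every column is numeric before modeling
    print(bookings.dtypes.value_counts())
    assert bookings.select_dtypes(exclude='number').empty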

    # Take a closer look at column distributions
    bookings.describe()
    # Plot cancellation counts to visualize the proportion of not canceled vs. canceled
    bookings['is_canceled'].value_counts().plot(kind='bar')

    Remember that for our binary variables, like is_canceled, 1 = true and 0 = false.

    # Get the exact percentages of not canceled and canceled
    bookings['is_canceled'].value_counts()/bookings['is_canceled'].count()*100
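
    Equivalently, value_counts can do the normalization for us:

    # Same percentages using value_counts' built-in normalization
    bookings['is_canceled'].value_counts(normalize=True) * 100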

    Splitting data

    Let's split our label and features so we can get to building models! The first column is our target label, is_canceled; the rest are features.

    # Define X and y
    X, y = bookings.iloc[:,1:], bookings.iloc[:,0]
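
    An equivalent split by column name, which is more robust if the column order ever changes, would be:

    # Same split, selecting by name rather than position
    X = bookings.drop(columns='is_canceled')
    y = bookings['is_canceled']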

    2. Your First XGBoost Classifier

    XGBoost has a scikit-learn API, which is useful if you want to use scikit-learn classes and methods with an XGBoost model (e.g., fit(), predict()). In this section, we'll try the API out with the xgboost.XGBClassifier() class and get a baseline accuracy for the rest of our work. So that our results are reproducible, we'll set random_state=123.
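
    To see that interoperability in action, here is a small optional sketch that passes an XGBClassifier straight into scikit-learn's cross_val_score; the name demo_clf and the cv/scoring choices are arbitrary and just for illustration.

    from sklearn.model_selection import cross_val_score

    # Because XGBClassifier implements the scikit-learn estimator API,
    # it plugs directly into scikit-learn utilities like cross-validation.
    demo_clf = xgb.XGBClassifier(n_estimators=10, random_state=123)
    scores = cross_val_score(demo_clf, X, y, cv=3, scoring='accuracy')
    print(scores.mean())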

    As a reminder, gradient boosting sequentially trains weak learners where each weak learner tries to correct its predecessor's mistakes. First, we'll instantiate a simple XGBoost classifier without changing any of the other parameters, and we'll inspect the parameters that we haven't touched.
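
    Before we do, here is an optional toy sketch to make "correcting its predecessor's mistakes" concrete: gradient boosting for regression with squared error, where each new tree is fit to the residuals of the ensemble so far. The data (X_demo, y_demo) is synthetic, and this is not XGBoost's exact algorithm (which also uses regularization and second-order gradients).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(123)
    X_demo = rng.uniform(0, 10, size=(200, 1))
    y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.1, size=200)

    learning_rate = 0.3
    pred = np.zeros_like(y_demo)    # start from a constant (zero) prediction
    for _ in range(10):             # 10 boosting rounds
        residuals = y_demo - pred   # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X_demo, residuals)
        pred += learning_rate * tree.predict(X_demo)  # nudge toward the target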

    from sklearn.model_selection import train_test_split
    
    # Train and test split using sklearn
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
    
    # Instantiate an XGBClassifier
    xgb_clf = xgb.XGBClassifier(random_state=123)
    
    # Inspect the parameters
    xgb_clf.get_params()

    There are a few things to note:

    • The booster parameter is gbtree. This means the weak learners are decision trees in this model. gbtree is the default, and we will keep it that way.
    • The objective function, or loss function, is defined as binary:logistic. The objective function quantifies how far off a prediction is from the actual result, and we want to minimize it to achieve the smallest possible loss. binary:logistic, the default for classifiers, outputs the predicted probability of the positive class (in our case, that a booking is canceled).
    • n_estimators is the number of gradient-boosted trees we want in our model. It's equivalent to the number of boosting rounds. For our purposes, we don't want too many boosting rounds, or training will take too long, so let's lower it from the default of 100 to 10.
    • max_depth is the maximum tree depth allowed. Tree depth is the length of the longest path from the root node to a leaf node. Setting this too high will give our model more variance, or more potential to overfit, and, as with n_estimators, increasing it lengthens training. Let's keep this at 3.
    • For our eval_metric (the evaluation metric for validation data), we will use error, as defined by the XGBoost documentation (we'll reproduce this calculation by hand after fitting the model below):

    Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.

    # Set n_estimators to 10
    xgb_clf.set_params(n_estimators=10)
    
    # Set max_depth to 3
    xgb_clf.set_params(max_depth=3)
    
    # Set the evaluation metric to error
    xgb_clf.set_params(eval_metric='error')
    
    # Fit it to the training set
    xgb_clf.fit(X_train, y_train)
    
    # Predict the labels of the test set
    preds = xgb_clf.predict(X_test)

    # Inspect the final parameters
    xgb_clf.get_params()
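
    With predictions in hand, we can compute our baseline accuracy and, as promised in the eval_metric bullet above, reproduce XGBoost's error metric by hand: threshold the predicted probability of the positive class at 0.5 and count the share of wrong cases.

    from sklearn.metrics import accuracy_score

    # Baseline accuracy of the classifier on the test set
    print(accuracy_score(y_test, preds))

    # Reproduce the 'error' metric: #(wrong cases)/#(all cases)
    proba = xgb_clf.predict_proba(X_test)[:, 1]   # P(booking is canceled)
    hard_preds = (proba > 0.5).astype(int)
    print((hard_preds != y_test).mean())          # equals 1 - accuracy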