Machine Learning with XGBoost in Python
Welcome to this code-along, where we will use XGBoost to predict booking cancellations with gradient boosting, a powerful machine learning technique! Along the way, you'll learn how to create, evaluate, and tune XGBoost models efficiently. There will be time to answer questions, so please ask them as we go!
The Dataset
The session's dataset is a CSV file named hotel_bookings_clean.csv, which contains data on hotel bookings.
Acknowledgements
The dataset was downloaded from Kaggle. The data originally comes from the article Hotel booking demand datasets by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was then cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020. For the purposes of this code-along, it was further pre-processed to have cleaner, ready-to-use features (e.g., dropping irrelevant columns, one-hot encoding). The dataset has the following license.
Data Dictionary
The dataset contains 53 columns. For binary variables: 1 = true and 0 = false.
Target
is_canceled: Binary variable indicating whether a booking was canceled
Features
lead_time: Number of days between the booking date and the arrival date
arrival_date_week_number, arrival_date_day_of_month, arrival_date_month: Week number, day of the month, and month number of the arrival date
stays_in_weekend_nights, stays_in_week_nights: Number of weekend nights (Saturday and Sunday) and week nights (Monday to Friday) the customer booked
adults, children, babies: Number of adults, children, and babies booked for the stay
is_repeated_guest: Binary variable indicating whether the customer was a repeat guest
previous_cancellations: Number of prior bookings that were canceled by the customer
previous_bookings_not_canceled: Number of prior bookings that were not canceled by the customer
required_car_parking_spaces: Number of parking spaces requested by the customer
total_of_special_requests: Number of special requests made by the customer
avg_daily_rate: Average daily rate, defined as the sum of all lodging transactions divided by the total number of staying nights
booked_by_company: Binary variable indicating whether a company made the booking
booked_by_agent: Binary variable indicating whether an agent made the booking
hotel_City: Binary variable indicating whether the booked hotel is a "City Hotel"
hotel_Resort: Binary variable indicating whether the booked hotel is a "Resort Hotel"
meal_BB: Binary variable indicating whether a bed & breakfast meal was booked
meal_HB: Binary variable indicating whether a half board meal was booked
meal_FB: Binary variable indicating whether a full board meal was booked
meal_No_meal: Binary variable indicating whether no meal package was booked
market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_Undefined: Indicates the market segment designation with a value of 1. "TA" = travel agent, "TO" = tour operators
distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_Undefined: Indicates the booking distribution channel with a value of 1. "TA" = travel agent, "TO" = tour operators, "GDS" = Global Distribution System
reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_L: Indicates the code of the reserved room type with a value of 1. The code is presented instead of the designation for anonymity reasons
deposit_type_No_Deposit: Binary variable indicating whether no deposit was made
deposit_type_Non_Refund: Binary variable indicating whether a deposit was made in the value of the total stay cost
deposit_type_Refundable: Binary variable indicating whether a deposit was made with a value under the total stay cost
customer_type_Contract: Binary variable indicating whether the booking has an allotment or other type of contract associated with it
customer_type_Group: Binary variable indicating whether the booking is associated with a group
customer_type_Transient: Binary variable indicating whether the booking is not part of a group or contract and is not associated with another transient booking
customer_type_Transient-Party: Binary variable indicating whether the booking is transient but is associated with at least one other transient booking
1. Getting to know our data
Let's get to know our columns and split our data into features and labels!
# Import libraries
import pandas as pd
import xgboost as xgb # XGBoost typically uses the alias "xgb"
import numpy as np
# Read in the dataset
bookings = pd.read_csv('hotel_bookings_clean.csv')
# List out our columns
bookings.info()
It looks like we have 53 columns with 119,210 rows. All the datatypes are numeric and ready for use.
# Take a closer look at column distributions
bookings.describe()
# Plot cancellation counts to visualize the proportion of not canceled vs. canceled
bookings['is_canceled'].value_counts().plot(kind='bar')
Remember, for our binary variables like is_canceled, 1 = true and 0 = false.
# Get the exact percentages of not canceled and canceled
bookings['is_canceled'].value_counts()/bookings['is_canceled'].count()*100
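As a side note, pandas can return these proportions directly via the normalize argument; a minimal sketch on a toy Series (the values here are hypothetical, not the real dataset):

```python
import pandas as pd

# Toy stand-in for bookings['is_canceled'] (hypothetical values)
s = pd.Series([0, 0, 1, 0, 1])

# normalize=True returns proportions; multiply by 100 for percentages
pct = s.value_counts(normalize=True) * 100
print(pct)  # 0 -> 60.0, 1 -> 40.0
```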
Splitting data
Let's split our label and features so we can get to building models! The first column is our target label, is_canceled; the rest are features.
# Define X and y
X, y = bookings.iloc[:,1:], bookings.iloc[:,0]
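Positional indexing works here because is_canceled happens to be the first column. An equivalent, name-based split (shown on a hypothetical toy frame) is more robust to column reordering:

```python
import pandas as pd

# Hypothetical toy frame with the same layout: target first, features after
toy = pd.DataFrame({'is_canceled': [0, 1, 0],
                    'lead_time': [10, 200, 35],
                    'adults': [2, 1, 2]})

# Select by name instead of position
X_toy = toy.drop(columns='is_canceled')
y_toy = toy['is_canceled']

print(X_toy.columns.tolist())  # ['lead_time', 'adults']
```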
2. Your First XGBoost Classifier
XGBoost has a scikit-learn API, which is useful if you want to use different scikit-learn classes and methods on an XGBoost model (e.g., predict(), fit()). In this section, we'll try the API out with the xgboost.XGBClassifier() class and get a baseline accuracy for the rest of our work. So that our results are reproducible, we'll set random_state=123.
As a reminder, gradient boosting sequentially trains weak learners where each weak learner tries to correct its predecessor's mistakes. First, we'll instantiate a simple XGBoost classifier without changing any of the other parameters, and we'll inspect the parameters that we haven't touched.
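That sequential-correction idea can be sketched in a few lines. This is a simplified squared-error illustration using shallow scikit-learn trees as the weak learners on toy data; it is not XGBoost's actual internals, just the core boosting loop:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy illustration of boosting: each weak learner fits the residual
# errors left by the ensemble so far (squared-error case)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = np.sin(X_demo).ravel()

ensemble_pred = np.zeros_like(y_demo)      # ensemble starts at 0
for _ in range(10):                        # 10 boosting rounds
    residual = y_demo - ensemble_pred      # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=3).fit(X_demo, residual)
    ensemble_pred += 0.3 * stump.predict(X_demo)  # shrunken correction step

# The squared error shrinks as each round corrects its predecessors' mistakes
print(np.mean((y_demo - ensemble_pred) ** 2))
```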
from sklearn.model_selection import train_test_split
# Train and test split using sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=123)
# Instantiate an XGBClassifier
xgb_clf = xgb.XGBClassifier(random_state=123)
# Inspect the parameters
xgb_clf.get_params()
There are a couple of things to note:
- The booster parameter is gbtree. This means the weak learners, or boosters, are decision trees in this model. gbtree is the default, and we will keep it this way.
- The objective function, or loss function, is defined as binary:logistic. The objective function quantifies how far off a prediction is from the actual results; we want to minimize it to have the smallest possible loss. binary:logistic is the default for classifiers and outputs the predicted probability of the positive class (in our case, that a booking is canceled).
- n_estimators is the number of gradient boosted trees we want in our model. It's equivalent to the number of boosting rounds. For our purposes, we don't want too many boosting rounds, or training will take too long. Let's lower it from 100 to 10.
- max_depth is the maximum tree depth allowed. Tree depth is the length of the longest path from the root node to a leaf node. Making this too high will give our model more variance, or more potential to overfit. Similar to n_estimators, the more we increase this, the longer training will take. Let's keep this at 3.
- For our eval_metric (evaluation metric for validation data), we will use error as defined by the XGBoost documentation:
Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
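That definition is easy to reproduce by hand; a small sketch with hypothetical predicted probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
proba = np.array([0.2, 0.7, 0.55, 0.4])
actual = np.array([0, 1, 0, 1])

# Probabilities above 0.5 count as positive instances
pred = (proba > 0.5).astype(int)   # [0, 1, 1, 0]

# error = #(wrong cases) / #(all cases)
error = (pred != actual).mean()
print(error)  # 2 wrong out of 4 -> 0.5
```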
# Set n_estimators to 10
xgb_clf.set_params(n_estimators=10)
# Set max_depth to 3
xgb_clf.set_params(max_depth=3)
# Set the evaluation metric to error
xgb_clf.set_params(eval_metric='error')
# Fit it to the training set
xgb_clf.fit(X_train, y_train)
# Predict the labels of the test set
preds = xgb_clf.predict(X_test)
# Confirm the updated parameters
xgb_clf.get_params()
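To turn preds into the baseline accuracy this section is after, compare them to y_test, e.g. with accuracy_score(y_test, preds). A self-contained illustration on toy labels (hypothetical values, not our real test set):

```python
from sklearn.metrics import accuracy_score

# Toy stand-ins for y_test and preds (hypothetical values)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = fraction of correct predictions = 1 - error rate
acc = accuracy_score(y_true, y_pred)
print(acc)  # 4 of 5 correct -> 0.8
```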