XGBoost

Kirkwood's Notes

XGBoost is a very popular implementation of gradient boosting for supervised learning problems, and it can be applied to both classification and regression.

Some considerations:

  • Features can be numeric or categorical
  • Numeric features do not need to be scaled for the default tree booster, though scaling (e.g., Z-scoring) can matter for the linear booster
  • Categorical features must either be encoded (e.g., one-hot, as in the sketch below) or cast to the Pandas category dtype so XGBoost can handle them internally (see the DMatrix section)
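
A minimal sketch of one-hot encoding, using a made-up toy DataFrame rather than the diamonds data used later:

import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "carat": [0.3, 0.5, 0.7, 0.4]})

# One-hot encode the categorical column; each level becomes a 0/1 indicator column
df_encoded = pd.get_dummies(df, columns=["color"])
print(df_encoded)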

Why is XGBoost so Popular?

  • Speed and performance: training is parallelizable and can harness multiple cores and GPUs
  • Consistently strong results: it often outperforms other single-algorithm methods on tabular supervised learning problems

When Should XGBoost be Used?

  • With a large number of training samples (e.g., more than 1,000 observations and fewer than 100 features), or more generally whenever the number of observations is greater than the number of features
  • XGBoost is not a good fit for image recognition or computer vision problems

Building an XGBoost DMatrix

XGBoost gets some of its processing efficiency from a specialized data object, the DMatrix. Data must be wrapped in a DMatrix object before it can be passed to native-API methods such as xgb.train().

XGBoost can also handle categorical variables internally, rather than requiring the user to one-hot encode them. For this to work, the categorical input features must have the Pandas category dtype.

import numpy as np
import seaborn as sns

# Example feature and target arrays
diamonds = sns.load_dataset("diamonds")
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

#Extract text (non-numeric) feature names
cats = X.select_dtypes(exclude=np.number).columns.tolist()
print(cats)

#Cast each text feature to the 'category' dtype
for col in cats:
    X[col] = X[col].astype('category')
print(X.dtypes) #Confirm the category dtypes

DMatrix objects are then built as follows:

import xgboost as xgb
from sklearn.model_selection import train_test_split

#Train/test split (X keeps the category dtypes set above)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1973)

#Build DMatrix objects; enable_categorical lets XGBoost use the category columns natively
dtrain_reg__diamonds = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg__diamonds = xgb.DMatrix(X_test, y_test, enable_categorical=True)

Boosting & Base Learners

Boosting is not a specific machine learning algorithm but rather a concept that can be applied to a set of ML algorithms (i.e., a "meta-algorithm"). It is an ensemble meta-model (of which XGBoost is an example) used to convert many weak learners into a single strong learner. The resulting combination of learners allows non-linear modeling of spaces that linear models may struggle to capture.

  • Weak Learner: An ML algorithm that performs only slightly better than chance (e.g., a binary decision tree whose predictions are just over 50% accurate)

  • Boosting: Iteratively builds a set of weak models on subsets of the data, then applies a weight to each weak learner based on its performance on unseen data. The boosting meta-algorithm aggregates the weighted predictions to obtain a prediction that is much stronger than any individual one (see the sketch below).
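
As a rough illustration of the idea (not XGBoost itself), the following sketch hand-rolls gradient boosting on a made-up regression problem, using shallow scikit-learn trees as the weak learners; the data, tree depth, and learning rate are arbitrary choices for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data
rng = np.random.default_rng(1973)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1            # weight applied to each weak learner
prediction = np.zeros_like(y)  # current ensemble prediction
weak_learners = []

for _ in range(100):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add the weighted weak learner
    weak_learners.append(tree)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))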

Decision Trees

Decision trees are generally the base learner of the XGBoost ensemble algorithm. A decision tree makes a choice given some data: it is composed of a series of binary decisions that ultimately yield a prediction at the tree's leaves. Trees are constructed iteratively, one split at a time, choosing at each step the split that scores best on an information criterion. They are grown until they run out of information or, more commonly, until a stopping criterion is met (to reduce model variance).

The resulting boosted model is a weighted sum of decision trees. Although the combination is a weighted sum, the base learners themselves are non-linear, so the overall model is non-linear as well. This is an advantage of tree base learners over linear boosters.

CARTs

XGBoost uses a special type of decision tree called a CART (Classification And Regression Tree). Rather than the leaves holding a boolean prediction, each leaf contains a real-valued score, regardless of whether the tree is used for classification or regression. A threshold is then applied to the score when classification is required.

An example using the SciKit-learn API:

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Data
housing_df = pd.read_csv('datasets/ames_housing_trimmed_processed.csv') 
X, y = housing_df.loc[:, housing_df.columns != 'SalePrice'], housing_df.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1973)

#Model 
xg_reg = xgb.XGBRegressor(
    objective='reg:squarederror'
    , booster='gbtree'
    , n_estimators=100
    , random_state=1973)
xg_reg.fit(X_train, y_train)

#Results
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

Linear Base Learners

Linear base learners are a sum of linear terms. The resulting boosted model is a weighted sum of linear models, which is itself linear. Note that this ultimate linearity can be a disadvantage of this booster type, and a reason it is rarely used.

Linear base learners are selected with the "gblinear" booster param (as opposed to the default "gbtree"); the example below uses the XGBoost learning API (xgb.train()) rather than the scikit-learn wrapper.

The following is an example using the XGBoost learning API:

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Data
housing_df = pd.read_csv('datasets/ames_housing_trimmed_processed.csv') 
X, y = housing_df.loc[:, housing_df.columns != 'SalePrice'], housing_df.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1973)
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

#Model
params={
    "booster":"gblinear"
    , "objective":"reg:squarederror"
}
xgb_reg = xgb.train(params
                   , dtrain=DM_train
                   , num_boost_round=100)

#Results
preds = xgb_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

Objective Functions & Metrics

Objective functions are also known as loss functions. These functions quantify how far off a prediction is from the actual result; that is, they measure the difference between estimated and true values for a given collection of data. The goal in training an ML model is to minimize this loss across the data passed through the model, especially data unseen at training time. The smaller the loss, the more performant the model is said to be.

The loss function is specified in a dictionary as the 'objective' as follows:

params = {"objective": "reg:squarederror", "tree_method": 'gpu_hist', ...}

Metrics are used after training to evaluate overall performance (e.g., accuracy, recall, and precision for classification, or RMSE for regression). For example, the following sets up a classification problem, predicting a diamond's cut from its other attributes:

from sklearn.preprocessing import OrdinalEncoder

#Classification setup: predict 'cut' from the remaining diamond features
X, y = diamonds.drop("cut", axis=1), diamonds[['cut']]
y_encoded = OrdinalEncoder().fit_transform(y).ravel() #'cut' labels as integer codes

Regression with XGBoost

Regression machine learning problems predict real values rather than classes. Common performance metrics in the regression context are Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE is disproportionately sensitive to large errors, while MAE lacks some nice mathematical properties (the absolute value is not differentiable at zero and its gradient is constant in magnitude, which makes it awkward for gradient-based optimization), so it is less commonly used.
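
A quick comparison illustrating RMSE's sensitivity to a single large error (made-up residuals):

import numpy as np

# Two residual vectors with the same total absolute error
residuals_even  = np.array([5.0, 5.0, 5.0, 5.0])
residuals_spiky = np.array([0.0, 0.0, 0.0, 20.0])

for name, r in [("even", residuals_even), ("spiky", residuals_spiky)]:
    mae = np.mean(np.abs(r))
    rmse = np.sqrt(np.mean(r ** 2))
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.1f}")

# even:  MAE=5.0, RMSE=5.0
# spiky: MAE=5.0, RMSE=10.0  -> RMSE penalizes the single large miss much more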

The regression objective function most commonly used is reg:squarederror.

Cross-Validation in XGBoost

Using a single held-out test set (i.e., a validation set) to validate a machine learning model is problematic, because iterative tuning lets knowledge of that specific test set leak into modeling decisions. Hyperparameter optimization ends up tailored to that particular validation set, and the result may not generalize as well to new test sets or future observations.

Cross-validation generates many non-overlapping train/test splits, and the average test-set performance across all splits is reported. The following is an example of cross-validation in XGBoost:

import xgboost as xgb
import pandas as pd

housing_df = pd.read_csv('datasets/ames_housing_trimmed_processed.csv')
housing_dmat = xgb.DMatrix(data=housing_df.loc[:, housing_df.columns != 'SalePrice'], label=housing_df.SalePrice)

params = {"objective": "reg:squarederror", "max_depth": 4} #Defines the type of XGB model & hyperparams
cv_results = xgb.cv(
    dtrain=housing_dmat
    , params=params
    , nfold=4
    , num_boost_round=1000
    , metrics="rmse" #"rmse" for regression ("error" is a classification metric)
    , as_pandas=True
)

#Results
print("RMSE: %f" % cv_results['test-rmse-mean'].iloc[-1])

Visualization in XGBoost

XGBoost provides several ways to visualize the results of a trained model. Some common visualization techniques include:

  • Feature Importance Plot
  • Partial Dependence Plot
  • Tree Visualization

Let's explore each of these visualization techniques in detail.

Tree Visualization

xgb.plot_tree() builds a tree graph for the single boosted tree selected by its num_trees argument, visualizing the boolean splits and the final leaf values.
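
A minimal sketch, assuming a tree-based model such as xg_reg from the scikit-learn example above is already trained (plot_tree additionally requires the graphviz package):

import matplotlib.pyplot as plt
import xgboost as xgb

#Draw the first boosted tree (index 0)
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

#Feature importance plot (number of splits per feature by default)
xgb.plot_importance(xg_reg)
plt.show()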