In this course, you'll learn about tree-based models for classification and regression.
Course Overview
- Chapter 1: Introduction to Classification And Regression Trees (CART).
- Chapter 2: Bias-variance trade-off and model ensembling.
- Chapter 3: Bagging and Random Forests.
- Chapter 4: Boosting with AdaBoost and Gradient Boosting.
- Chapter 5: Hyperparameter tuning.
Classification Tree
A classification tree learns a sequence of if-else questions about individual features in order to infer class labels. It can capture non-linear relationships between features and labels and does not require feature standardization.
Breast Cancer Dataset in 2D
We'll predict whether a tumor is malignant or benign using two features from the Wisconsin Breast Cancer dataset.
Decision Tree Diagram
A trained classification tree asks a sequence of if-else questions, each involving one feature and one split point, and traverses the corresponding branches until it reaches a prediction at a leaf. The maximum depth of the tree shown here is 2.
Classification Tree in scikit-learn
- Import DecisionTreeClassifier from sklearn.tree, train_test_split from sklearn.model_selection, and accuracy_score from sklearn.metrics.
- Split the data into 80% train and 20% test with train_test_split(), setting stratify to y (see the split sketched below).
- Instantiate a tree classifier dt with max_depth=2 and random_state=1.
- Fit dt with X_train and y_train, predict the test set labels, and print the test set accuracy using accuracy_score().
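The 80/20 stratified split from the list above is not shown in the code below; a minimal sketch, assuming X holds the two features and y the class labels:
# Import train_test_split
from sklearn.model_selection import train_test_split
# Split the data into 80% train and 20% test, keeping the class proportions with stratify=y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)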
Decision Regions
A classification model divides the feature space into decision regions separated by decision boundaries. Linear classifiers produce straight-line decision boundaries, while classification trees produce rectangular decision regions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
# Set seed for reproducibility
SEED = 1
# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)
# Fit dt to the training set
dt.fit(X_train, y_train)
# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])
# Import accuracy_score
from sklearn.metrics import accuracy_score
# Predict test set labels
y_pred = dt.predict(X_test)
# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))
# Import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
# Instantiate logreg
logreg = LogisticRegression(random_state=1)
# Fit logreg to the training set
logreg.fit(X_train, y_train)
# Define a list called clfs containing the two classifiers logreg and dt
clfs = [logreg, dt]
# Review the decision regions of the two classifiers (plot_labeled_decision_regions is a plotting helper provided separately; it is not part of scikit-learn)
plot_labeled_decision_regions(X_test, y_test, clfs)
Building Blocks of a Decision Tree
A decision tree is a data structure consisting of nodes, each of which involves either a question or a prediction. The root is the node at which the tree starts growing; it has no parent and gives rise to two child nodes. An internal node has a parent and also gives rise to two child nodes. A leaf has no children: it is where a prediction is made. The tree is trained to produce the purest leaves possible, i.e., leaves in which one class label is predominant.
Prediction
In the tree diagram, if an instance reaches a leaf containing 257 benign and 7 malignant training instances, the predicted label is 'benign', since that is the majority class in the leaf. The tree aims to produce the purest leaves possible by maximizing information gain at each split.
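To see these building blocks concretely, scikit-learn's export_text utility prints the learned sequence of questions and leaf predictions; a minimal sketch, assuming dt is the classifier fitted above:
# Import the export_text helper
from sklearn.tree import export_text
# Print the if-else structure of dt: internal nodes show questions, leaves show the predicted class
print(export_text(dt))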
Information Gain (IG)
Nodes are grown recursively. At each node, the tree asks a question involving one feature and one split point, chosen so as to maximize the information gain. The information gain is computed from the impurity of the parent node and of its two children, using an impurity criterion such as the Gini index or entropy.
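As an illustration of these definitions (a sketch, not scikit-learn's internal implementation), the impurity measures and the information gain of a candidate split can be written as small functions:
import numpy as np

def gini(labels):
    # Gini index: 1 - sum over classes of p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum over classes of p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, left_labels, right_labels, criterion=gini):
    # IG = impurity(parent) - weighted average impurity of the two children
    n = len(parent_labels)
    weighted_children = (len(left_labels) / n) * criterion(left_labels) \
                        + (len(right_labels) / n) * criterion(right_labels)
    return criterion(parent_labels) - weighted_children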
Classification Tree Learning
In an unconstrained tree, nodes are grown recursively: a node is split whenever doing so yields positive information gain, and when the information gain is zero the node is declared a leaf. Constraining the tree's depth (for example with max_depth) forces nodes at the maximum depth to be declared leaves as well.
Information Criterion in scikit-learn
For the breast cancer dataset, set the information criterion of dt to the Gini index by passing criterion='gini' to DecisionTreeClassifier. Fit dt to the training set, predict the test set labels, and compute the test set accuracy, which comes out to about 92%.
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=8, random_state=1)
# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score
# Use dt_entropy to predict test set labels
y_pred = dt_entropy.predict(X_test)
# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)
# Print accuracy_entropy
print(f'Accuracy achieved by using entropy: {accuracy_entropy:.3f}')
# Print accuracy_gini (computed in a previous step from a tree trained with criterion='gini')
print(f'Accuracy achieved by using the gini index: {accuracy_gini:.3f}')
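The variable accuracy_gini printed above comes from an earlier exercise step that is not shown; a minimal sketch of how it could be obtained, assuming a tree trained on the same split with the Gini criterion and otherwise identical hyperparameters:
# Instantiate dt_gini, set 'gini' as the information criterion
dt_gini = DecisionTreeClassifier(criterion='gini', max_depth=8, random_state=1)
# Fit dt_gini to the training set and predict the test set labels
dt_gini.fit(X_train, y_train)
y_pred_gini = dt_gini.predict(X_test)
# Evaluate accuracy_gini
accuracy_gini = accuracy_score(y_test, y_pred_gini)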
Training a Decision Tree for Regression
Auto-mpg Dataset
The automobile miles-per-gallon (mpg) dataset from the UCI Machine Learning Repository consists of 6 features and a continuous target variable labeled mpg. Our task is to predict the mpg consumption of a car given these features, focusing on the displacement feature (displ).
Regression Tree in scikit-learn
To train a decision tree for regression:
- Import DecisionTreeRegressor from sklearn.tree, train_test_split from sklearn.model_selection, and mean_squared_error as MSE from sklearn.metrics.
- Split the data into 80% train and 20% test using train_test_split().
- Instantiate a DecisionTreeRegressor with max_depth=4 and min_samples_leaf=0.1.
- Fit the model to the training set and predict the test set labels.
- Evaluate the root mean squared error (RMSE) by computing the mean squared error on the test set and taking its square root.
Information Criterion for Regression Trees
The impurity of a node in a regression tree is measured by the mean squared error (MSE) of the target values it contains. The tree therefore searches for splits that produce leaves in which the target values are as close as possible to the mean target value of that leaf.
Prediction
When a new instance reaches a leaf, its target variable 'y' is computed as the average of the target variables in that leaf.
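In symbols, following the description above, the impurity of a node and the prediction made at a leaf can be written as

$$ I(\text{node}) = \mathrm{MSE}(\text{node}) = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} \left( y^{(i)} - \hat{y}_{\text{node}} \right)^2, \qquad \hat{y}_{\text{node}} = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} y^{(i)}, $$

where $N_{\text{node}}$ is the number of training instances in the node; the value $\hat{y}_{\text{leaf}}$ of the leaf an instance ends up in is the tree's prediction for that instance.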
Linear Regression vs. Regression Tree
Regression trees are more flexible than linear models and can capture non-linear trends in the data. Where a linear regression model fails to capture such a trend, a regression tree can provide a better fit, though not a perfect one. Aggregating the predictions of multiple trees can improve the results further.
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
min_samples_leaf=0.13,
random_state=3)
# Fit dt to the training set
dt.fit(X_train, y_train)
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Compute y_pred
y_pred = dt.predict(X_test)
# Compute mse_dt
mse_dt = MSE(y_test, y_pred)
# Compute rmse_dt
rmse_dt = mse_dt**(1/2)
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))
# Predict test set labels with lr, a LinearRegression model fitted in an earlier step (not shown)
y_pred_lr = lr.predict(X_test)
# Compute mse_lr
mse_lr = MSE(y_test, y_pred_lr)
# Compute rmse_lr
rmse_lr = mse_lr**0.5
# Print rmse_lr
print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr))
# Print rmse_dt
print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt))
Generalization Error in Supervised Machine Learning
In supervised learning, the goal is to find a model $\hat{f}$ that best approximates the true function $f$ mapping features to labels, while discarding the noise present in the data. Two main pitfalls are overfitting, where the model fits the noise in the training set, and underfitting, where the model is not flexible enough to capture the true relationship.
Overfitting: The model memorizes the noise, leading to low training error but high test error.
Underfitting: The model is too simple, resulting in high errors on both the training and test sets.
Generalization Error: This error measures how well the model performs on unseen data and can be decomposed into bias, variance, and irreducible error, as written out below.
- Bias: Measures how much $\hat{f}$ differs from $f$ on average. High bias leads to underfitting.
- Variance: Measures how much $\hat{f}$ varies across different training sets. High variance leads to overfitting.
Model Complexity: Determines the model's flexibility. For example, increasing the maximum depth of a decision tree increases its complexity.
Bias-Variance Tradeoff: The optimal model complexity balances bias and variance to achieve the lowest generalization error. As model complexity increases, variance increases and bias decreases; as complexity decreases, the opposite happens. The goal is to find the complexity that strikes a balance between the two.
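The decomposition mentioned above can be written as

$$ \text{Generalization Error of } \hat{f} = \text{bias}^2 + \text{variance} + \text{irreducible error}, $$

where the irreducible error is the contribution of the noise, which no model can remove.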
Diagnosing Bias and Variance Problems
Estimating the Generalization Error:
- Split the data into training and test sets.
- Fit $\hat{f}$ to the training set and evaluate its error on the test set.
- The generalization error of $\hat{f}$ is approximated by its error on the test set.
Better Model Evaluation with Cross-Validation:
- Use cross-validation (CV) to obtain a more reliable estimate of $\hat{f}$'s performance.
- K-Fold CV: Split the training set into K folds, train on K-1 folds, and evaluate the error on the remaining fold. Repeat K times, each time holding out a different fold, and compute the mean of the K errors, as in the formula below.
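Written out, with $E_k$ denoting the error measured on the $k$-th held-out fold:

$$ \text{CV error} = \frac{E_1 + E_2 + \dots + E_K}{K}. $$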
Diagnose Variance Problems:
- If CV error > training error, $\hat{f}$ has high variance (overfitting). Remedy: decrease model complexity or gather more data.
Diagnose Bias Problems:
- If CV error ≈ training error but both are high, $\hat{f}$ has high bias (underfitting). Remedy: increase model complexity or gather more relevant features.
K-Fold CV in sklearn:
- Use cross_val_score() from sklearn.model_selection.
- Split the dataset into 70% train and 30% test using train_test_split().
- Instantiate a DecisionTreeRegressor() with max_depth=4 and min_samples_leaf=0.14.
- Call cross_val_score() with cv=10 and scoring='neg_mean_squared_error'.
- Compute the CV MSE and fit the model to the training set.
- Evaluate the train and test set errors to diagnose overfitting or underfitting (the remaining steps are sketched after the code block below).
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
# Set SEED for reproducibility
SEED = 1
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor
# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=SEED)
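Continuing the block above with the remaining steps from the list, a minimal sketch of the 10-fold CV evaluation and the variance diagnosis, assuming X_train and y_train come from the split above:
# Import cross_val_score and mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as MSE
# Compute the array of 10-fold CV MSE scores (the scores are negative MSEs, so negate them)
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error', n_jobs=-1)
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Fit dt to the training set and compute the training set RMSE
dt.fit(X_train, y_train)
y_pred_train = dt.predict(X_train)
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
print('Train RMSE: {:.2f}'.format(RMSE_train))
# A CV RMSE noticeably higher than the train RMSE indicates high variance (overfitting);
# CV and train RMSEs that are similar but both high indicate high bias (underfitting)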