
Random Forest Regression

Random forest is an ensemble learning method that builds many decision trees on randomly selected data samples and aggregates their predictions: majority voting for classification and averaging for regression. This template trains and tunes a random forest model for a regression problem (i.e., predicting continuous values). It also evaluates and visualizes feature importance from the resulting model. If you would like to learn more about random forests, take a look at DataCamp's Machine Learning with Tree-Based Models in Python course.

To swap in your dataset in this template, the following is required:

  • There are at least two feature columns and a column with a continuous target variable you would like to predict.
  • The features have been cleaned and preprocessed, including categorical encoding.
  • There are no NaN/NA values. You can use this template to impute missing values if needed (a minimal preprocessing sketch follows this list).
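
If your data doesn't meet these requirements yet, the sketch below shows one minimal way to encode a categorical column and handle missing values. It uses a tiny, made-up DataFrame purely for illustration; substitute your own columns and your preferred imputation strategy.

# Minimal preprocessing sketch (the DataFrame and column names are illustrative only)
import pandas as pd

df_example = pd.DataFrame({
    "Temperature": [10.2, None, 15.1],
    "Seasons": ["Winter", "Spring", "Spring"],
    "Rented Bike Count": [120, 340, 410],
})

# One-hot encode categorical columns
df_example = pd.get_dummies(df_example, columns=["Seasons"], drop_first=True)

# Impute missing numeric values (here with the column mean) or drop those rows instead
df_example["Temperature"] = df_example["Temperature"].fillna(df_example["Temperature"].mean())

# Confirm no NaN/NA values remain
print(df_example.isnull().sum())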

The placeholder dataset in this template consists of bike sharing demand data with details such as date and weather. Each row represents an hour of a day and how many bikes were rented (the target variable). You can find more information on this dataset's source and data dictionary here.

1. Loading packages and data

# Load packages
import numpy as np
import pandas as pd
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV


# Load the data and replace with your CSV file path
df = pd.read_csv("data/SeoulBikeData.csv")
# Check if there are any null values
print(df.isnull().sum())
# Check columns to make sure you have features and a target variable
df.info()

2. Splitting the data

To split the data into training and test sets, we'll use scikit-learn's train_test_split() function.

# Split the data into two DataFrames: X (features) and y (target variable)
X = df.iloc[:, 1:]  # Specify at least one column as a feature; make sure the target column is excluded
y = df["Rented Bike Count"]  # Specify one column as the target variable


# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=123)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

3. Building a random forest regressor

The following code builds a scikit-learn RandomForestRegressor using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Machine Learning with Tree-Based Models in Python course and scikit-learn's documentation.

# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "n_estimators": 100,  # Number of trees in the forest
    "max_depth": 10,  # Max depth of the tree
    "min_samples_split": 4,  # Min number of samples required to split a node
    "min_samples_leaf": 2,  # Min number of samples required at a leaf node
    "ccp_alpha": 0,  # Cost complexity parameter for pruning
    "random_state": 123,
}


# Create a RandomForestRegressor object with the parameters above
rf = RandomForestRegressor(**params)


# Train the random forest on the train set
rf = rf.fit(X_train, y_train)


# Predict the outcomes on the test set
y_pred = rf.predict(X_test)

To evaluate this regressor, there are several error metrics we can use. The code below prints the mean absolute error, mean squared error, and root mean squared error. To learn more about how these are calculated and about the other error metrics available, take a look at scikit-learn's documentation. In the end, you'll have to decide which error metric is best suited for your problem.

# Evaluate performance with error metrics
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

4. Evaluating feature importance

In RandomForestRegressor, there is an attribute called feature_importances_ which holds the impurity-based importance of each feature (the higher this value, the more important the feature). For regression trees, this importance reflects how much each feature reduces the variance (mean squared error) across the splits in which it is used. We can list and plot feature_importances_ to see which features influence predictions most.

# Create a sorted Series of feature importances
importances_sorted = pd.Series(
    data=rf.feature_importances_, index=X_train.columns
).sort_values()


# Plot a horizontal barplot of importances_sorted
importances_sorted.plot(kind="barh")
plt.title("Features Importances")
plt.show()
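
5. Tuning the random forest

The parameters in step 3 were set by hand, and the introduction mentions tuning; RandomizedSearchCV, imported in step 1, is one way to search over them. The sketch below is illustrative rather than prescriptive: the parameter ranges, the number of sampled combinations, and the scoring metric are assumptions you should adapt to your problem.

# A minimal tuning sketch using RandomizedSearchCV (parameter ranges are illustrative)
param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=123),
    param_distributions=param_distributions,
    n_iter=10,  # Number of parameter combinations to sample
    scoring="neg_root_mean_squared_error",
    cv=5,  # 5-fold cross-validation on the training set
    random_state=123,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)

# Evaluate the best estimator on the held-out test set
y_pred_tuned = search.best_estimator_.predict(X_test)
print("Tuned Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred_tuned)))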