Understanding Random Forests Classifiers in Python Tutorial

Learn about Random Forests and build your own model in Python, for both classification and regression.
May 2018  · 14 min read

Random forests is a supervised learning algorithm. It can be used for both classification and regression. It is also one of the more flexible and easy-to-use algorithms. A forest is made up of trees, and it is said that the more trees it has, the more robust a forest is. Random forests creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.

Random forests has a variety of applications, such as recommendation engines, image classification and feature selection. It can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases. It lies at the base of the Boruta algorithm, which selects important features in a dataset.

If you are not yet familiar with Tree-Based Models in Machine Learning, you should take a look at our R course on the subject.

The Random Forests Algorithm

Let's understand the algorithm in layman's terms. Suppose you want to go on a trip and you would like to travel to a place which you will enjoy.

So what do you do to find a place that you will like? You can search online, read reviews on travel blogs and portals, or you can also ask your friends.

Let's suppose you have decided to ask your friends, and talked with them about their past travel experience to various places. You will get some recommendations from every friend. Now you have to make a list of those recommended places. Then, you ask them to vote (or select one best place for the trip) from the list of recommended places you made. The place with the highest number of votes will be your final choice for the trip.

In the above decision process, there are two parts. First, asking your friends about their individual travel experience and getting one recommendation out of multiple places they have visited. This part is like using the decision tree algorithm. Here, each friend makes a selection of the places he or she has visited so far.

The second part, after collecting all the recommendations, is the voting procedure for selecting the best place in the list of recommendations. This whole process of getting recommendations from friends and voting on them to find the best place is known as the random forests algorithm.

Technically, it is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on randomly sampled subsets of the dataset. This collection of decision tree classifiers is also known as the forest. The individual decision trees are generated using an attribute selection measure such as information gain, gain ratio, or Gini index for each attribute. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In the case of regression, the average of all the tree outputs is taken as the final result. It is often simpler to use than other non-linear classification algorithms while remaining competitive in accuracy.
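To make the attribute selection measures mentioned above more concrete, here is a minimal sketch that computes Gini impurity, entropy, and information gain for a candidate split. The function names and the toy labels are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy example (hypothetical labels): a balanced node split into two purer children.
parent = ["a"] * 5 + ["b"] * 5
left = ["a"] * 4 + ["b"]
right = ["a"] + ["b"] * 4

print("Gini impurity of parent:", gini_impurity(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", round(information_gain(parent, left, right), 3))
```

A tree-growing algorithm evaluates many candidate splits like this one and keeps the split with the lowest impurity (or highest gain).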

How does the Algorithm Work?

It works in four steps:

  1. Select random samples from a given dataset.
  2. Construct a decision tree for each sample and get a prediction result from each decision tree.
  3. Perform a vote for each predicted result.
  4. Select the prediction result with the most votes as the final prediction.
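The four steps above can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn implements it internally: the number of trees, the seeds, and the variable names are assumptions chosen for the example, and scikit-learn's RandomForestClassifier averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice
trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (rows chosen with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one tree per sample; max_features="sqrt" makes each split
    # consider only a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: collect one vote per tree for every test row.
all_votes = np.array([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)

# Step 4: the class with the most votes becomes the final prediction.
n_classes = len(np.unique(y))
forest_pred = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in all_votes.T]
)

accuracy = (forest_pred == y_test).mean()
print("Hand-rolled forest accuracy:", accuracy)
```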

Advantages


  • Random forests is considered a highly accurate and robust method because of the number of decision trees participating in the process.
  • It is much less prone to overfitting than a single decision tree. The main reason is that it averages the predictions of many trees, which reduces variance and lets the errors of individual trees partly cancel out.
  • The algorithm can be used in both classification and regression problems.
  • Random forests can also handle missing values. There are two ways to handle these: using median values to replace continuous variables, and computing the proximity-weighted average of missing values.
  • You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.


Disadvantages

  • Random forests is slower at generating predictions than a single tree because it has multiple decision trees. Whenever it makes a prediction, all the trees in the forest have to make a prediction for the same input and then vote on it. This whole process is time-consuming.
  • The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.

Finding Important Features

Random forests also offers a good feature selection indicator. Scikit-learn provides an extra attribute on the fitted model, feature_importances_, which shows the relative importance or contribution of each feature in the prediction. It automatically computes the relevance score of each feature in the training phase. Then it scales the scores so that they sum to 1.

This score will help you choose the most important features and drop the least important ones for model building.

Random forests uses Gini importance, also called mean decrease in impurity (MDI), to calculate the importance of each feature. Gini importance is the total decrease in node impurity contributed by splits on a feature, weighted by the fraction of samples reaching each node and averaged over all trees in the forest. The larger the decrease, the more significant the variable is. A related but different measure, permutation importance (mean decrease in accuracy), records how much the model's accuracy drops when a feature's values are randomly shuffled. Here, the mean decrease is a significant parameter for variable selection. The Gini index can describe the overall explanatory power of the variables.
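As a rough illustration of the two kinds of importance score, the sketch below compares the impurity-based scores stored in feature_importances_ with permutation importance computed on a held-out set. The split, seeds, and n_repeats value are assumptions made for the example; exact numbers will vary with them.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Impurity-based (MDI) scores, computed during training and scaled to sum to 1.
mdi = clf.feature_importances_

# Permutation importance: drop in test accuracy when one feature is shuffled.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

for name, m, p in zip(iris.feature_names, mdi, perm.importances_mean):
    print(f"{name:20s} MDI={m:.3f}  permutation={p:.3f}")
```

The two rankings often agree on a small dataset like iris, but they measure different things, so it can be worth checking both before dropping a feature.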

Random Forests vs Decision Trees

  • Random forests is a set of multiple decision trees.
  • Deep decision trees may suffer from overfitting, but random forests reduces overfitting by building trees on random subsets of the data and features and combining their predictions.
  • Decision trees are computationally faster.
  • Random forests is difficult to interpret, while a decision tree is easily interpretable and can be converted to rules.
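The overfitting point in the comparison above can be checked with a small experiment. This is a sketch on synthetic, deliberately noisy data (the make_classification settings and seeds are assumptions for illustration): a fully grown tree typically fits the training set perfectly, and the gap between its training and test accuracy shows how much it has overfit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of the labels flipped to simulate noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
forest_train, forest_test = forest.score(X_train, y_train), forest.score(X_test, y_test)

print(f"Decision tree : train={tree_train:.3f}  test={tree_test:.3f}")
print(f"Random forest : train={forest_train:.3f}  test={forest_test:.3f}")
```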

Building a Classifier using Scikit-learn

You will be building a model on the iris flower dataset, which is a very famous classification set. It comprises the sepal length, sepal width, petal length, petal width, and type of flowers. There are three species or classes: setosa, versicolor, and virginica. You will build a model to classify the type of flower. The dataset is available in the scikit-learn library or you can download it from the UCI Machine Learning Repository.

Start by importing the datasets library from scikit-learn, and load the iris dataset with load_iris().

#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
iris = datasets.load_iris()

You can print the target and feature names, to make sure you have the right dataset, as such:

# print the label species(setosa, versicolor,virginica)
print(iris.target_names)

# print the names of the four features
print(iris.feature_names)
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

It's a good idea to always explore your data a bit, so you know what you're working with. Here, you can see the first five rows of the dataset are printed, as well as the target variable for the whole dataset.

# print the iris data (top 5 records)
print(iris.data[0:5])

# print the iris labels (0:setosa, 1:versicolor, 2:virginica)
print(iris.target)
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Here, you can create a DataFrame of the iris dataset the following way.

# Creating a DataFrame of given iris dataset.
import pandas as pd
data=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})
data.head()
   sepal length  sepal width  petal length  petal width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0

First, you separate the columns into dependent and independent variables (or features and labels). Then you split those variables into a training and test set.

# Import train_test_split function
from sklearn.model_selection import train_test_split

X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]  # Features
y=data['species']  # Labels

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

After splitting, you will train the model on the training set and perform predictions on the test set.

#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

After training, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
('Accuracy:', 0.93333333333333335)

You can also make a prediction for a single item, for example:

  • sepal length = 3
  • sepal width = 5
  • petal length = 4
  • petal width = 2

Now you can predict which type of flower it is.

clf.predict([[3, 5, 4, 2]])
array([2])

Here, 2 indicates the flower type Virginica.
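Besides the predicted class, it can be useful to see how strongly the forest favors it. The sketch below uses predict_proba on the same single item; it refits a classifier on the full iris data with a fixed seed (an assumption made so the example is self-contained), so the probabilities will not match the model trained above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# sepal length, sepal width, petal length, petal width
sample = [[3, 5, 4, 2]]

proba = clf.predict_proba(sample)[0]   # one probability per class
predicted = clf.predict(sample)[0]     # class with the highest probability

for name, p in zip(iris.target_names, proba):
    print(f"{name:12s} {p:.2f}")
print("Predicted species:", iris.target_names[predicted])
```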

Finding Important Features in Scikit-learn

Here, you are finding important features or selecting features in the iris dataset. In scikit-learn, you can perform this task in the following steps:

  • First, you need to create a random forests model.
  • Second, use the feature_importances_ attribute to see feature importance scores.
  • Third, visualize these scores using the seaborn library.

from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
feature_imp
petal width (cm)     0.458607
petal length (cm)    0.413859
sepal length (cm)    0.103600
sepal width (cm)     0.023933
dtype: float64

You can also visualize the feature importance. Visualizations are easy to understand and interpretable.

For visualization, you can use a combination of matplotlib and seaborn. Seaborn is built on top of matplotlib, so it offers a number of customized themes and additional plot types while still working with matplotlib functions. Both libraries are useful for good visualizations.

import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic; omit this line in a plain Python script
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

Generating the Model on Selected Features

Here, you can remove the "sepal width" feature because it has very low importance, and select the 3 remaining features.

# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into features and labels
X=data[['petal length', 'petal width','sepal length']]  # Removed feature "sepal width"
y=data['species']
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70, random_state=5) # 30% training and 70% test

After splitting, you will generate a model on the selected training set features, perform predictions on the selected test set features, and compare actual and predicted values.

from sklearn.ensemble import RandomForestClassifier

#Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)

# prediction on test set
y_pred=clf.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
('Accuracy:', 0.95238095238095233)

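You can see that after removing the least important feature (sepal width), the accuracy is slightly higher than before. Removing a weak feature can help by discarding misleading data and noise, but treat this comparison with caution: the two runs use different random splits and different test-set sizes (test_size=0.3 earlier versus test_size=0.70 here), and on a dataset of 150 rows a difference of this size may simply reflect the split. Fewer features also reduce the training time.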
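One way to compare the two feature sets on more equal footing is cross-validation, which scores both models on the same folds. The sketch below is a minimal example under assumed settings (5 stratified folds, fixed seeds); it works on the raw iris arrays, where column 1 is sepal width.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
X_all = iris.data                    # all four features
X_reduced = iris.data[:, [0, 2, 3]]  # sepal width (column 1) removed

# The same folds are used for both feature sets so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

scores_all = cross_val_score(model, X_all, iris.target, cv=cv)
scores_reduced = cross_val_score(model, X_reduced, iris.target, cv=cv)

print(f"All features     : {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"Without sepal wid: {scores_reduced.mean():.3f} +/- {scores_reduced.std():.3f}")
```

If the two means differ by less than the fold-to-fold spread, the data do not clearly favor either feature set.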


Congratulations, you have made it to the end of this tutorial!

In this tutorial, you have learned what random forests is, how it works, how to find important features, how it compares with decision trees, and its advantages and disadvantages. You have also learned model building, evaluation, and finding important features in scikit-learn.

If you would like to learn more about machine learning, I recommend you take a look at our Supervised Learning in R: Classification course.

Random Forest FAQs

What is random forest?

A supervised learning algorithm that combines in a single model the outputs of many decision trees on various subsets of a dataset. The resulting model represents the average outcome of all the decision trees, which improves the accuracy of predictions. Random forest can be used both for regression and classification problems.

What is random forest good for?

It is especially good for classification and regression tasks on datasets with many entries and features, possibly with missing values, when we need a highly accurate result while limiting overfitting. Random forest also provides the relative feature importance, which makes it possible to select the most relevant features.

Is random forest interpretable?

Yes, to a degree. It is much less interpretable than a single decision tree, but more interpretable than black-box models such as neural networks. It is still possible to display any tree of a random forest (see the question Is it possible to display individual trees of a random forest?).

Why is random forest called random?

Because each decision tree in a random forest is trained on a randomly selected sample from the training set and, possibly, a random set of features. The latter is the default for RandomForestClassifier in the scikit-learn Python library, which considers a random subset of the features at each split.

What is the output of random forest?

For classification tasks, it is the class (or the probability of that class) determined by the majority of decision trees; for regression, it is the mean prediction of all the decision trees.
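The averaging described in this answer can be verified directly in scikit-learn, where the fitted trees are exposed through the estimators_ attribute. The sketch below uses a synthetic regression dataset and small forests (both assumptions for illustration) and checks that the forest output matches the mean of its trees.

```python
import numpy as np
from sklearn.datasets import load_iris, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regression: the forest prediction is the mean of the tree predictions.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_reg, y_reg)
tree_mean = np.mean([tree.predict(X_reg[:5]) for tree in reg.estimators_], axis=0)
print("Regression matches tree mean:", np.allclose(tree_mean, reg.predict(X_reg[:5])))

# Classification: scikit-learn averages the trees' class probabilities
# and predicts the class with the highest average.
X_clf, y_clf = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_clf, y_clf)
tree_proba = np.mean([tree.predict_proba(X_clf[:5]) for tree in clf.estimators_], axis=0)
print("Probabilities match tree mean:", np.allclose(tree_proba, clf.predict_proba(X_clf[:5])))
```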

How to define the optimal number of trees in random forest?

A good approach is to create a random forest with a large number of estimators (e.g., 800-1000) and select an optimal subset of trees from it. Usually, the more entries in the training set, the more trees a random forest has to include.
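A different and commonly used way to get a feel for how many trees are enough is to watch the out-of-bag (OOB) score as the forest grows. This is a swapped-in technique rather than the subset selection described above; the list of forest sizes and the seed are assumptions for the example. With warm_start=True, each call to fit adds trees to the existing forest instead of retraining from scratch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each row using only the trees that did not see it.
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_by_size = {}
for n in [25, 50, 100, 200]:
    clf.set_params(n_estimators=n)
    clf.fit(X, y)  # adds trees up to n, keeping the ones already trained
    oob_by_size[n] = clf.oob_score_

for n, score in oob_by_size.items():
    print(f"{n:4d} trees  OOB accuracy = {score:.3f}")
```

Once the OOB score stops improving, adding more trees mostly costs training and prediction time.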

How does random forest algorithm work?

Random forest selects a random sample from the training set, creates a decision tree for it and gets a prediction; it repeats this operation for the assigned number of the trees, performs a vote for each prediction, and takes the result with the majority of votes (in case of classification) or the average (in case of regression).

What are some real-world applications of random forest?

Customer segmentation, credit card fraudulent activity detection, stock market prediction, disease prediction, price optimization, sentiment analysis, and product recommendation.

Is it possible to display individual trees of a random forest?

Yes. In Python, you can use the function plot_tree of sklearn.tree and pass as an argument the index of the necessary decision tree (e.g., clf.estimators_[0] to display the first tree of a clf random forest).
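As a text-only alternative to plotting, a single tree from the forest can also be printed as rules with export_text from sklearn.tree. This is a small sketch; the shallow max_depth and the forest size are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()
clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Each fitted tree is stored in clf.estimators_; index 0 is the first tree.
first_tree = clf.estimators_[0]
rules = export_text(first_tree, feature_names=list(iris.feature_names))
print(rules)
```

The same first_tree object can be passed to plot_tree when a graphical view is preferred.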

Is a random forest with one estimator just a decision tree?

Yes, although in this case it makes little sense to use the random forest algorithm. You can remove most of the randomness and closely reproduce the output of a one-tree random forest using the decision tree algorithm. In the scikit-learn library of Python, RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False, random_state=1) should behave essentially like DecisionTreeClassifier(random_state=1). The forest passes a seed derived from its own random_state to the tree, so the two can still differ where several splits are equally good.
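The comparison in this answer can be tried out with a short sketch. The split and seed below are assumptions for illustration; the code reports how often the two models agree on held-out data instead of assuming they are identical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One tree, all features at every split, no bootstrap sampling.
forest = RandomForestClassifier(
    n_estimators=1, max_features=None, bootstrap=False, random_state=1
).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Fraction of test rows on which the two models give the same prediction.
agreement = (forest.predict(X_test) == tree.predict(X_test)).mean()
print("Trees in forest:", len(forest.estimators_))
print("Prediction agreement on test set:", agreement)
```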
