Ensemble Learning in Python

In this tutorial, you'll learn what ensemble learning is and how it improves the performance of a machine learning model.

The field of machine learning keeps improving with time, and predictive models form its core. The more accurate a model is, the better it solves the problem at hand. In this post, you are going to learn about Ensemble learning, a potent technique for improving the performance of your machine learning model.

In this post you will cover:

  • What is Ensemble learning?
  • How does it improve the performance of a machine learning model?
  • Different Ensemble learning methods
  • Pitfalls of Ensembles
  • A Pythonic implementation of different Ensemble learning methods with a real test dataset
  • Further studies on Ensemble learning

So, let's get started.

What is Ensemble learning?

In the world of Statistics and Machine Learning, Ensemble learning techniques attempt to improve the performance of predictive models by improving their accuracy. Ensemble learning is a process in which multiple machine learning models (such as classifiers) are strategically constructed and combined to solve a particular problem.

Let's take a real example to build the intuition.

Suppose you want to invest in a company XYZ, but you are not sure about its performance. So you look for advice on whether the stock price will increase by more than 6% per annum. You decide to approach various experts with diverse domain experience:

  • Employee of Company XYZ: This person knows how the company works internally and has insider information about the firm. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving, and what impact this evolution will have on Company XYZ’s product. In the past, he has been right 70% of the time.

  • Financial Advisor of Company XYZ: This person has a broader perspective on how the company’s strategy will fare in this competitive environment. However, he lacks a view on how the company’s internal policies are faring. In the past, he has been right 75% of the time.

  • Stock Market Trader: This person has observed the company’s stock price over the past 3 years. He knows the seasonality trends and how the overall market is performing. He has also developed a keen intuition for how stocks might vary over time. In the past, he has been right 70% of the time.

  • Employee of a competitor: This person knows how the competitor firms work internally and is aware of certain changes that are yet to be introduced. He lacks insight into the company in focus and into the external factors that relate the competitor’s growth to that of the company in question. In the past, he has been right 60% of the time.

  • Market Research team in the same segment: This team analyzes customer preference for company XYZ’s product over others and how this is changing with time. Because the team deals with the customer side, it is unaware of the changes company XYZ will make to align with its own goals. In the past, the team has been right 75% of the time.

  • Social Media Expert: This person can help us understand how company XYZ has positioned its products in the market and how customer sentiment towards the company is changing over time. He is unaware of any details beyond digital marketing. In the past, he has been right 65% of the time.

    Given the broad spectrum of access you have, you can probably combine all the information and make an informed decision.

    In a scenario where all 6 experts/teams agree that it’s a good decision (assuming all the predictions are independent of each other), the probability that every one of them is wrong is 30% × 25% × 30% × 40% × 25% × 35% = 0.07875%, which gives a combined accuracy rate of 1 - 0.0007875 = 99.92125%.

    The assumption that all the predictions are completely independent is rather extreme, as they are expected to be correlated. However, you can see how combining various forecasts can make you far more confident.
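The arithmetic behind the expert example can be checked in a few lines. This is a minimal sketch that assumes, as the example does, that the six advisers err independently; the variable names are illustrative only.

```python
from math import prod

# Historical hit rates of the six advisers from the example
accuracies = [0.70, 0.75, 0.70, 0.60, 0.75, 0.65]

# Probability that every adviser is wrong at once, assuming independence
p_all_wrong = prod(1 - a for a in accuracies)

# Complement: the chance that they are not all wrong together
p_combined = 1 - p_all_wrong

print(f"all wrong: {p_all_wrong:.7f}")
print(f"combined:  {p_combined:.5%}")
```

Note that the product of the error rates is a fraction (about 0.00079), not a percentage, which is why the combined figure ends up so close to 100%.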

    Well, Ensemble learning is no different.

    Ensembling is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. In the above example, the way we combine all the predictions is what is termed Ensemble learning.

    Moreover, Ensemble-based models can be used in both scenarios, i.e., when the data is of large volume and when there is too little data.

    Let’s now understand how you actually get a different set of machine learning models. Models can differ from each other for a variety of reasons:

    • There can be a difference in the population of data.
    • There can be a different modeling technique used.
    • There can be a different hypothesis.


    Imagine that you are playing Trivial Pursuit. When you play alone, there might be some topics you are good at, and some that you know next to nothing about. If you want to maximize your Trivial Pursuit score, you need to build a team that covers all topics. This is the basic idea of an ensemble: combining predictions from several models averages out idiosyncratic errors and yields better overall predictions.

    The following picture shows an example schematic of an ensemble.

    In the picture above, an input array X is fed through two preprocessing pipelines and then to a set of base learners f(i). The ensemble combines all base learner predictions into a final prediction array P.

    Now, the important question is how to combine predictions. In the Trivial Pursuit example, it is easy to imagine that team members might make their case and a majority vote decides which answer to pick. Machine learning is remarkably similar in classification problems: taking the most common class label prediction is equivalent to a majority voting rule. But there are many other ways to combine predictions, and more generally you can use a model to learn how to combine predictions best.
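The majority voting rule described here can be sketched with the standard library alone. The predictions below are made up for illustration, and `majority_vote` is a hypothetical helper, not a library function.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base learners' predictions."""
    # most_common(1) returns [(label, count)]; ties go to the label seen first
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions: one inner list per base learner, one entry per sample
base_predictions = [
    ["benign", "malignant", "benign"],     # learner 1
    ["benign", "benign", "benign"],        # learner 2
    ["malignant", "malignant", "benign"],  # learner 3
]

# zip(*...) regroups the predictions sample by sample
ensemble = [majority_vote(votes) for votes in zip(*base_predictions)]
print(ensemble)
```

Each learner is wrong on a different sample here, yet the vote recovers the label that two out of three agree on.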

    The following diagram presents a basic Ensemble structure:

    Here, data is fed to a set of models, and a meta-learner combines the model predictions.

    Model error and reducing this error with Ensembles:

    The error emerging from any machine learning model can be broken down mathematically into three components:

    Error = Bias² + Variance + Irreducible error
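For squared-error loss, this decomposition can be written out explicitly, and the bias term enters squared. The notation below is an assumption made for this sketch: the target is y = f(x) + ε with noise of mean 0 and variance σ², f̂ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The last term does not depend on the model at all, which is why no ensemble can reduce it.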


    Why is this important in the current context? To understand what goes on behind an ensemble model, you need first to know what causes an error in the model. You will briefly get introduced to these errors.

    Bias error quantifies how much, on average, the predicted values differ from the actual value. A high bias error means we have an under-performing model that keeps missing essential trends.

    Variance, on the other hand, quantifies how much the predictions made for the same observation differ from each other. A high-variance model will over-fit your training population and perform poorly on any observation beyond training. The following diagram will give you more clarity (assume that the red spot is the real value and the blue dots are predictions):


    Typically, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point. As you continue to make your model more complex, you end up over-fitting it, and the model starts to suffer from high variance.

    Now that you are familiar with the basics of Ensemble learning, let's look at different Ensemble learning techniques:

    Different types of Ensemble learning methods:

    Although there are several types of Ensemble learning methods, the following three are the most-used ones in the industry.

    Bagging based Ensemble learning:

    Bagging, also known as Bootstrap Aggregation, is one of the Ensemble construction techniques. The bootstrap is the foundation of the Bagging technique: it is a sampling technique in which we select “n” observations out of a population of “n” observations, with replacement. The selection is entirely random, i.e., every observation in the original population is equally likely to be selected at each draw, so the same observation can appear more than once in a sample. After the bootstrapped samples are formed, a separate model is trained on each of them. In real experiments, the bootstrapped samples are drawn from the training set, and the sub-models are tested using the testing set. The final output prediction is obtained by combining the predictions of all the sub-models.
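To make the bootstrap step concrete, here is a minimal sketch using only the standard library. The population is a stand-in for the row indices of a training set, and `bootstrap_sample` is a hypothetical helper written for this illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(7)          # fixed seed so the sketch is repeatable
population = list(range(1000))  # stand-in for the row indices of a training set

# Each sample would train one sub-model of the bagged ensemble
samples = [bootstrap_sample(population, rng) for _ in range(5)]

for i, sample in enumerate(samples):
    unique_share = len(set(sample)) / len(population)
    print(f"sample {i}: {len(sample)} rows, {unique_share:.1%} distinct")
```

Because rows are drawn with replacement, each sample repeats some rows and leaves others out (roughly 63% of the rows are distinct for large n), which is what makes the sub-models differ from one another.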

    The following infographic gives a brief idea of Bagging:



    Boosting-based Ensemble learning:

    Boosting is a form of sequential learning technique. The algorithm works by training a model on the entire training set; subsequent models are then constructed by fitting the residual errors of the models before them. In this way, Boosting attempts to give higher weight to those observations that were poorly estimated by the previous model. Once the sequence of models is created, the predictions made by the models are weighted by their accuracy scores, and the results are combined to create a final estimate. Algorithms that are typically used for Boosting are XGBoost (Extreme Gradient Boosting), GBM (Gradient Boosting Machine), AdaBoost (Adaptive Boosting), etc.
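The idea of fitting each new model to the residual errors of the previous ones can be sketched in plain Python. This is a simplified, gradient-boosting-style illustration for squared error on made-up one-dimensional data, not the exact procedure of AdaBoost or XGBoost; `fit_stump`, the data, and the learning rate are all assumptions of the sketch.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression data (made up for illustration)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 3.1, 3.0, 4.8, 5.1, 6.0]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)  # round 0: predict the mean everywhere
errors = [mse(ys, preds)]

for _ in range(10):
    residuals = [y - p for y, p in zip(ys, preds)]  # what the ensemble still gets wrong
    stump = fit_stump(xs, residuals)                # next weak learner targets those errors
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    errors.append(mse(ys, preds))

print([round(e, 3) for e in errors])
```

The training error shrinks round after round because every new stump is aimed at the observations the current ensemble predicts worst.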

    Voting based Ensemble learning:

    Voting is one of the most straightforward Ensemble learning techniques, in which predictions from multiple models are combined. The method starts with creating two or more separate models on the same dataset. A Voting-based Ensemble model then wraps these models and aggregates their predictions. Once constructed, it can be used to make predictions on new data. The predictions made by the sub-models can be assigned weights. Stacked aggregation is a technique that can be used to learn how to weigh these predictions in the best possible way.
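How weights can change the outcome of a vote is easiest to see on a single sample. The sketch below combines per-model class probabilities; the probabilities, the weights, and the `weighted_vote` helper are all hypothetical and chosen only to illustrate the mechanism.

```python
def weighted_vote(class_probs, weights):
    """Combine per-model class probabilities into one label using model weights."""
    totals = {}
    for probs, weight in zip(class_probs, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    # The label with the largest weighted probability mass wins
    return max(totals, key=totals.get), totals

# Hypothetical probability outputs of three sub-models for a single sample
class_probs = [
    {"benign": 0.70, "malignant": 0.30},
    {"benign": 0.65, "malignant": 0.35},
    {"benign": 0.20, "malignant": 0.80},
]

# Equal weights versus trusting the third model twice as much
label_equal, _ = weighted_vote(class_probs, [1, 1, 1])
label_weighted, _ = weighted_vote(class_probs, [1, 1, 2])
print(label_equal, label_weighted)
```

With equal weights the two mildly confident models carry the vote; doubling the weight of the third, very confident model flips the decision. Stacking replaces such hand-picked weights with weights learned from data.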

    The following infographic best describes Voting-based Ensembles:




    Well, the time has come to apply these concepts and strengthen your intuition and confidence. Let's do it in Python.

    A case study in Python

    The dataset you are going to be using for this case study is popularly known as the Wisconsin Breast Cancer dataset. The task related to it is Classification.

    The dataset contains 699 instances, each described by 10 attributes (a sample ID followed by 9 features) and labeled as either benign or malignant. 16 of the feature values are missing. The dataset only contains numeric values.

    The dataset can be downloaded from here.

    You will implement the Ensembles using the mighty scikit-learn library.

    Let's first import all the Python dependencies you will be needing for this case study.

    import pandas as pd
    import numpy as np
    # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler
    

    Let's load the dataset in a DataFrame object.

    data = pd.read_csv('cancer.csv')
    data.head()
    

    The column "Sample code number" is just an indicator and it's of no use in the modeling. So, let's drop it:

    data.drop(['Sample code number'],axis = 1, inplace = True)
    
    data.head()
    

    You can see that the column is dropped now. Let's get some statistics about the data using pandas' describe() and info() functions:

    data.describe()
    
    data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 699 entries, 0 to 698
    Data columns (total 10 columns):
    Clump Thickness                699 non-null int64
    Uniformity of Cell Size        699 non-null int64
    Uniformity of Cell Shape       699 non-null int64
    Marginal Adhesion              699 non-null int64
    Single Epithelial Cell Size    699 non-null int64
    Bare Nuclei                    699 non-null object
    Bland Chromatin                699 non-null int64
    Normal Nucleoli                699 non-null int64
    Mitoses                        699 non-null int64
    Class                          699 non-null int64
    dtypes: int64(9), object(1)
    memory usage: 54.7+ KB
    

    As mentioned earlier, the dataset contains missing values. The column named "Bare Nuclei" contains them. Let's verify.

    data['Bare Nuclei']
    
    0       1
    1      10
    2       2
    3       4
    4       1
    5      10
    6      10
    7       1
    8       1
    9       1
    10      1
    11      1
    12      3
    13      3
    14      9
    15      1
    16      1
    17      1
    18     10
    19      1
    20     10
    21      7
    22      1
    23      ?
    24      1
    25      7
    26      1
    27      1
    28      1
    29      1
           ..
    669     5
    670     8
    671     1
    672     1
    673     1
    674     1
    675     1
    676     1
    677     1
    678     1
    679     1
    680    10
    681    10
    682     1
    683     1
    684     1
    685     1
    686     1
    687     1
    688     1
    689     1
    690     1
    691     5
    692     1
    693     1
    694     2
    695     1
    696     3
    697     4
    698     5
    Name: Bare Nuclei, dtype: object
    

    You can spot some "?"s in it, right? Well, these are your missing values, and you will be imputing them with Mean Imputation. But first, you will replace those "?"s with 0's.

    data.replace('?',0, inplace=True)
    
    data['Bare Nuclei']
    
    0       1
    1      10
    2       2
    3       4
    4       1
    5      10
    6      10
    7       1
    8       1
    9       1
    10      1
    11      1
    12      3
    13      3
    14      9
    15      1
    16      1
    17      1
    18     10
    19      1
    20     10
    21      7
    22      1
    23      0
    24      1
    25      7
    26      1
    27      1
    28      1
    29      1
           ..
    669     5
    670     8
    671     1
    672     1
    673     1
    674     1
    675     1
    676     1
    677     1
    678     1
    679     1
    680    10
    681    10
    682     1
    683     1
    684     1
    685     1
    686     1
    687     1
    688     1
    689     1
    690     1
    691     5
    692     1
    693     1
    694     2
    695     1
    696     3
    697     4
    698     5
    Name: Bare Nuclei, dtype: object
    

    The "?"s are replaced with 0's now. Let's do the missing value treatment now.

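    # SimpleImputer is the replacement for the Imputer class removed in scikit-learn 0.22
    from sklearn.impute import SimpleImputer

    # Convert the DataFrame into a float NumPy array ("Bare Nuclei" is stored as objects)
    values = data.values.astype(float)

    # Mean-impute the 0 placeholders; without missing_values=0 they would be left untouched
    imputer = SimpleImputer(missing_values=0, strategy='mean')
    imputedData = imputer.fit_transform(values)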
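Mean imputation itself is simple arithmetic, which the following standard-library sketch shows on a toy column. The values are made up, and 0 is assumed to be the placeholder for a missing entry, as in the preprocessing step above.

```python
from statistics import mean

# Toy "Bare Nuclei" column in which 0 marks a missing entry (values are made up)
column = [1, 10, 2, 0, 4, 0, 3]

# The mean is computed over the observed values only
observed = [v for v in column if v != 0]
fill_value = mean(observed)

# Every placeholder is replaced by that mean
imputed = [fill_value if v == 0 else v for v in column]
print(fill_value, imputed)
```

The important detail is that the placeholders are excluded when the mean is computed; otherwise they would drag the fill value down.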
    

    Now if you take a look at the dataset itself, you will see that the columns do not all share the same range. This may cause a problem, because a column with a larger range can dominate one with a smaller range in many models. To address this problem, you will normalize every column to a uniform range, in this case, 0 - 1. Note that this also rescales the Class column, which still holds only two distinct values afterwards.

    scaler = MinMaxScaler(feature_range=(0, 1))
    normalizedData = scaler.fit_transform(imputedData)
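The min-max formula behind this step is (x - min) / (max - min), stretched to the target range. Below is a minimal standard-library sketch; `min_max_scale` is a hypothetical helper, the feature values are illustrative, and the 2/4 label coding is an assumption based on how the UCI file encodes benign and malignant.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values so the smallest becomes lo and the largest becomes hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no spread; map it to the lower bound
        return [lo for _ in values]
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

clump_thickness = [1, 4, 5, 10]  # illustrative feature values on a 1-10 scale
class_labels = [2, 4, 4, 2]      # illustrative labels coded as 2 and 4

scaled_feature = min_max_scale(clump_thickness)
scaled_labels = min_max_scale(class_labels)
print(scaled_feature)
print(scaled_labels)
```

A two-valued label column simply becomes 0s and 1s, so scaling the whole array does not damage the class information.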
    

    Wonderful!

    You have performed all the preprocessing that was required in order to perform your Ensembling experiments.

    You will start with Bagging based Ensembling. In this case, you will use a Bagged Decision Tree.

    # Bagged Decision Trees for Classification - necessary dependencies
    
    from sklearn import model_selection
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    

    You have imported the dependencies for the Bagged Decision Trees.

    # Segregate the features from the labels
    X = normalizedData[:,0:9]
    Y = normalizedData[:,9]
    
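    # random_state only has an effect when shuffle=True (recent scikit-learn rejects it otherwise), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    cart = DecisionTreeClassifier()
    num_trees = 100
    # The base estimator is passed positionally because its keyword name differs across scikit-learn versions
    model = BaggingClassifier(cart, n_estimators=num_trees, random_state=7)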
    results = model_selection.cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())
    
    0.9571428571428573
    

    Let's see what you did in the above cell.

    First, you set up 10-fold cross-validation. After that, you instantiated a Decision Tree Classifier and wrapped it in a Bagging-based Ensemble of 100 such trees. Then you evaluated the model with cross-validation.

    Your model performed pretty well. It yielded an accuracy of 95.71%.

    Brilliant! Let's implement the other ones.

    (If you want a quick refresher on cross-validation then this is the link to go for.)
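As a quick illustration of what a 10-fold split does, the sketch below generates contiguous, unshuffled train/test index lists in plain Python. `k_fold_indices` is a hypothetical helper written for this explanation, not the scikit-learn implementation, although it follows the same convention of giving the first few folds one extra sample.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=699, n_splits=10))
for train, test in folds[:3]:
    print(len(train), len(test), test[0], test[-1])
```

Every sample lands in exactly one test fold, so the mean of the ten scores uses each of the 699 rows for testing once.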

    # AdaBoost Classification
    
    from sklearn.ensemble import AdaBoostClassifier
    seed = 7
    num_trees = 70
    # random_state only has an effect when shuffle=True (recent scikit-learn rejects it otherwise), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
    results = model_selection.cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())
    
    0.9557142857142857
    

    In this case, you did an AdaBoost classification (with 70 trees) which is based on Boosting type of Ensembling. The model gave you an accuracy of 95.57% for 10-fold cross-validation.

    Finally, it's time for you to implement the Voting-based Ensemble technique.

    # Voting Ensemble for Classification
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import VotingClassifier
    
    # random_state only has an effect when shuffle=True (recent scikit-learn rejects it otherwise), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    # create the sub models
    estimators = []
    model1 = LogisticRegression()
    estimators.append(('logistic', model1))
    model2 = DecisionTreeClassifier()
    estimators.append(('cart', model2))
    model3 = SVC()
    estimators.append(('svm', model3))
    # create the ensemble model
    ensemble = VotingClassifier(estimators)
    results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
    print(results.mean())
    

    0.9642857142857142

    You implemented a Voting-based Ensemble model in which Logistic Regression, a Decision Tree and a Support Vector Machine vote together. The model performed the best so far, with an accuracy of 96.43% for 10-fold cross-validation.

    Now, let's get you familiarized with some common pitfalls of Ensemble learning.

    Pitfalls of Ensemble learning

    In general, it is not true that an ensemble will always perform better. There are several ensemble methods, each with its own advantages and weaknesses. Which one to use then depends on the problem at hand.

    For example, if you have models with high variance (they over-fit your data), then you are likely to benefit from using Bagging. If you have biased models, it is better to combine them with Boosting. There are also different strategies to form ensembles. The topic is just too broad to cover in one post.

    But the point is: if you use the wrong ensemble method for your setting, you are not going to do better. For example, using Bagging with a biased model is not going to help.
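A small simulation makes this point tangible. It is an idealized sketch: the "models" are just random number generators around a known true value, and the averaged copies are assumed to be independent, which real bagged models are not.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
TRUE_VALUE = 10.0

def noisy_model():
    # High variance, no bias: right on average, but scattered
    return TRUE_VALUE + rng.gauss(0, 2.0)

def biased_model():
    # Low variance, but systematically 3 units too low
    return TRUE_VALUE - 3.0 + rng.gauss(0, 0.1)

def averaged(model, n_models):
    # An equally weighted ensemble of n_models independent copies
    return mean(model() for _ in range(n_models))

trials = 2000
single_noisy = [noisy_model() for _ in range(trials)]
bagged_noisy = [averaged(noisy_model, 25) for _ in range(trials)]
bagged_biased = [averaged(biased_model, 25) for _ in range(trials)]

print("variance, single noisy model:", round(pvariance(single_noisy), 3))
print("variance, 25 averaged:       ", round(pvariance(bagged_noisy), 3))
print("mean error, 25 biased models:", round(mean(bagged_biased) - TRUE_VALUE, 3))
```

Averaging shrinks the spread of the high-variance model dramatically, but the averaged biased models stay about 3 units off: averaging cannot remove an error that every member shares.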

    Also, if you need to work in a probabilistic setting, ensemble methods may not work either. It is known that Boosting (in its most popular forms like AdaBoost) delivers poor probability estimates. That is, if you would like to have a model that allows you to reason about your data, not only classification, you might be better off with a graphical model.

    So, in this post, you were introduced to the Ensemble learning technique. You covered its basics and how it improves your model's performance, along with its three main types.

    Also, you implemented these three types in Python with the help of scikit-learn, and along the way you gained a bit of knowledge about the necessary preprocessing steps.

    That's quite a feat! Well done! In this final section, I suggest some further undertakings on Ensembles which you might want to consider.

    Take it further:

    • Try other Boosting-based Ensemble techniques viz. Gradient Boosting, XGBoost, etc.
    • Play with the different parameter settings that scikit-learn offers in Ensembles and then try to find why a particular setting performed well. This will make your understanding even stronger.
    • Try Ensemble learning on a variety of datasets to understand where you should and where you should not apply Ensemble learning. For finding datasets Kaggle, UCI Repository, etc. are good places to search.
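As a possible starting point for the first suggestion, here is a minimal Gradient Boosting sketch with scikit-learn. It assumes a recent scikit-learn release and uses the library's bundled breast cancer dataset as a stand-in, which is a different dataset from the cancer.csv file used in the case study; the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data: scikit-learn's bundled breast cancer set, NOT the cancer.csv used above
X, y = load_breast_cancer(return_X_y=True)

# shuffle=True is what makes random_state meaningful for KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
model = GradientBoostingClassifier(n_estimators=50, random_state=7)

scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean())
```

Swapping in a different estimator or changing n_estimators and the number of folds is an easy way to start the parameter experiments suggested above.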

    I hope you enjoyed this tutorial. Let me know your doubts (if you have any) in the comments section.

    If you are interested in learning more about ensembles in Machine Learning, take DataCamp's Machine Learning with Tree-Based Models in Python course.
