# Ensemble Learning in R with SuperLearner

Did you ever want to build a machine learning ensemble, but did not know
how to get started? This tutorial will help you on your way with
`SuperLearner`

. This R package provides you with an easy way to create
machine learning ensembles with the use of high level functions by
offering a standardized wrapper to fit an ensemble using popular R
machine learing libraries such as `glmnet`

, `knn`

, `randomForest`

and
many more!

In this tutorial, you'll tackle the following topics:

- What Are Ensembles? Go over a short definition of ensembles before you start tackling the practical example that this tutorial offers!
- Why
`SuperLearner`

and what does this package actually do? - Ensemble Learning in R with
`SuperLearner`

: in this section, you'll learn how to install the packages you need, prepare the data and create your first ensemble model! You'll also see how you can train the mode and make predictions with it. In doing so, you'll cover Kernel Support Vector Machines, Bayes Generalized Linear Models and Bagging. Lastly, you'll see how you can tune the hyperparameters to further improve your model's performance!

When you are finished, you will have fit your first ensemble, predicted new data and tuned parts of the ensemble.

## What Are Ensembles?

All this is awesome, but what exactly is an ensemble?

An ensemble occurs when the probability predictions or numerical predictions of multiple machine models are combined by averaging, weighting each model and adding them together or using the most common observation between models. This provides a multiple vote scenario that is likely to drive a prediction to the correct class or closer to the correct number in regression models. Ensembles tend to work best when there are disagreements between the models being fit. The concept of combining multiple models also seems to perform well in practice, often above implementations of single algorithms.

Ensembles can be created manually by fitting multiple models, predicting with each of them and then combining them.

## Why `SuperLearner`

?

Now that you have seen what ensembles are, you might ask yourself what
the `SuperLearner`

library exactly does. Well, simply put,
`SuperLearner`

is an algorithm that uses cross-validation to estimate
the performance of multiple machine learning models, or the same model
with different settings. It then creates an optimal weighted average of
those models, which is also called an “ensemble”, using the test data
performance.

But why would you use `SuperLearner`

?

Even though you'll learn more about the power of this R package throughout the tutorial, you could already consider this list of advantages:

`SuperLearner`

allows you to fit an ensemble model by simply adding algorithms- As you already read before,
`SuperLearner`

uses cross-validation, which is inherently used to estimate risk for all models. This makes`SuperLearner`

great for model comparison! `SuperLearner`

makes ensembling efficient by automatically estimating the weights of the ensemble. This is normally a task that can be very tedious and requires a lot of experimentation.`SuperLearner`

automatically removes models that do not contribute to the ensemble prediction power, this leaves you free to experiment with numerous algorithms!

Let's take a look at the process to use `SuperLearner`

.

## Ensemble Learning in R with `SuperLearner`

### Install The `SuperLearner`

Package

`SuperLearner`

can be installed from CRAN with the `install.packages()`

function and then loaded into your workspace using the `library()`

function:

```
# Install the package
install.packages("SuperLearner")
# Load the package
library("SuperLearner")
```

### Prepare Your Data

To illustrate `SuperLearner`

, you will use the Pima Indian Women data
set from the
`MASS`

package. The MASS package contains a training set, which is used for
training a model and a test set, which is used for assessing the
performance of the model on unseen data. The data set provides some
descriptive factors about the Pima Indian Women such as number of
pregnancies and age and whether or not they have diabetes. The purpose
of the data set is to try to predict diabetes.

The `type`

column is the column that indicates the presence of diabetes.
It is a binary `Yes`

or `No`

column, which means that it follows a
binomial distribution.

**Note** that, without getting too theoretical, a binomial distribution
is a collection of Bernoulli trials, which are a success or failure test
in probability. A binomial distribution is easily identified because
there are only two possible responses, in this case `Yes`

or `No`

. Why
are you getting into this? Well, `SuperLearner`

requires you to define
the family of problem your model should belong to. You will see that in
more detail when you fit the model later in this tutorial.

```
# Get the `MASS` library
library(MASS)
# Train and test sets
train <- Pima.tr
test <- Pima.te
# Print out the first lines of `train`
head(train)
```

```
## npreg glu bp skin bmi ped age type
## 1 5 86 68 28 30.2 0.364 24 No
## 2 7 195 70 33 25.1 0.163 55 Yes
## 3 5 77 82 41 35.8 0.156 35 No
## 4 0 165 76 43 47.9 0.259 26 No
## 5 0 107 60 25 26.4 0.133 23 No
## 6 5 97 76 27 35.6 0.378 52 Yes
```

```
# Get a summary of `train`
summary(train)
```

```
## npreg glu bp skin
## Min. : 0.00 Min. : 56.0 Min. : 38.00 Min. : 7.00
## 1st Qu.: 1.00 1st Qu.:100.0 1st Qu.: 64.00 1st Qu.:20.75
## Median : 2.00 Median :120.5 Median : 70.00 Median :29.00
## Mean : 3.57 Mean :124.0 Mean : 71.26 Mean :29.21
## 3rd Qu.: 6.00 3rd Qu.:144.0 3rd Qu.: 78.00 3rd Qu.:36.00
## Max. :14.00 Max. :199.0 Max. :110.00 Max. :99.00
## bmi ped age type
## Min. :18.20 Min. :0.0850 Min. :21.00 No :132
## 1st Qu.:27.57 1st Qu.:0.2535 1st Qu.:23.00 Yes: 68
## Median :32.80 Median :0.3725 Median :28.00
## Mean :32.31 Mean :0.4608 Mean :32.11
## 3rd Qu.:36.50 3rd Qu.:0.6160 3rd Qu.:39.25
## Max. :47.90 Max. :2.2880 Max. :63.00
```

**Tip**: if you want to have more information on the variables of this
data set, use the `help()`

function, just like here:

```
help(Pima.tr)
```

By running the above command, you can derive that the `type`

column
indicates diabetes.

`SuperLearner`

also requires the response variable to be encoded if it
is a classification problem. Since you are solving a binomial
classification problem, you will encode the factor for the variable
`type`

to 0-1 encoding:

```
y <- as.numeric(train[,8])-1
ytest <- as.numeric(test[,8])-1
```

Since the `type`

column was a factor, R will encode it to 1 and 2, but
this is not what you want: ideally, you would like to work with the type
encoded as 0 and 1, which are "No" and "Yes", respectively. In the above
code chunk, you subtract `1`

from the whole set to get your 0-1
encoding. R will also encode this in the factor order.

The package also requires that the predictors (`X`

) and responses (`Y`

)
to be in their own data structures. You split out `Y`

above, now you
need to split out `X`

. You will go ahead and split out your test set as
well:

```
x <- data.frame(train[,1:7])
xtest <- data.frame(test[,1:7])
```

**Note** that some algorithms do not just require a data frame, but
would require a model matrix saved as a data frame. An example is the
`nnet`

algorithm. When solving a regression problem, you will almost
always use the model matrix to store your data for SuperLearner. All a
model matrix does is split out factor variables into their own columns
and recodes them as 0-1 values instead of text values. It does not
impact numerical columns. The model matrix will increase the number of
columns an algorithm has to deal with, therefore it could increase
computational time. For a small data set, such as this, there is minimal
impact, but larger data sets could be heavily affected. The moral of the
story is to decide which algorithms you will want to try before fitting
your model. For this simple example, you will just use the data frame
for the existing data structure.

### Your First Ensemble Model with `SuperLearner`

To start creating your first model, you can use the following command to preview what models are available in the package:

```
listWrappers()
```

```
## All prediction algorithm wrappers in SuperLearner:
## [1] "SL.bartMachine" "SL.bayesglm" "SL.biglasso"
## [4] "SL.caret" "SL.caret.rpart" "SL.cforest"
## [7] "SL.dbarts" "SL.earth" "SL.extraTrees"
## [10] "SL.gam" "SL.gbm" "SL.glm"
## [13] "SL.glm.interaction" "SL.glmnet" "SL.ipredbagg"
## [16] "SL.kernelKnn" "SL.knn" "SL.ksvm"
## [19] "SL.lda" "SL.leekasso" "SL.lm"
## [22] "SL.loess" "SL.logreg" "SL.mean"
## [25] "SL.nnet" "SL.nnls" "SL.polymars"
## [28] "SL.qda" "SL.randomForest" "SL.ranger"
## [31] "SL.ridge" "SL.rpart" "SL.rpartPrune"
## [34] "SL.speedglm" "SL.speedlm" "SL.step"
## [37] "SL.step.forward" "SL.step.interaction" "SL.stepAIC"
## [40] "SL.svm" "SL.template" "SL.xgboost"
##
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP" "screen.corRank" "screen.glmnet"
## [4] "screen.randomForest" "screen.SIS" "screen.template"
## [7] "screen.ttest" "write.screen.template"
```

You will notice there are prediction algorithm wrappers and screening
algorithm wrappers. There are some popular libraries in here that can be
used for either classification, regression or both. The screening
algorithms are used for automated variable selection by `SuperLearner`

.

When you want to use an algorithm from the above list, you'll need to
have the package installed in your environment. That's because
`SuperLearner`

is really calling these packages and then fitting the
models when the method is used. That also means that if you never use
the method `SL.caret`

, for example, you do not need to have the `caret`

package installed.

Fitting the model is simple, but you'll go through this step-by-step with a single model example.

You will fit the Ranger algorithm, which is a faster implementation of the famous Random Forest.

**Remember** that a Random Forest is a powerful method which is actually
an ensembling of decision trees. Decision trees work by observing your
data and calculating a probability split between each variable in the
model, giving you a pathway to your prediction. Decision trees have a
habit of overfitting to their data, which means they do not generalize
well to new data. Random Forest solves this problem by growing multiple
decision trees based on numerous samples of data and then averages those
predictions to find the correct prediction. It also only selects a
subset of the features for each sample, which is how it differs from
tree bagging. This creates a model that is not overfitting the data.
Cool, right?

In this case, it could be that you first need to install the `ranger`

library with `install.packages()`

function before you can start fitting
the model.

If you have done that, you can continue and use `SL.ranger`

in the
`SuperLearner()`

function.

Since Random Forest -and therefore Ranger- contain random sampling in
the algorithm, you will not get the same result if you fit it more than
once. Therefore, for this exercise, you will set the seed so you can
reproduce the examples and also compare multiple models on the same
random seed baseline. R uses `set.seed()`

to set the random seed. The
seed can be any number, in this case, you will use `150`

.

```
set.seed(150)
single.model <- SuperLearner(y,
x,
family=binomial(),
SL.library=list("SL.ranger"))
```

`SuperLearner`

requires a `Y`

variable, which is the response or outcome
you want, an `X`

variable, which are the predictor variables, the
`family`

to use, which can be guassian or binomial and the library to
use in the form of a list. That's `SL.ranger`

in this case.

Do you remember the whole binomial distribution discussion that you read about earlier? Now, you see why you needed to know that: using the gaussian model would not have yielded proper predictions in your 0-1 range.

Next, simply printing the model provides the coefficient, which is the weight of the algorithm in the model and the risk factor which is the error the algorithm produces. Behind the scenes, the package fits each algorithm used in the ensemble to produce the risk factor.

```
single.model
```

```
##
## Call:
## SuperLearner(Y = y, X = x, family = binomial(), SL.library = list("SL.ranger"))
##
##
##
## Risk Coef
## SL.ranger_All 0.1759541 1
```

In this case, your risk factor is less than 0.20. Of course, this will
need to be tested through external cross validation and in the test set,
but it is a good start. The beauty of `SuperLearner`

is that it tries to
automatically build an ensemble through the use of cross validation. Of
course, if there is only one model, then it gets the full weight of the
ensemble.

So this single model is great, but you can do this without
`SuperLearner`

. How can you fit ensemble models?

### Training an Ensemble with R: Kernel Support Vector Machines, Bayes GLM and Bagging

Ensembling with SuperLearner is as simple as selecting the algorithms to
use. In this case, let's add Kernel Support Vector Machines (KSVM) from
the `kernlab`

package, Bayes Generalized Linear Models (GLM) from the
`arm`

package and
bagging from the `ipred`

package.

But what are KSVM and Bayes GLM?

The KSVM uses something called "the kernel trick" to calculate distance between points. Instead of having to draw a map of the features and calculate coordinates, the kernel method calculates the inner products between points. This allows for faster computation. Then the support vector machine is used to learn the non-linear boundary between points in classification. A support vector machine attempts to create a gap between two classes in a machine learning problem that is often nonlinear. It then classifies new points on either side of that gap based on where they are in space.

The Bayes GLM model is simply an implementation of logistic regression. At least in this case, where you are classifying a 0-1 problem. Bayes GLM differs from KSVM in that it uses an augmented regression algorithm to update the coefficients at each step. Bagging is similar to random forest above without subsetting the features. This means that you will grow multiple decision trees from random samples and average them together to get your prediction.

Now let's fit your first ensemble!

**Tip**: don't forget to install these packages if you don't have them
yet! Additionally, you might also be prompted to install other required packages.

```
# Set the seed
set.seed(150)
# Fit the ensemble model
model <- SuperLearner(y,
x,
family=binomial(),
SL.library=list("SL.ranger",
"SL.ksvm",
"SL.ipredbagg",
"SL.bayesglm"))
# Return the model
model
```

```
##
## Call:
## SuperLearner(Y = y, X = x, family = binomial(), SL.library = list("SL.ranger",
## "SL.ksvm", "SL.ipredbagg", "SL.bayesglm"))
##
##
## Risk Coef
## SL.ranger_All 0.1756230 0.000000
## SL.ksvm_All 0.1838340 0.000000
## SL.ipredbagg_All 0.1664828 0.524182
## SL.bayesglm_All 0.1677593 0.475818
```

Adding these algorithms improved your model and changed the landscape.
Ranger and KVSM have a coefficient of zero, which means that it is not
weighted as part of the ensemble anymore. Bayes GLM and Bagging make up
the rest of the weight of the model. You will notice `SuperLearner`

is
calculating this risk for you and deciding on the optimal model mix that
will reduce the error.

To understand each model's specific contribution to the model and the
variation, you can use `SuperLearner`

's internal cross-validation
function `CV.SuperLearner()`

. To set the number of folds, you can use the `V`

argument. In this case, you will set it to `5`

:

```
# Set the seed
set.seed(150)
# Get V-fold cross-validated risk estimate
cv.model <- CV.SuperLearner(y,
x,
V=5,
SL.library=list("SL.ranger",
"SL.ksvm",
"SL.ipredbagg",
"SL.bayesglm"))
# Print out the summary statistics
summary(cv.model)
```

```
##
## Call:
## CV.SuperLearner(Y = y, X = x, V = 5, SL.library = list("SL.ranger",
## "SL.ksvm", "SL.ipredbagg", "SL.bayesglm"))
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 5
##
## Algorithm Ave se Min Max
## Super Learner 0.17277 0.014801 0.16250 0.19557
## Discrete SL 0.17964 0.014761 0.16363 0.19244
## SL.ranger_All 0.17866 0.015004 0.14811 0.20518
## SL.ksvm_All 0.19382 0.020301 0.15685 0.26215
## SL.ipredbagg_All 0.17791 0.015858 0.15831 0.19244
## SL.bayesglm_All 0.16628 0.014318 0.15322 0.18022
```

The summary of cross validation shows the average risk of the model, the variation of the model and the range of the risk.

Plotting this also produces a nice plot of the models used and their variation:

```
plot(cv.model)
```

It's easy to see that Bayes GLM performs the best on average while KSVM
performs the worst and contains a lot of variation compared to the other
models. The beauty of `SuperLearner`

is that, if a model does not fit
well or contribute much, it is just weighted to zero! There is no need
to remove it and retrain unless you plan on retraining the model in the
future. Just remember that proper model training involves cross
validation of the entire model. In a real-world setting, that is how you
would determine the risk of the model before predicting new data.

### Make Predictions with SuperLearner

With the specific command `predict.SuperLearner()`

you can easily make
predictions on new data sets. That means that you can not use the normal
`predict()`

function!

```
predictions <- predict.SuperLearner(model, newdata=xtest)
```

The function `predict.SuperLearner()`

takes a model argument (a
SuperLearner fit model) and new data to predict on. Predictions will
first return the overall ensemble predictions:

```
head(predictions$pred)
```

```
## [,1]
## [1,] 0.79322181
## [2,] 0.11895658
## [3,] 0.04612200
## [4,] 0.05928159
## [5,] 0.68824522
## [6,] 0.54373451
```

It will also return the individual library predictions:

```
head(predictions$library.predict)
```

```
## SL.ranger_All SL.ksvm_All SL.ipredbagg_All SL.bayesglm_All
## [1,] 0.796 0.8089502 0.82086658 0.76276712
## [2,] 0.129 0.1580203 0.18586049 0.04525230
## [3,] 0.016 0.1579566 0.06255427 0.02801949
## [4,] 0.102 0.1885473 0.07238268 0.04484885
## [5,] 0.638 0.7108875 0.58791672 0.79877149
## [6,] 0.550 0.6898737 0.37488066 0.72975132
```

This allows you to see how each model classified each observation. This could be useful in debugging the model or fitting multiple models at once to see which to use further.

You may have noticed the prediction quantities being returned. They are in the form of probabilities. That means that you will need a cut off threshold to determine if you should classify a one or zero. This only needs to be done in the binomial classification case, not regression.

Normally, you would determine this in training with cross-validation,
but for simplicity, you will use a cut off of 0.50. Since this is a
simple binomial problem, you will use `dplyr`

's `ifelse()`

function to
recode your probabilities:

```
# Load the package
library(dplyr)
# Recode probabilities
conv.preds <- ifelse(predictions$pred>=0.5,1,0)
```

Now you can build a confusion matrix with `caret`

to review the results:

```
# Load in `caret`
library(caret)
# Create the confusion matrix
cm <- confusionMatrix(conv.preds, ytest)
# Return the confusion matrix
cm
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 199 45
## 1 24 64
##
## Accuracy : 0.7922
## 95% CI : (0.7445, 0.8345)
## No Information Rate : 0.6717
## P-Value [Acc > NIR] : 8.166e-07
##
## Kappa : 0.5044
## Mcnemar's Test P-Value : 0.01605
##
## Sensitivity : 0.8924
## Specificity : 0.5872
## Pos Pred Value : 0.8156
## Neg Pred Value : 0.7273
## Prevalence : 0.6717
## Detection Rate : 0.5994
## Detection Prevalence : 0.7349
## Balanced Accuracy : 0.7398
##
## 'Positive' Class : 0
##
```

You are getting around 0.7921687 accuracy on this data set, which is good performance for this data set. Many algorithms have scored higher, but this is good for a quick ensemble. With some proper training with cross-validation and trying some different models, it is easy to see how you can quickly improve this score.

### Tuning Hyperparameters

While model performance is not terrible, you can try to improve your
performance by tuning some hyperparameters of some of the models that
you have in the ensemble. Ranger was not weighted heavily in your model,
but maybe that is because you need more trees and need to tune mtry
parameter. Maybe you can improve bagging as well by increasing the
`nbagg`

parameter to `250`

from the default of `25`

.

There are two methods for doing this: either you define a function that
calls the learner and modifies a parameter or you use the
`create.Learner()`

function. In the next sections, you'll learn more
about these options.

#### Defining a Function

The first one is with the help of `function()`

. Here, you would define a
function that calls the learner and modifies a parameter. The function
call uses the ellipsis `...`

to pass along additional arguments to a
function. Those three little dots allow the modification to a formula
without having to specify in the function what those modifications are.
This means if you are changing 10 parameters, you do not need 10 objects
in the function to map within the function. It is a generalizable way to
write a function.

```
SL.ranger.tune <- function(...){
SL.ranger(..., num.trees=1000, mtry=2)
}
SL.ipredbagg.tune <- function(...){
SL.ipredbagg(..., nbagg=250)
}
```

`SL.ranger.tune`

is the name of your modified `ranger`

method and
`SL.ipredbagg.tune`

is the name of your modified `ipredbagg`

method. Now
that you have some new learner functions created, you can pass these
along to the cross validation formula to see if the performance
improves.

**Note** that you will keep the original `SL.ranger`

and `SL.ipredbagg`

functions in the algorithm to see if performance improves on your tuned
versions of the functions.

```
# Set the seed
set.seed(150)
# Tune the model
cv.model.tune <- CV.SuperLearner(y,
x,
V=5,
SL.library=list("SL.ranger",
"SL.ksvm",
"SL.ipredbagg","SL.bayesglm",
"SL.ranger.tune",
"SL.ipredbagg.tune"))
# Get summary statistics
summary(cv.model.tune)
```

```
##
## Call:
## CV.SuperLearner(Y = y, X = x, V = 5, SL.library = list("SL.ranger",
## "SL.ksvm", "SL.ipredbagg", "SL.bayesglm", "SL.ranger.tune", "SL.ipredbagg.tune"))
##
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 5
##
## Algorithm Ave se Min Max
## Super Learner 0.17272 0.014969 0.15849 0.19844
## Discrete SL 0.17250 0.014989 0.15645 0.18430
## SL.ranger_All 0.17897 0.015084 0.15388 0.19920
## SL.ksvm_All 0.19573 0.020278 0.16095 0.26304
## SL.ipredbagg_All 0.17667 0.015629 0.16473 0.18898
## SL.bayesglm_All 0.16628 0.014318 0.15322 0.18022
## SL.ranger.tune_All 0.17637 0.014882 0.15218 0.19793
## SL.ipredbagg.tune_All 0.17813 0.015869 0.16455 0.19260
```

```
# Plot the tuned model
plot(cv.model.tune)
```

You can see from this plot that `ipredbagg`

seems to improve as you
increase the `nbagg`

parameter as seen in `SL.ipredbagg.tune`

. `Ranger`

seems to get worse with tuning the parameters, but let's leave it in and
see if SuperLearner finds it to be relevant.

Again, the beauty is `SuperLearner`

will just set it to zero if it is
not relevant. Remember, that the best ensembles are not composed of the
best performing algorithms, but rather the algorithms that best
complement each other to classify a prediction.

Let's fit the new model with tuned parameters and see how they weigh:

```
# Set the seed
set.seed(150)
# Create the tuned model
model.tune <- SuperLearner(y,
x,
SL.library=list("SL.ranger",
"SL.ksvm",
"SL.ipredbagg",
"SL.bayesglm",
"SL.ranger.tune",
"SL.ipredbagg.tune"))
# Return the tuned model
model.tune
```

```
##
## Call:
## SuperLearner(Y = y, X = x, SL.library = list("SL.ranger", "SL.ksvm",
## "SL.ipredbagg", "SL.bayesglm", "SL.ranger.tune", "SL.ipredbagg.tune"))
##
##
##
## Risk Coef
## SL.ranger_All 0.1748247 0.0000000
## SL.ksvm_All 0.1974033 0.0000000
## SL.ipredbagg_All 0.1745503 0.0000000
## SL.bayesglm_All 0.1634855 0.7162423
## SL.ranger.tune_All 0.1725514 0.0000000
## SL.ipredbagg.tune_All 0.1711161 0.2837577
```

`SL.bayesglm`

and `SL.ipredbagg.tune`

are now the only algorithms
weighted in the ensemble. Predicting on the test set gives the following
result:

```
# Gather predictions for the tuned model
predictions.tune <- predict.SuperLearner(model.tune, newdata=xtest)
# Recode predictions
conv.preds.tune <- ifelse(predictions.tune$pred>=0.5,1,0)
# Return the confusion matrix
confusionMatrix(conv.preds.tune,ytest)
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 200 43
## 1 23 66
##
## Accuracy : 0.8012
## 95% CI : (0.7542, 0.8428)
## No Information Rate : 0.6717
## P-Value [Acc > NIR] : 1.116e-07
##
## Kappa : 0.5271
## Mcnemar's Test P-Value : 0.01935
##
## Sensitivity : 0.8969
## Specificity : 0.6055
## Pos Pred Value : 0.8230
## Neg Pred Value : 0.7416
## Prevalence : 0.6717
## Detection Rate : 0.6024
## Detection Prevalence : 0.7319
## Balanced Accuracy : 0.7512
##
## 'Positive' Class : 0
##
```

This gives you a little improvement on the test set and illustrates the
concepts of using `SuperLearner`

for model tuning.

`create.Learner()`

The second method for tuning hyperparameters is to use the
`create.Learner()`

function. This allows you to customize an existing
`SuperLearner`

:

```
learner <- create.Learner("SL.ranger", params=list(num.trees=1000, mtry=2))
learner2 <- create.Learner("SL.ipredbagg", params=list(nbagg=250))
```

The learner character string is the first argument to the
`create.Learner()`

function. Then you pass a list of the parameters to
modify. This will create an object:

```
learner
```

```
## $grid
## NULL
##
## $names
## [1] "SL.ranger_1"
##
## $base_learner
## [1] "SL.ranger"
##
## $params
## $params$num.trees
## [1] 1000
##
## $params$mtry
## [1] 2
```

Now, when passing the learner to SuperLearner, you use the names object in the learner object:

```
# Set the seed
set.seed(150)
# Create a second tuned model
cv.model.tune2 <- CV.SuperLearner(y,
x,
V=5,
SL.library=list("SL.ranger",
"SL.ksvm",
"SL.ipredbagg",
"SL.bayesglm",
learner$names,
learner2$names))
# Get summary statistics
summary(cv.model.tune2)
```

```
##
## Call:
## CV.SuperLearner(Y = y, X = x, V = 5, SL.library = list("SL.ranger",
## "SL.ksvm", "SL.ipredbagg", "SL.bayesglm", learner$names, learner2$names))
##
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 5
##
## Algorithm Ave se Min Max
## Super Learner 0.17272 0.014969 0.15849 0.19844
## Discrete SL 0.17250 0.014989 0.15645 0.18430
## SL.ranger_All 0.17897 0.015084 0.15388 0.19920
## SL.ksvm_All 0.19573 0.020278 0.16095 0.26304
## SL.ipredbagg_All 0.17667 0.015629 0.16473 0.18898
## SL.bayesglm_All 0.16628 0.014318 0.15322 0.18022
## SL.ranger_1_All 0.17637 0.014882 0.15218 0.19793
## SL.ipredbagg_1_All 0.17813 0.015869 0.16455 0.19260
```

```
# Plot `cv.model.tune2`
plot(cv.model.tune2)
```

The end result is the same as if you used the first method. It is up to you to use whatever method you desire.

## More Ensemble Models and Machine Learning in R

Wow, you covered a lot of ground! By now, you should have a good handle on the SuperLearner and should have successfully fit your first ensemble with SuperLearner. This package makes it nice and easy to add models really quickly. There are some subtlies with methods and what data form to use. However, when in doubt, a model matrix saved as a data frame almost always works.

As a reminder, you installed and loaded `SuperLearner`

, formatted your
dataset, fit a single model, fit your first ensemble, predicted with the
ensemble and tuned some hyperparameters!

The next steps would be to tackle some more advanced topics with this package, such as parallelization, feature selection and screening, using model matrices, writing your own SuperLearner and ensemble cross validation.