Lecture 14: Bootstrap Aggregating (Bagging) and Random Forests
University of British Columbia Okanagan
We’ve just wrapped up our discussion of decision trees (CARTs), which suggested that they are not particularly “good” models in many scenarios.
Why? Bushy trees tend to have high variance and pruned trees have high bias
Bagging is a procedure for reducing the variance of a statistical learning method but as with all model improvements, we will have to accept some drawbacks…
Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. - ESL
Today we’ll be looking at some ensemble methods to make them more competitive.
Ensemble methods are machine learning techniques that combine the predictions of multiple models to improve overall predictive performance.
They can be divided into two groups: parallel learners (e.g. bagging and random forests) and sequential learners (e.g. boosting).
Today we’ll focus on parallel learners …
Bagging (or bootstrap-aggregating) can be thought of as applying the bootstrap to an entire model-fitting process, rather than just to estimate standard errors of estimators.
By generating \(B\) bootstrapped samples, we can train \(B\) models, \(\hat f^{*b}\) for \(b = 1, \dots, B\), and average their predictions \[\begin{equation} \hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat f^{*b}(x) \end{equation}\]
These trees for each bootstrap sample are grown deep (i.e. “bushy”), and are not pruned.
Consequently, each individual tree will have high variance, but low bias
Averaging these \(B\) trees reduces the variance thus improving predictions and stability.
Why? Think of independent observations \(X_1, \dots, X_n\), each with variance \(\sigma^2\); basic properties of the variance of a sample mean tell us that their average \(\overline{X}\) has variance \(\frac{\sigma^2}{n} < \sigma^2\).
The main idea is to pool a collection of random decision trees
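To make this concrete, here is a minimal “bagging by hand” sketch. It uses rpart for the individual trees (an illustrative choice; the lecture’s own code uses the randomForest package) and assumes the beer data frame beerdat with response price, as in the code later in the lecture.

library(rpart)
set.seed(1)
B <- 100
boot_preds <- replicate(B, {
  idx  <- sample(nrow(beerdat), replace = TRUE)                  # bootstrap sample
  tree <- rpart(price ~ ., data = beerdat[idx, ],
                control = rpart.control(cp = 0, minsplit = 2))   # deep, unpruned tree
  predict(tree, newdata = beerdat)                               # predictions at every x
})
f_bag <- rowMeans(boot_preds)                                    # average the B predictions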
We could get an estimate of error using the validation approach; however, we can get a different estimate “for free”
Recall: a bootstrap sample consists of sampling \(n\) observations with replacement from our dataset
As we are sampling with replacement, some observations are randomly left out while others are duplicated.
It can be shown that for each bootstrap sample, on average roughly \(\frac{1}{3}\) of the observations will not appear …
\[\begin{align} P(x_i \text{ is not selected}) &= P(x_i \text{ not selected on 1st draw}) \times \dots \times P(x_i \text{ not selected on $n^{th}$ draw}) \\ &= \left(\frac{n-1}{n}\right)\times \dots \times \left(\frac{n-1}{n}\right) = \left(1 - \frac{1}{n}\right)^n \\ &\to \frac{1}{e} \approx 0.368 \quad \text{as } n \to \infty \end{align}\] where \(e\) is the exponential constant, approximately 2.718.
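A quick simulation confirms this; the sample size of 69 below is just the beer data’s size, and nothing here is specific to it.

set.seed(1)
n <- 69
prop_oob <- replicate(10000, {
  boot <- sample(n, replace = TRUE)       # one bootstrap sample of indices
  mean(!(1:n %in% boot))                  # proportion of observations left out
})
mean(prop_oob)                            # roughly 0.368, i.e. about 1/e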
Image source: rasbt.github
The figure above illustrates what three random bootstrap samples drawn from a ten-observation dataset \((x_1,x_2,...,x_{10})\) and their ‘out-of-bag’ samples for testing might look like.
Observations left out of a model fit are called out of bag or OOB observations.
These OOB observations are being “hidden” from the model fitting process in the same way our validation set was in CV
For bagging, each tree’s “OOB validation/test set” will contain roughly one third of the observations, while its training (bootstrap) set contains \(n\) observations (some duplicated)
It turns out that we can use these CV-like validation sets to produce CV-like test errors of a bagged model!
We predict the response for the \(i\)th observation using each of the trees in which \(i\) was left out
This will yield approximately \(B/3\) predictions for the \(i\)th observation, which we average.
The respective OOB error estimates for regression and classification are: \[\begin{align} \text{RSS} &= \sum_{i=1}^n (y_i - \hat y_i)^2 & \text{Error rate} &= \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat y_i ) \end{align}\]
So if we wanted to predict the response for \(x_3\), we would use the trees fitted to Bootstrap sample 1 and Bootstrap sample 3 (the samples in which \(x_3\) is out-of-bag) and average the result.
Recall the beer data from last lecture (names removed):
69 observations and 7 columns
OLS Regression:
Regression tree
Let’s calculate the CV RSS error for OLS Regression:
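The slide’s CV code isn’t shown; a minimal leave-one-out sketch (the exact CV scheme used is an assumption), with beerdat and price as elsewhere in the lecture, would look like:

loo_err <- sapply(seq_len(nrow(beerdat)), function(i) {
  fit <- lm(price ~ ., data = beerdat[-i, ])                       # fit without observation i
  beerdat$price[i] - predict(fit, newdata = beerdat[i, , drop = FALSE])
})
sum(loo_err^2)                                                     # CV RSS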
[1] 57.33652
and the (actual) CV RSS error estimate for Regression Trees…
library(randomForest)
set.seed(134891)
p <- 5   # number of predictors
(beerbag <- randomForest(price~., data=beerdat, ntree=500, mtry = p))   # mtry = p, i.e. bagging
...
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 0.9338468
% Var explained: 54.71
...
MSE = RSS/\(n\) \(\implies\) RSS = MSE \(\times\) \(n\) = 0.9338 \(\times\) 69 \(\approx\) 64.44.
Bagging has an improvement over the single tree (CV RSS of 77.57), but not over OLS regression (CV RSS of 57.34)!
It’s worth noting something about the comparison we just made…
Technically, we have not performed cross validation on our bagged trees, and yet we are comparing the (OOB) error to properly cross-validated errors for LM and CART
OOB estimation is a similar (and more efficient) way of providing realistic long-term error estimates
Takeaway: For practical purposes, OOB error from bagged models can be compared to CV error from other models.
Two measures are provided for variable “importance”.
The plot and values are produced using the following:
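The exact call isn’t reproduced on this slide; a likely minimal version, using the bagged fit beerbag from earlier, is sketched below (note that %IncMSE is only reported when the model was fit with importance = TRUE).

importance(beerbag)     # importance measures for each predictor
varImpPlot(beerbag)     # dot chart of the importance measures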
Jump to RF varImpPlot
Calculate the OOB MSE for the full set of \(B\) trees in your random forest, say, \(\text{MSE}_{full}\)
For the variable in question, randomly shuffle the observations (leave the order correct for the other columns).
Carry out predictions, calculate \(\text{MSE}_{shuff}\) then check the change from the full version for the variable in question: \[\begin{align} \text{%IncMSE} = \frac{\text{MSE}_{{shuff}}-\text{MSE}_{full}}{\text{MSE}_{full}} \end{align}\]
Higher values indicate that the variable is more important
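A simplified sketch of this permutation idea for a single predictor (here bitter, assumed to be one of the beer predictors as mentioned later in the lecture); randomForest itself does this tree by tree on the OOB samples and scales the result, so the numbers will differ.

mse_full <- mean((beerdat$price - predict(beerbag))^2)   # predict() with no newdata = OOB predictions

set.seed(1)
shuffled <- beerdat
shuffled$bitter <- sample(shuffled$bitter)               # shuffle only this column
mse_shuff <- mean((beerdat$price - predict(beerbag, newdata = shuffled))^2)

(mse_shuff - mse_full) / mse_full                        # cf. %IncMSE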
IncNodePurity is the total increase in node purity obtained from splitting on the variable, averaged over all trees in the forest.
In other words, IncNodePurity is the increase in homogeneity in the data partitions.
For regression, it is measured by residual sum of squares.
A higher increase of node purity indicates that the variable is more important.
N.B. For classification, MeanDecreaseAccuracy and MeanDecreaseGini are used in place of %IncMSE and IncNodePurity.
Bagging has a close relationship with random forests (RF)
Like bagging, RF builds a number of trees on bootstrapped training samples; however, each time a split in a tree is considered, only a random sample of \(m\) predictors is considered (where \(m < p\), the total number of predictors).
Typically we choose \(m = \sqrt{p}\) and obtain a new sample of \(m\) predictors at each split.
Suppose that there is one very strong predictor in the data set (e.g. bitter in the beer example)
In the collection of bagged trees, most or all of the trees will likely use this strong predictor in the top split, hence, all of the bagged trees will look quite similar to each other.
As discussed in our CV lecture, averaging highly correlated models does not lead to as large of a reduction in variance as averaging many uncorrelated quantities.
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.
This decreases the dependency (or correlation) across the \(B\) trees, decreasing the variance of the averaged model (think back to LOOCV) and therefore increasing stability.
As with bagging, we still lose interpretability relative to simple trees, but we still gain the variable importance measures.
Bagging is a special case of random forests when \(m = p\)
Create \(B\) bootstrap samples
Fit a tree* to each bootstrap sample,
* when creating nodes in each decision tree, only a random subset of features is considered for splitting at each node.
Each decision tree is used to make a prediction.
Aggregate the \(B\) predictions.
We use the randomForest function from the randomForest package to implement both bagging and RF. Usage:
ntree: the number of trees to grow (default = 500)
mtry: the number of variables randomly selected at each node (default is \(\sqrt{p}\) in classification and \(p/3\) in regression, where \(p\) is the number of predictors)
importance: should importance of predictors be assessed?
How do we determine how many subsets of features to consider at each split?
mtry is a hyperparameter of our model and its choice can significantly impact the performance of our random forest.
In practice, the default values are a reasonable starting point and often work well.
Alternatively, you could use cross-validation to choose; e.g. build random forests with mtry = \(\sqrt{p}-5\), \(\sqrt{p}\), \(\sqrt{p}+5\) and select the RF with the highest accuracy
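As a sketch (not the slide’s exact code), one could compare a few candidate mtry values using OOB error, which we argued above is a reasonable stand-in for CV error; the grid below is illustrative.

set.seed(1)
p <- 5
grid <- pmax(1, floor(sqrt(p)) + (-1):1)      # a few values around sqrt(p)
oob_mse <- sapply(grid, function(m) {
  fit <- randomForest(price ~ ., data = beerdat, ntree = 500, mtry = m)
  tail(fit$mse, 1)                            # OOB MSE using all 500 trees
})
grid[which.min(oob_mse)]                      # mtry with the smallest OOB error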
Back to the beer data, now with mtry = \(\sqrt{p}\) by default
...
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 0.8469584
% Var explained: 58.93
...
OOB estimate of RSS = MSE \(\times\) \(n\) = 0.847 \(\times\) 69 = 58.44
| Rank | Method | RSS estimate |
|------|--------|--------------|
| 1 | OLS | 57.34 |
| 2 | RF | 58.44 |
| 3 | Bagging | 64.44 |
| 4 | Single Tree | 77.57 |
For the beer data set, RF does better than the single tree and bagging; however, it still does not outperform OLS!
Jump to Bagging varImpPlot
The smaller (more common) Italian red wine data set
178 samples of wine from three varietals (Barolo, Barbera, Grignolino) grown in the Piedmont region of Italy.
13 chemical measurements, i.e. \(p=13\)
Let’s compare a single classification tree to a tree found using bagging and random forests …
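The code that builds winetreecv (used below) isn’t shown; a plausible reconstruction using the tree package, with the seed and exact calls as assumptions, is:

library(tree)
winetree   <- tree(factor(Class) ~ ., data = wine)        # single classification tree
set.seed(1)                                               # CV uses random folds
winetreecv <- cv.tree(winetree, FUN = prune.misclass)     # $dev holds CV misclassification counts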
n <- nrow(wine)
min.idx <- which.min(winetreecv$dev)
# Misclassification rate
(mscr <- winetreecv$dev[min.idx]/n)
[1] 0.08426966
15/178 \(\implies\) 8.43% misclassifications in the long-run.
For the wine data, we first do bagging by setting mtry = \(m = p\).
The importance argument tells the algorithm that we want it to compute variable importance.
Note that now, we don’t have a single tree which we can plot.
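The fit can be reproduced with a call like the one below, reconstructed from the Call: line in the output that follows; the object name winebag is illustrative.

(winebag <- randomForest(factor(Class) ~ ., data = wine, mtry = 13,
                         importance = TRUE))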
Call:
randomForest(formula = factor(Class) ~ ., data = wine, mtry = 13, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 13
OOB estimate of error rate: 3.37%
Confusion matrix:
1 2 3 class.error
1 57 2 0 0.03389831
2 1 68 2 0.04225352
3 0 1 47 0.02083333
This gives an OOB error rate of 3.37%, versus 8.43% for a standard (single) tree
Random forest on the wine data (Jump to RF varImpPlot):
# use default: mtry = floor(sqrt(p)) = floor(sqrt(13)) = floor(3.605551) = 3
(wineRF <- randomForest(factor(Class)~., # ntree = 500,
data=wine, importance=TRUE))
...
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 1.12%
Confusion matrix:
1 2 3 class.error
1 59 0 0 0.00000000
2 1 69 1 0.02816901
3 0 0 48 0.00000000
...
\(\quad\) Error rates: RF (1.12%) < Bagging (3.37%) < Simple Tree (8.43%)
wine data: Jump to bagging varImpPlot
Comments
Bagging is considered an ensemble method since it combines many simple models (“weak” learners) in order to obtain a single and potentially very powerful model.
The textbook suggests \(B=100\) as generally sufficient, but probably 500 to 1000 is safer (500 is the default).
While bagging can improve predictions for many regression methods, it is particularly useful for decision trees.
For classification trees, we take a majority vote, i.e. the most commonly occurring prediction.
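For example, a toy sketch of that majority-vote aggregation (votes is a hypothetical n \(\times\) B matrix of per-tree class labels):

majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))   # most common label per row
}

votes <- matrix(c("A", "A", "B",        # toy example: 2 observations, 3 trees
                  "B", "B", "B"), nrow = 2, byrow = TRUE)
majority_vote(votes)                    # "A" "B"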