Data 311: Machine Learning

Lecture 14: Bootstrap Aggregating (Bagging) and Random Forests

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • We’ve just wrapped up our discussion of decision trees (CARTs), which suggested that they are not particularly “good” models in many scenarios.

  • Why? Bushy trees tend to have high variance and pruned trees have high bias

  • Bagging is a procedure for reducing the variance of a statistical learning method, but, as with all model improvements, we will have to accept some drawbacks…

Preface

Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. - ESL

Today we’ll be looking at some ensemble methods to make them more competitive

Ensemble Methods

  • Ensemble methods are machine learning techniques that combine the predictions of multiple models to improve overall predictive performance.

  • They can be divided into two groups:

    • Parallel learners, e.g. Random Forests and Bagging
    • Sequential learners, e.g. Boosting (topic of next lecture)

Today we’ll focus on parallel learners …

Bagging

  • Bagging (or bootstrap-aggregating) can be thought of as applying the bootstrap to an entire model-fitting process, rather than just to estimate standard errors of estimators.

  • By generating \(B\) bootstrapped samples, we can train \(B\) models, \(\hat f^{*b}\) for \(b = 1, \dots, B\), and average their predictions \[\begin{equation} \hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat f^{*b}(x) \end{equation}\]

Bye bye pruning

  • The trees fit to the bootstrap samples are grown deep (i.e. “bushy”), and are not pruned.

  • Consequently, each individual tree will have high variance, but low bias

  • Averaging these \(B\) trees reduces the variance thus improving predictions and stability.

  • Why? Think of independent observations \(X_1, \dots, X_n\), each with variance \(\sigma^2\); basic properties of the variance tell us that their average \(\overline{X}\) has variance \(\frac{\sigma^2}{n} < \sigma^2\).
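A quick simulation of this fact in base R (a minimal sketch; the values \(\sigma = 2\) and \(n = 25\) are arbitrary choices for illustration):

# simulate many sample means and compare their variance to sigma^2/n
set.seed(1)
sigma <- 2; n <- 25
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
var(xbar)      # close to sigma^2/n = 4/25 = 0.16
sigma^2 / n    # 0.16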

Comments

  • Bagging is considered an ensemble method since it combines many simple models (“weak” learners) in order to obtain a single and potentially very powerful model.

  • The textbook suggests \(B=100\) as generally sufficient, but 500 to 1000 is probably safer (500 is the default in randomForest).

  • While bagging can improve predictions for many regression methods, it is particularly useful for decision trees.

  • For classification trees, we take a majority vote, i.e. the most commonly occurring prediction.
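A minimal sketch of the majority vote in base R, where preds is a hypothetical vector of per-tree class predictions:

preds <- c("A", "A", "B")           # hypothetical predictions from B = 3 trees
names(which.max(table(preds)))      # majority vote: "A"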

General Idea

The main idea is to pool the predictions from a collection of randomized decision trees.

A diagram of a random forest / bagged classification model: \(n\) trees are fitted (three trees are shown); the first tree predicts A, the second tree predicts A, and the \(n^{th}\) tree predicts B. Since there are more A predictions than B predictions, the majority vote dictates a final prediction of A.


Steps for Bagging

  1. Create \(B\) bootstrap samples
  2. Fit a (bushy) tree to each bootstrap sample
  3. Each decision tree is used to make a prediction.
  4. Aggregate the \(B\) predictions:
    • for regression predictions are typically averaged
    • for classification we use a majority vote.
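These steps can be coded by hand; below is a minimal sketch of bagged regression trees using the tree package (it assumes the beerdat data frame and price response introduced later in this lecture; in practice we use the randomForest package instead):

library(tree)
set.seed(1)
B <- 100
n <- nrow(beerdat)
preds <- matrix(NA, nrow = n, ncol = B)
for (b in 1:B) {
  boot.idx <- sample(1:n, size = n, replace = TRUE)    # 1. bootstrap sample
  fit <- tree(price ~ ., data = beerdat[boot.idx, ])   # 2. fit a tree (grow it deeper via tree.control in practice)
  preds[, b] <- predict(fit, newdata = beerdat)        # 3. predict for every observation
}
bag.pred <- rowMeans(preds)                            # 4. average the B predictions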

Model Assessment

We could get an estimate of error using the validation set approach; however, we can get a different estimate “for free”

  • Recall: a bootstrap sample consists of sampling \(n\) observations with replacement from our dataset

  • As we are sampling with replacement, some observations are randomly left out while others are duplicated.

  • It can be shown that for each bootstrap sample, on average roughly \(\frac{1}{3}\) of the observations will not appear …

Probability calculations

\[\begin{align} P(x_i \text{ is not selected}) &= P(x_i \text{ not selected on the } 1^{st} \text{ draw}) \times \dots \times P(x_i \text{ not selected on the } n^{th} \text{ draw}) \\ &= \left(\frac{n-1}{n}\right)\times \dots \times \left(\frac{n-1}{n}\right) = \left(1 - \frac{1}{n}\right)^n \\ &\xrightarrow{\ n \to \infty\ } \frac{1}{e} \approx \frac{1}{3} \end{align}\] where \(e \approx 2.718\) is the exponential constant.
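A quick numerical check of this limit, and of the out-of-bag fraction for a single bootstrap sample, in base R:

n <- 100
(1 - 1/n)^n    # 0.366..., already close to 1/e
exp(-1)        # 0.3679
# fraction of observations left out of one simulated bootstrap sample
set.seed(1)
mean(!(1:n %in% sample(1:n, size = n, replace = TRUE)))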

Bootstrap Visualization

Image source: rasbt.github

A visualization of 3 Bootstrap samples taken on a data set with 10 observations. In the first bootstrap sample, $x_3, x_7, x_{10}$ are not selected. In the second bootstrap sample, $x_6, x_9$ are not selected.  In the third bootstrap sample, $x_3, x_7, x_8, x_{10}$ are not selected

The figure above illustrates what three random bootstrap samples drawn from a ten-observation dataset \((x_1,x_2,...,x_{10})\), and their corresponding ‘out-of-bag’ samples for testing, might look like.

Recall 5-fold CV Visualization

Out of Bag (OOB)

  • Observations left out of a model fit are called out of bag or OOB observations.

  • These OOB observations are “hidden” from the model-fitting process in the same way our validation set was in CV

  • For bagging, each tree’s “OOB validation/test set” will contain roughly one third of the observations, while its training set (the bootstrap sample) contains \(n\) observations, some of them duplicated

  • It turns out that we can use these CV-like validation sets to produce CV-like test errors of a bagged model!

OOB Estimation

  • We predict the response for the \(i\)th observation using each of the trees in which \(i\) was left out

  • This will yield approximately \(B/3\) predictions for the \(i\)th observation, which we average (or, for classification, take a majority vote of).

  • The respective OOB error estimates for regression and classification are: \[\begin{align} \text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 && \text{and} && \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat y_i ) \end{align}\] where \(\hat y_i\) denotes the OOB prediction for observation \(i\).

So if we wanted to predict the response for \(x_3\) we would use the trees fitted to Bootstrap sample 1 and Bootstrap sample 3 and average the result.

Same visualization that appears on the ‘Bootstrap Visualization’ slide
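In the randomForest package these OOB predictions are returned automatically: for a regression fit, predict() with no newdata (equivalently the $predicted component) gives each observation’s OOB prediction, so we can compute the OOB error ourselves. A sketch, assuming the bagged fit beerbag shown later in this lecture:

oob.pred <- predict(beerbag)               # no newdata => OOB predictions
sum((beerdat$price - oob.pred)^2)          # OOB RSS
mean((beerdat$price - oob.pred)^2)         # OOB MSE (should match the printed "Mean of squared residuals")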

Example: Beer

Recall the beer data from last lecture (names removed):

69 observations and 7 columns

Example: Beer OLS Regression

OLS Regression:

beerdat <- beerdat[,-1] 
attach(beerdat)
# linear model
beerlm <- lm(price~., data=beerdat)

Regression tree

library(tree);  par(mar=c(0,0,0,0))
beert <- tree(price~., data = beerdat) 
plot(beert)
text(beert, digits=4)

Cross-validation Errors

Let’s calculate the CV RSS error for OLS Regression:

library(DAAG)
beercv<-cv.lm(data=beerdat, beerlm, m=10,printit=FALSE)
sum((beercv$price-beercv$cvpred)^2) # 10-fold CV RSS for lm
[1] 57.33652

and the (actual) CV RSS error estimate for Regression Trees…

library(tree); set.seed(456); 
cv.beert <- cv.tree(beert); 
# 10-fold CV RSS for the 6-terminal-node tree
min(cv.beert$dev) 
[1] 77.56593

Bagging Regression Tree

library(randomForest); set.seed(134891)
p = 5 # number of predictors
(beerbag <- randomForest(price~., data=beerdat, ntree=500, mtry = p)) 
...
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 5

          Mean of squared residuals: 0.9338468
                    % Var explained: 54.71
...

MSE = RSS/\(n\) \(\implies\) RSS = MSE \(\times\) \(n\) = 0.9338 \(\times\) 69 \(\approx\) 64.44.

Bagging improves on the single tree (CV RSS of 77.57), but not on OLS regression (CV RSS of 57.34)!

Error Comments

  • It’s worth commenting on the comparison we just made…

  • Technically, we have not performed cross validation on our bagged trees, and yet we are comparing the (OOB) error to properly cross-validated errors for LM and CART

  • OOB estimation is a similar (and more computationally efficient) way of producing realistic long-run error estimates

  • Takeaway: For practical purposes, OOB error from bagged models can be compared to CV error from other models.

Variable Importance

  • We can obtain an overall summary of the importance of each predictor by requesting it in the fit:
beer2 <- randomForest(price~., data=beerdat, ntree=500, 
      mtry = 5, importance = TRUE)
  • Two measures of variable “importance” are provided.

  • The plot and values are produced using the following:

varImpPlot(beer2)   # plot
importance(beer2)   # values (not shown in slides)

Bagging Variable Importance Plots

Left: %IncMSE plot.  The variables are equally spaced on the y-axis in order of importance. %IncMSE is plotted on the x-axis.  Starting from the top we have bitter, malty, qlty, alc and cal, having %IncMSE of, 36.3, 15.3, 7.3, 4.4 and 2.2, respectively.  Right: IncNodePurity plot.  The variables are equally spaced on the y-axis in order of importance. IncNodePurity is plotted on the x-axis.  Starting from the top we have bitter, malty, qlty, cal and alc, having IncNodePurity of, 84.4, 24.9, 10.8, 8.6, and 6.6, respectively.

Jump to RF varImpPlot

%IncMSE

  1. Calculate the OOB MSE for the full set of \(B\) trees in your random forest, say, \(\text{MSE}_{full}\)

  2. For the variable in question, randomly shuffle (permute) its values, leaving the other columns in their original order.

  3. Carry out predictions, calculate \(\text{MSE}_{shuff}\) then check the change from the full version for the variable in question: \[\begin{align} \text{%IncMSE} = \frac{\text{MSE}_{{shuff}}-\text{MSE}_{full}}{\text{MSE}_{full}} \end{align}\]

Higher values indicate that the variable is more important.
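A simplified illustration of the permutation idea (randomForest actually does this per tree on each tree’s OOB sample and then averages/scales, so the numbers will differ; this hand-rolled sketch assumes the beer2 fit above and the bitter predictor):

set.seed(1)
mse.full  <- mean((beerdat$price - predict(beer2, newdata = beerdat))^2)
shuff <- beerdat
shuff$bitter <- sample(shuff$bitter)        # shuffle 'bitter'; other columns untouched
mse.shuff <- mean((beerdat$price - predict(beer2, newdata = shuff))^2)
(mse.shuff - mse.full) / mse.full           # relative increase in MSE for 'bitter'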

IncNodePurity

  • IncNodePurity is the total increase in node purity obtained from splitting on the variable, averaged over all trees in the forest.

  • In other words, IncNodePurity is the increase in homogeneity in the data partitions.

  • For regression, it is measured by residual sum of squares.

  • A higher increase of node purity indicates that the variable is more important.

N.B. For classification, MeanDecreaseAccuracy and MeanDecreaseGini are used in place of %IncMSE and IncNodePurity.

Prediction vs Inference Tradeoff

Cons

  • We no longer have one model, but an average of many models!
  • Bagging loses the interpretability of our model
  • It is comparatively slower than fitting a single tree

Pros

  • We get an OOB error estimate “for free”
  • Improved prediction over a single tree
  • Variable importance measures
  • Averaging prevents overfitting

Random forests

  • Bagging has a close relationship with random forests (RF)

  • Like bagging, RFs build a number of trees on bootstrapped training samples; however, each time a split in a tree is considered, only a random sample of \(m\) of the \(p\) predictors (with \(m < p\)) is considered as candidates for the split.

  • Typically we choose \(m = \sqrt{p}\) and obtain a new sample of \(m\) predictors at each split.

Motivating example

  • Suppose that there is one very strong predictor in the data set (e.g. bitter in the beer example)

  • In the collection of bagged trees, most or all of the trees will likely use this strong predictor in the top split, hence, all of the bagged trees will look quite similar to each other.

  • As discussed in our CV lecture, averaging highly correlated models does not lead to as large of a reduction in variance as averaging many uncorrelated quantities.

Why it works

  • Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

  • This decreases the dependency (or correlation) across the \(B\) trees, decreasing the variance of the averaged model (think back to LOOCV) and therefore increasing stability.

  • As with bagging, we still lose the interpretability of a single tree, but we still gain the variable importance measures.

  • Bagging is a special case of random forests when \(m = p\)

Steps for RF

  1. Create \(B\) bootstrap samples

  2. Fit a tree* to each bootstrap sample,

    * when creating nodes in each decision tree, only a random subset of features is considered for splitting at each node.

  3. Each decision tree is used to make a prediction.

  4. Aggregate the \(B\) predictions.

R function

We use the randomForest function from the randomForest package to implement bagging and RF. Usage:

randomForest(formula, data=NULL, ntree=500, mtry, importance=FALSE)
  • ntree: the number of trees to grow (default = 500)
  • mtry: the number of variables randomly selected at each node (default in classification is \(\sqrt{p}\) and \(p/3\) in regression; where \(p\) is the number of predictors).
  • importance: should importance of predictors be assessed?

Optimizing the RF

  • How do we determine how many features (mtry) to consider at each split?

  • mtry is a hyperparameter of our model and its choice can significantly impact the performance of our random forest.

  • In practice, the default value is a reasonable starting point and often works well.

  • Alternatively you could use cross-validation (or the OOB error) to choose; e.g. build random forests for a few values of mtry around the default and select the one with the lowest estimated error (a sketch follows below)
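One convenient shortcut is to use the OOB error in place of cross-validation when tuning mtry; below is a minimal sketch on the beer data (objects as defined in this lecture):

library(randomForest)
set.seed(1)
p <- 5
oob.mse <- sapply(1:p, function(m) {
  rf <- randomForest(price ~ ., data = beerdat, ntree = 500, mtry = m)
  tail(rf$mse, 1)            # OOB MSE after all 500 trees
})
(1:p)[which.min(oob.mse)]    # mtry value with the lowest OOB error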

Random Forest on beer

mtry = \(\sqrt{p}\) here (this is randomForest’s default for classification; for regression the default is \(p/3\))

(beerRF <- randomForest(price~., mtry = sqrt(5), # m = sqrt(p)
            data=beerdat, importance=TRUE))
...
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 0.8469584
                    % Var explained: 58.93
...

OOB estimate of RSS = MSE \(\times n\) = 0.847 \(\times\) 69 = 58.44

Comparison

Rank  Method       RSS estimate
1     OLS          57.34
2     RF           58.44
3     Bagging      64.44
4     Single Tree  77.57

For the beer data set, RF does better than the single tree and bagging; however, it still does not outperform OLS!

RF Variable Importance Plot

Left: %IncMSE plot.  The variables are equally spaced on the y-axis in order of importance. %IncMSE is plotted on the x-axis.  Starting from the top we have bitter, malty, qlty, cal and alc, having %IncMSE of, 26.8, 18.9, 8.5, 6.8, and 6, respectively.  Right: IncNodePurity plot.  The variables are equally spaced on the y-axis in order of importance. IncNodePurity is plotted on the x-axis.  Starting from the top we have bitter, malty, qlty, cal and alc, having IncNodePurity of, 48.4, 40.7, 19.1, 12.7, and 10.6, respectively.

Jump to Bagging varImpPlot

Classification Example

  • The smaller (more common) Italian red wine data set

  • 178 samples of wine from three varietals (Barolo, Barbera, Grignolino) grown in the Piedmont region of Italy.

  • 13 chemical measurements, i.e. \(p=13\)

  • Let’s compare a single classification tree to a tree found using bagging and random forests …

Single Decision Tree

winetree <- tree(factor(Class)~., data=wine)
plot(winetree); text(winetree)

Cost-complexity pruning

Pruned Classification Tree

Cross-validation error estimate

  • Obtain the test error (in this case the misclassification rate) found by cross-validation at the minimum value of \(\alpha\):
# winetreecv is assumed to come from the pruning step on the previous slide,
# e.g. winetreecv <- cv.tree(winetree, FUN = prune.misclass)
n <- nrow(wine)
min.idx <- which.min(winetreecv$dev)
# Misclassification rate (dev counts misclassifications when FUN = prune.misclass)
(mscr <- winetreecv$dev[min.idx]/n)
[1] 0.08426966

15/178 \(\implies\) 8.43% misclassifications in the long-run.

Single pruned tree results

Bagging wine

library(randomForest); set.seed(41351); 
winebag <- randomForest(factor(Class)~., mtry=13, 
                        data=wine, importance=TRUE)
  • Setting mtry = \(m = p = 13\) means we are doing bagging

  • The importance argument tells the algorithm that we want it to compute variable importance.

Note that we no longer have a single tree which we can plot.
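There is no plot method for the individual trees, but randomForest does let us inspect any one of them as a table via getTree(); for example:

# split variable, split point, and prediction for each node of the first tree
head(getTree(winebag, k = 1, labelVar = TRUE))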

Bagging Results

winebag

Call:
 randomForest(formula = factor(Class) ~ ., data = wine, mtry = 13,      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 13

        OOB estimate of  error rate: 3.37%
Confusion matrix:
   1  2  3 class.error
1 57  2  0  0.03389831
2  1 68  2  0.04225352
3  0  1 47  0.02083333

The OOB error rate is 3.37%, versus the CV error of 8.43% for the single pruned tree

Bagging varImpPlot wine

Left: MeanDecreaseAccuracy plot.  The variables are equally spaced on the y-axis in order of importance. MeanDecreaseAccuracy is plotted on the x-axis.  Starting from the top we have: Proline, Flavanoids, Intensity, OD280, Alcohol, Hue, Alcalinity, Magnesium, Phenols, Malic, Ash, Proanthocyanins, and Nonflavanoid, having MeanDecreaseAccuracy of, 46.4, 42.1, 34.7, 15.4, 14.2, 12.3, 7.5, 5.8, 5.8, 4.7, 3.2, 3, and 1, respectively.  Right: MeanDecreaseGini plot.  The variables are equally spaced on the y-axis in order of importance. MeanDecreaseGini is plotted on the x-axis.  Starting from the top we have: Proline, Flavanoids, Intensity, OD280, Alcohol, Hue, Phenols, Alcalinity, Magnesium, Malic, Ash, Proanthocyanins, and Nonflavanoid having MeanDecreaseGini of, 33.4, 30.8, 25.3, 13, 7, 2.9, 0.9, 0.7, 0.7, 0.7, 0.6, 0.5, and 0.1, respectively.

Jump to RF varImpPlot

Random Forests

# use default: mtry = floor(sqrt(p)) = floor(sqrt(13)) = floor(3.605551) = 3
(wineRF <- randomForest(factor(Class)~.,  # ntree = 500,
              data=wine, importance=TRUE))
...
                     Number of trees: 500
No. of variables tried at each split: 3
        OOB estimate of  error rate: 1.12%
Confusion matrix:
   1  2  3 class.error
1 59  0  0  0.00000000
2  1 69  1  0.02816901
3  0  0 48  0.00000000
...

\(\quad\) OOB/CV error: RF (1.12%) < Bagging (3.37%) < Simple Tree (8.43%)

RF varImpPlot wine

Left: MeanDecreaseAccuracy plot.  The variables are equally spaced on the y-axis in order of importance. MeanDecreaseAccuracy is plotted on the x-axis.  Starting from the top we have: Proline, Flavanoids, Intensity, OD280, Alcohol, Hue, Alcalinity, Magnesium, Phenols, Malic, Ash, Proanthocyanins, and Nonflavanoid, having MeanDecreaseAccuracy of, 25.37, 24.94, 23.62, 20.26, 18.58, 17.17, 12, 11.92, 10.11, 10.06, 9.45, 5.35, and 3.45, respectively.  Right: MeanDecreaseGini plot.  The variables are equally spaced on the y-axis in order of importance. MeanDecreaseGini is plotted on the x-axis.  Starting from the top we have: Proline, Intensity, Flavanoids, Alcohol, OD280, Hue, Phenols, Magnesium, Malic, Alcalinity, Proanthocyanins, Ash, and Nonflavanoid having MeanDecreaseGini of, 20.7, 18.09, 17.94, 14.24, 14.14, 8.5, 6.19, 3.93, 3.82, 3.36, 2.83, 1.66, and 1.05, respectively.

Jump to bagging varImpPlot