CART: Classification and Regression Trees

DATA 311: Machine Learning

Author
Affiliation

Dr. Irene Vrbik

University of British Columbia Okanagan

Published

November 1, 2025

Introduction

Goal: Learn to fit, tune, and interpret tree-based models: CART, Random Forests (RF), and Boosting.

As its name suggests, Classification and Regression Trees (CART) can be used for both classification and regression. In this lab, you’ll learn how to fit these trees in R and then explore the more recent approach of Random Forests (RF), introduced in the early 2000s.

Random forests enhance simple CART trees using bootstrapping and random feature selection, which helps them compete with more complex models. They also produce variable importance measures and use out-of-bag (OOB) observations to estimate prediction error.

Learning Outcomes

By the end of this lab, you should be able to:

  1. Fit and interpret CART models for both classification and regression problems using tree() and rpart().
  2. Explain the role of the complexity parameter (α or cp) and how it governs the bias–variance trade-off in tree models.
  3. Perform cost–complexity pruning, interpret cross-validation results, and apply the 1-SE rule to select a parsimonious model.
  4. Compare the tree and rpart implementations, understanding that
    • tree() grows a full tree and prunes using cross-validation, while
    • rpart() prunes as it grows using the cp stopping rule.
  5. Assess model performance on a test set using:
    • Misclassification rate for classification trees
    • Mean Squared Error (MSE) and (R^2) for regression trees
  6. Discuss model complexity and interpretability, articulating when a smaller, simpler tree might be preferred over one with slightly lower test error.
  7. (Preview) Transition to ensemble methods (Random Forests and Boosting) as extensions that reduce variance and improve predictive accuracy.

CART packages

Several R packages implement decision trees:

  • rpart (Therneau and Atkinson 2023) — actively maintained and integrated with modern tools such as caret and tidymodels.
  • tree (Ripley 2023) — an older implementation, similar in spirit to rpart, but less actively maintained.

A key distinction is that the tree() function from the tree package performs cost–complexity pruning via a separate function, cv.tree(), whereas rpart incorporates pruning directly through its built-in complexity parameter (cp) table.

Data

We’ll use two datasets:

  • Spam (classification): kernlab::spam. Goal: Predict whether an email is spam (type = spam/nonspam1).

  • Ozone (regression): mlbench::Ozone Goal: Predict daily maximum ozone levels from weather data (I call this variable ozone, the original data set calls it V4)

First we load the necessary libraries:

library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(tree)
library(kernlab)   # spam
library(mlbench)   # Ozone

Next we prepare our data.

Classification data (Spam)

First we load and wrangle

data(spam, package = "kernlab")               # load the data
spam <- spam %>% mutate(type = factor(type))  # ensure factor
spam

As always, check out the help file to learn more about the dataset (i.e. run2 ?spam in your Console of RStudio).

Variable names that starts with num store the frequency of the corresponding number in the e-mail. So in this case num85 indicates the number of times the number “85” occured in the email.

Now we will split into 70% training 30% testing

set.seed(57008) # for reproducibility
id_spam <- sample.int(nrow(spam), size = floor(0.7*nrow(spam)))
spam_tr <- spam[id_spam, ]
spam_te <- spam[-id_spam, ]

Regression data (Ozone)

First we load and wrangle

data(Ozone, package = "mlbench")            # load the data 
oz <- Ozone %>% as_tibble() %>% na.omit()   # drop NAs for simplicity
oz
# Rename columns in Ozone dataset
names(oz) <- c(
  "month",              # 1 = January, ..., 12 = December
  "day",                # day of month
  "day_week",           # 1 = Monday, ..., 7 = Sunday
  "ozone",              # daily max 1-hour avg ozone reading
  "pressure_height",    # 500 mb pressure height (m)
  "wind_speed",         # wind speed (mph) at LAX
  "humidity",           # humidity (%) at LAX
  "temp_sandburg",      # temperature (°F) at Sandburg
  "temp_elmonte",       # temperature (°F) at El Monte
  "inversion_height",   # inversion base height (ft) at LAX
  "pressure_gradient",  # pressure gradient (mmHg) LAX–Daggett
  "inversion_temp",     # inversion base temperature (°F) at LAX
  "visibility"          # visibility (miles) at LAX
)

Now we will split into 70% training 30% testing

set.seed(48805)
id_oz <- sample.int(nrow(oz), size = floor(0.7*nrow(oz)))
oz_tr <- oz[id_oz, ]
oz_te <- oz[-id_oz, ]
oz_tr
Warning

Both packages implement the original CART methodology introduced by Breiman et al. (Breiman et al. 1984).
However, due to differences in defaults and pruning algorithms, they can yield slightly different results — rpart typically produces deeper trees.

Both will work but for the purpose of this lab, I’ll demo the rpart package (see the ISLR2 textbook for examples using tree).

Classification

1️⃣ Fit a classification tree (Spam)

To build the tree using the defaults from both packages we use:

# Fit CART models using both packages
set.seed(2025)
tree_spam  <- tree(type ~ ., data = spam_tr)
rpart_spam <- rpart(type ~ ., data = spam_tr, method = "class")
Note

Setting a seed guarantees that any internal randomization or ordering stays consistent (e.g tie-breaking among equally good splits)

Visualizing the Tree

plot(tree_spam); text(tree_spam)
Figure 1: Classification tree obtained with the default values of tree(), spam data
plot(rpart_spam); text(rpart_spam)
Figure 2: Classification tree obtained with the default values of rpart(), spam data

In the tree plot, the vertical position (height) of each branch reflects the reduction in deviance (impurity) achieved by that split. The higher the split in the tree, the greater the impurity reduction it produced. In contrast, splits lower down in the tree correspond to smaller improvements in node purity, which is why their branches appear closer together. This can sometimes make the plot difficult to read. Luckily we have some good alternative plotting options which I will outline below.

Alternative plotting methods for tree (limited)

The options with this package are limited. To make the plot in Figure 1 more readable, I can adjust the figure dimensions. Increasing the width and height gives more space for labels and prevents overlap between text and branches.

```{r}
#| fig-height: 10
#| fig-width: 8
#| label: fig-stretched
#| fig-cap: "The `tree_spam` decision tree stretched."
plot(tree_spam); text(tree_spam)
```
Figure 3: The tree_spam decision tree stretched.

Alternative plotting methods for rpart

The text in Figure 2 looks a little squished, and some labels get cut off near the edges. To improve this, set uniform = TRUE3 to evenly space the tree vertically. Additionally, we could use xpd = TRUE to expand the plotting device and allow text to extend beyond the plot boundaries.

plot(rpart_spam, uniform=TRUE); text(rpart_spam, xpd = TRUE)
Figure 4: Classification tree obtained with the default values of rpart(), spam data. We adjusted some plotting parameters to make it more readable.

Alternative plotting using rpart.plot

If we want to get extra fancy with our plots, we could also use rpart.plot:

rpart.plot(rpart_spam, main = "Unpruned CART Tree (Spam)")
Figure 5: Classification tree obtained with the default values of rpart(), spam data. We adjusted some plotting parameters to make it more readable.

You’ll notice some additional information provided in the terminal nodes using this approach:

  • Top line: the predicted class, i.e., the majority class of observations in that node.

  • Middle number: the predicted class probability, indicating the proportion of observations in that node belonging to the predicted class.

  • Bottom percentage: the fraction of the entire training data that falls into that node

These percentages provide both a sense of how confident the model is in its prediction at each node (purity) and how large each node is relative to the total dataset.

2️⃣ Prune the tree

The complexity parameter cp in rpart() output corresponds directly to the cost–complexity pruning parameter \(\alpha\) from CART theory (Breiman et al. 1984). When building a decision tree, we’re always balancing two goals:

  1. Good fit to the training data (small error)
  2. Simple model structure (fewer splits / smaller tree)

To formalize this trade-off, CART defines the cost-complexity criterion:

\[ \begin{align*} R_\alpha(T) &= R(T) + \alpha \mid T \mid \end{align*} \tag{1}\]

where

  • \(R(T)\) is total training error

    • for regression \(R(T)\) is the residual sum or squares: \[ RSS = \sum_{m=1}^{\mid T\mid} \sum_{i: x_i \in R_m} (y_i - \hat y_i)^2 \tag{2}\]

    • for classification \(R(T)\) is either

      • the deviance:

        \[ D = -2\sum_m \sum_k n_{mk} \log(\hat{p}{mk}) \]

        where \(n_{nk}\) is the number of observations in region \(m\) belonging to class \(k\) and \(\hat p_{mk}\) is the estimated class proportion in that region, or

      • the misclassification error:

        \[ 1 - \max_k (\hat{p}_{mk}). \]

        where here \(\hat p_{mk}\) represents the proportion of training observations in the \(m^{th}\) region that are from the \(k^{th}\) class.

  • \(\mid T \mid\) is the size of the tree, i.e. total number of terminal nodes (leaves)

  • \(\alpha\) = penalty for model complexity

A large \(\alpha\) favours a smaller tree, while \(\alpha =0\) will always select the largest possible tree as being optimal. We will use cross-validation to find the optimal value of \(\alpha\) that balanced the bias-variance tradeoff.

Cost Complexity Pruning

As we discussed in class, this method is prone to overfit. Hence the tree_spam fit plotted in Figure 1 is likely too bushy. The function cv.tree() preforms cross-valudation in order to determine the optimal value for \(\alpha\) which in turn determines how much we need to prune back our tree (if at all)!

We use the argument FUN = prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance

We can either use classification error rate or deviance to guide the cross-validation pruning process. The latter is achieved, by using the argument FUN = prune.misclass, the foremer is the default. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to \(\log(\alpha)\)).

set.seed(345) # set seed to make work reproducible
tree.cv.miscl <- cv.tree(tree_spam, FUN = prune.misclass) # default is 10-fold 
tree.cv.dev <- cv.tree(tree_spam) # default is 10-fold 
plot(tree.cv.miscl, type="b") # using misclassication rate

plot(tree.cv.dev, type="b")   # uses deviance

In both casese, the minimum cv-error is achieved at 16. This indicates we do not need to prune. PS, to obtain this answer grammatically rather than reading it off the plot (which can sometimes not be obvious, you could use:

tree.cv.dev$size[which.min(tree.cv.dev$dev)]
[1] 16

Test Error

We can now see how well tree performs on the test set using the predict() function.

yhat <- predict(tree_spam,  spam_te, type = "class") # 
head(yhat) # print the first 6 predictions
[1] spam spam spam spam spam spam
Levels: nonspam spam
tail(yhat)# print the last 6 predictions
[1] nonspam nonspam nonspam nonspam nonspam nonspam
Levels: nonspam spam

Note we use type = "class" to get a class label for each observation. Otherwise, the default will provide a matrix of probabilities..

head(predict(tree_spam,  spam_te))
       nonspam      spam
4  0.046979866 0.9530201
6  0.000000000 1.0000000
9  0.007827789 0.9921722
10 0.140540541 0.8594595
22 0.140540541 0.8594595
25 0.007827789 0.9921722
Warning

If you re-run the predict() function then you might get slightly different results, due to “ties” (e.g. if there are an even split between classes in a terminal node).

As we have done previously for classification problems, we could calculate various metrics from the confusion matrix:

confusion_matrix = table(spam_te$type, yhat) 
confusion_matrix
         yhat
          nonspam spam
  nonspam     761   70
  spam         60  490

For example, the misclassification, test error is:

\[ \frac{70+60}{1381} = \frac{130}{1381} = 0.0941347 \]

rpart’s Implementation: The cp Parameter

Instead of using the \(\alpha\) parameter as specified in Equation 1, rpart uses the complexity parameter cp. You can think of this as a scaled unitless version of \(\alpha\) that plays a similar role in the following equation:

\[ R_{ct}(T) = R(T) + \texttt{cp} \times \mid T \mid \times R(T_1) \] where \(R(T)\) is the same as defined before, and \(T_1\) is the tree with no splits. In this formulation,

  • cp=1 will result in a tree with no splits, since any split must reduce error by more than 100% of the root error (impossible).
  • Very small cp values allow many splits and may overfit.
  • Very large cp values restrict splitting and may underfit.

By default, rpart() sets cp = 0.01, meaning a split must improve the relative fit by at least 1% to be attempted. Hence, rather than growing a large tree and pruning it afterward, rpart uses this stopping rule during tree construction to prevent the tree from growing too busy.

We can instead use cross-validation to find the optimal cp — just as we would choose \(\alpha\) in cost–complexity pruning. This will first involve overwritting that default for cp and changing the default of minsplit (the minimum number of observations that must exist in a node in order for a split to be attempted), from 20 to 2.

set.seed(601334)
treeMax <- rpart(type ~ ., data = spam_tr, method = "class", minsplit = 2, cp = 0)

This will grow a really bushy tree. I will only plot the skeleton since it will be unreadable with text:

plot(treeMax, uniform=TRUE);

To see if pruning is needed let’s investigate the cross-validated error as a function of the complexity parameter cp. We can visualizes the cost–complexity pruning path for the fitted rpart model, showing how tree size and model error trade off across different cp values:

plotcp(treeMax)
Figure 6

Figure 6 plots:

  • X-val Relative Error (Y-axis): This represents the relative error obtained during cross-validation. Lower values indicate a better fit to the data.
  • Tree Size (Top Axis): The top axis indicates the number of splits in the tree. The more splits, the larger the tree. For example, a tree size of 3 corresponds to a tree with three splits (four terminal nodes).
  • Complexity Parameter (cp, X-axis): cp controls the complexity of the tree. Smaller values of cp allow more splits, resulting in a larger tree. As cp decreases from left to right, the tree becomes more complex.
  • Error Bars: The vertical lines through the circles represent one standard error above and below the relative error.
Code
cpOpt <- treeMax$cptable[which.min(treeMax$cptable[, 4]), 1]
treeOpt <- prune(treeMax, cp = cpOpt)
plot(treeOpt)
text(treeOpt, xpd = TRUE, cex = 0.8)

Although there are larger trees that obtain smaller errors (the optimal size tree is 59 terminal nodes), it is not reducing the error much more than say the the 65 terminal node model. Breiman (1984) suggested that in actual practice, it’s common to instead use the smallest tree within 1 standard error (SE) of the minimum CV error (this is called the 1-SE rule).

The error bars help to identify a “simpler” tree using the “1-SE rule”: the smallest tree with an error within one standard error of the minimum error.

  • Dotted Line: The dotted line indicates the minimum cross-validated relative error plus one standard error. This is used in the “1-SE rule” to choose a tree size that balances error minimization and simplicity.

By relaxing a little the condition of minimizing the generalization error by applying the “1-SE rule” of Breiman, we obtain a more parsimonious tree:

thres1SE <- sum(treeMax$cptable[which.min(treeMax$cptable[, 4]), 4:5])
cp1SE <- treeMax$cptable[min(which(treeMax$cptable[, 4] <= thres1SE)), 1]
tree1SE <- prune(treeMax, cp = cp1SE)
rpart.plot(tree1SE)
plot(tree1SE, uniform = TRUE)
text(tree1SE, xpd = TRUE, cex = 0.8)
Figure 7: The 1-SE pruned tree applied to the spam data
Figure 8: The 1-SE pruned tree applied to the spam data

The misclassication on the test set for this model is

yhat1SE <- predict(tree1SE,  spam_te, type = "class") # 
conf_mat1SE = table(spam_te$type, yhat) 
conf_mat1SE
         yhat
          nonspam spam
  nonspam     761   70
  spam         60  490

\[ \frac{70+60}{1381} = \frac{130}{1381} = 0.0941347 \]

Regression

Now let’s try this technique for regression

# Fit a regression tree (CART) with tree::
set.seed(311)
tree_oz <- tree(ozone ~ ., data = oz_tr)  # method chosen by response type (numeric)

# Quick look
summary(tree_oz)

Regression tree:
tree(formula = ozone ~ ., data = oz_tr)
Variables actually used in tree construction:
[1] "temp_elmonte" "day"          "month"       
Number of terminal nodes:  9 
Residual mean deviance:  9.482 = 1261 / 133 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -8.25   -2.00   -0.25    0.00    1.75   10.00 
plot(tree_oz); text(tree_oz, cex = 0.8)

Cross-validation & pruning (cost–complexity) For regression, cv.tree() uses deviance = RSS.

set.seed(310)
oz_cv <- cv.tree(tree_oz)       # 10-fold CV by default; dev = RSS
plot(oz_cv, type = "b")

best_size <- oz_cv$size[which.min(oz_cv$dev)]
best_size
[1] 3

In this case, cv is suggesting that the optimal tree is the one with 3 terminal nodes. We can prune back the tree using:

# Prune to the CV-selected size and refit
tree_oz_pruned <- prune.tree(tree_oz, best = best_size)
plot(tree_oz_pruned); text(tree_oz_pruned, cex = 0.8)

Test performance

oz_pred_tree <- predict(tree_oz_pruned, newdata = oz_te)
tree_mse <- mean((oz_te$ozone - oz_pred_tree)^2)
tree_mse
[1] 34.95427

Let’s compare it to the one made with default rpart (cp = 0.01)

rpart_oz_max <- rpart(ozone ~ ., data = oz_tr)
plot(rpart_oz_max, uniform=TRUE); text(rpart_oz_max)

oz_pred_rpart <- predict(rpart_oz_max, newdata = oz_te)
rpart_mse <- mean((oz_te$ozone - oz_pred_rpart)^2)
rpart_mse
[1] 36.8671

So it seems the simpler model obtained by tree() with cross-complexity pruning achieves slightly lower MSE.

References

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth; Brooks/Cole.
Ripley, Brian. 2023. Tree: Classification and Regression Trees. https://CRAN.R-project.org/package=tree.
Therneau, Terry, and Beth Atkinson. 2023. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.

Footnotes

  1. sometimes “nonspam” is referred to as “ham”↩︎

  2. make sure you’ve installed and loaded the kernlab library first. Alternatively, you could run help(spam, package = "kernlab")↩︎

  3. By default the vertical positions of nodes are derived from their deviances. The uniform = TRUE plot option is not available for the tree() outputs.↩︎