Lecture 7: Assessing Classification Models
University of British Columbia Okanagan
Today we will be looking at different ways of assessing models in the classification setting.
Let’s first consider a binary response in which two types of errors can be made. For ease of explanation, let’s say we are trying to predict whether or not someone has a disease.
This is very easily understood via a classification table, AKA a confusion matrix.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) = True Positive | \(b\) = False Negative |
| Actual No | \(c\) = False Positive | \(d\) = True Negative |
\[ \text{Misclassification Rate} = \frac{b+c}{a + b + c+d} = \frac{b+c}{n} \]
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) = True Positive | \(b\) = False Negative |
| Actual No | \(c\) = False Positive | \(d\) = True Negative |
\[ \text{Classification Accuracy} = \frac{a+d}{a + b + c+d} = \frac{a+d}{n} \]
\[ \begin{gather} \text{Misclassification rate + Classification accuracy}\\ =\frac{b+c}{n} + \frac{a+d}{n} = \frac{n}{n} = 1 \end{gather} \]
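As a quick illustration, here is a minimal Python sketch (my own, not part of these notes) computing both quantities from the four cells of a 2×2 confusion table, using the counts from the first example table below:

```python
# Minimal sketch (not from the notes): accuracy and misclassification rate
# computed from the four cells of a 2x2 confusion table.
a, b = 87, 1   # true positives, false negatives
c, d = 10, 2   # false positives, true negatives

n = a + b + c + d
accuracy = (a + d) / n           # 0.89
misclassification = (b + c) / n  # 0.11

print(accuracy, misclassification)
print((a + d) + (b + c) == n)    # True: the two rates always sum to 1
```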
While these can easily be extended to multi-class (>2 groups) scenarios, you may run into issues in terms of interpretability …
Is a model that provides 0.9 classification accuracy good?
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 87 | 1 |
| Actual No | 10 | 2 |
Classification accuracy =\(\frac{89}{100}\)
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 9 | 0 |
| Actual No | 1 | 0 |
Classification accuracy =\(\frac{9}{10}\)
For unbalanced (AKA imbalanced) classes, high classification accuracy (equivalently, a low misclassification rate) can be deceiving.
While these measures are easily extended to multi-class scenarios, the problem with unbalanced classes remains.
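To see this concretely, here is a short sketch (my own illustration, assuming scikit-learn is installed) in which a trivial classifier that always predicts the majority class already attains high accuracy on an unbalanced sample with class proportions like the example above (88 Yes, 12 No):

```python
# Sketch (not from the notes): on unbalanced data, always predicting the
# majority class already yields high accuracy, despite having no skill.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # features are irrelevant to this baseline
y = np.array([1] * 88 + [0] * 12)    # 88 "Yes", 12 "No": unbalanced classes

trivial = DummyClassifier(strategy="most_frequent").fit(X, y)
print(trivial.score(X, y))           # 0.88 -- high accuracy, zero insight
```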
Question: Is a 0.98 classification accuracy inherently “better” than 0.82?
Classifier 1
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 87 | 1 |
| Actual No | 1 | 11 |
Classification accuracy =\(\frac{98}{100}\)
Classifier 2
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 79 | 9 |
| Actual No | 9 | 3 |
Classification accuracy =\(\frac{82}{100}\)
In this case yes! Classifier 1 \(>\) Classifier 2
Classifier 1
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 88 | 0 |
| Actual No | 2 | 10 |
Classification accuracy =\(\frac{98}{100}\)
Classifier 2
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 70 | 18 |
| Actual No | 1 | 12 |
Classification accuracy =\(\frac{82}{101}\)
It depends…
If it's very important to correctly classify NOs, we may prefer Classifier 2 over Classifier 1.
If it's very important to correctly classify YESs, we may prefer Classifier 1 over Classifier 2.
These concerns occur for continuous responses as well!
By minimizing the MSE (or RSS) we attempt to avoid ANY large mistakes in prediction as best we can.
In some scenarios, one might prefer to minimize the mean absolute error (MAE), which penalizes large mistakes less heavily and so could select a model that makes a few more big mistakes in exchange for smaller errors elsewhere!
That choice is, of course, application dependent and involves other (mathematical/statistical) considerations.
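As a small illustration (with made-up error values, not from the notes), MSE and MAE can rank the same two models differently:

```python
# Illustrative sketch: MSE and MAE can disagree on which model is "better".
import numpy as np

errors_A = np.array([1.0] * 9 + [4.0])  # mostly small errors, one large one
errors_B = np.array([1.5] * 10)         # uniformly moderate errors

print(np.mean(errors_A**2), np.mean(errors_B**2))            # 2.5  2.25 -> MSE prefers B
print(np.mean(np.abs(errors_A)), np.mean(np.abs(errors_B)))  # 1.3  1.5  -> MAE prefers A
```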
We may also consider weighted metrics, wherein the classes are weighted differently in order to assign more “cost” to misclassifying one group than the other.
Some software may be adjusted to incorporate these weights during the actual model-fitting process as well.
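For example (an illustration of one possible tool, not something prescribed in these notes), scikit-learn's classifiers accept a class_weight argument that re-weights the classes during fitting:

```python
# Sketch (assuming scikit-learn): weight the classes so that misclassifying
# a "No" (class 0) costs five times as much as misclassifying a "Yes" (class 1).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight={0: 5.0, 1: 1.0})
# clf.fit(X_train, y_train)  # X_train, y_train are placeholders for your own data
```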
Another common approach is to consider metrics that may provide more insight than simply accuracy and misclassification rates alone.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Precision} = \frac{a}{a+c} \]
The proportion of predicted “Yes”s that are actually “Yes”.
Higher precision suggests that the model is good at avoiding false positives: it focuses on making accurate positive predictions (i.e., the costs incurred from false positives are kept relatively low or manageable).
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Recall} = \frac{a}{a+b} \]
The proportion of actual “Yes”s that were predicted “Yes” (recall is also known as sensitivity).
Higher recall suggests that the model is good at avoiding false negatives.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Specificity} = \frac{d}{c+d} \]
The proportion of actual “No”s that were predicted “No”.
Specificity focuses on avoiding false positives and is concerned with correctly identifying all negative instances.
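Here is a minimal sketch (not from the notes) writing the three primary measures directly in terms of the cells \(a\) (TP), \(b\) (FN), \(c\) (FP), \(d\) (TN), evaluated on the classifier with TP = 87, FN = 1, FP = 2, TN = 10 that appears later in these slides:

```python
# Sketch: the three primary measures as functions of the confusion-matrix cells.
def primary_measures(a, b, c, d):
    precision = a / (a + c)      # of the predicted "Yes"s, how many are actually "Yes"
    recall = a / (a + b)         # of the actual "Yes"s, how many were predicted "Yes"
    specificity = d / (c + d)    # of the actual "No"s, how many were predicted "No"
    return precision, recall, specificity

print(primary_measures(87, 1, 2, 10))  # (0.978, 0.989, 0.833), rounded
```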
So of the primary measures, if, say, it were costly to miss out on identifying a TRUE case, we might focus on maximizing Recall/Sensitivity while paying “less” attention to Specificity.
If it were costly to miss out on identifying FALSE cases, then we might focus on the opposite (maximizing Specificity).
In some cases we might do exactly that, but in most cases it will still be useful to have a single measure that is not quite as flawed as the (mis)classification rates alone.
Secondary Measures are ways to combine the information from multiple primary measures.
Let’s look at a motivating example …
Consider a classification task with the following results:
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 10 | 0 |
| Actual No | 90 | 0 |
\[ \begin{align} \text{Precision} &= \frac{10}{10+90} = 0.1\\ \text{Recall} &= \frac{10}{10} = 1.0 \end{align} \]
Let’s consider aggregating the information summarized in precision and recall by taking a simple average:
\[ \frac{\text{Precision} + \text{Recall}}{2} = \frac{0.1+1}{2} = 0.55 \]
0.55 seems too good for a trivial classifier guessing only the minority class!
A popular secondary measure is the F1 score. It is the harmonic mean of Precision and Recall.
\[ \text{F1} = \dfrac{2 \times \text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}} \]
Lots of work has shown that F1 is more reliable than mis/classification rates for summarizing performance on unbalanced data sets.
We can view F1 as a “goodness” measure for our classifier: it lies between 0 and 1, with higher values indicating better performance.
Example 3’s score is \(\text{F1} = \dfrac{2\times 0.1 \times 1}{1.1} = 0.182\)
For multi-class problems, F1 (and other secondary measures) is usually computed for each class, and the per-class values are then sometimes averaged or summarized via some other measure.
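The sketch below (my own, not from the notes) recomputes the motivating example: the harmonic mean (F1) is far less generous than the simple average for the trivial classifier with precision 0.1 and recall 1.0.

```python
# Sketch: F1 (harmonic mean) vs. the simple average of precision and recall
# for the trivial classifier from the motivating example.
precision, recall = 0.1, 1.0

simple_average = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(simple_average)  # 0.55   -- looks deceptively respectable
print(f1)              # 0.1818 -- much closer to how poor the classifier is

# For multi-class problems, scikit-learn's f1_score(y_true, y_pred, average="macro")
# computes a per-class F1 and averages the results.
```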
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (87) | FN (1) |
| Actual No | FP (1) | TN (11) |
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (79) | FN (9) |
| Actual No | FP (9) | TN (3) |
Classifier 1

|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (87) | FN (1) |
| Actual No | FP (2) | TN (10) |
Classifier 2

|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (70) | FN (18) |
| Actual No | FP (1) | TN (12) |
| Metric | Classifier 1 | Classifier 2 |
|---|---|---|
| Precision | 0.978 | 0.986 (wins) |
| Recall | 0.989 (wins) | 0.795 |
| Specificity | 0.833 | 0.923 (wins) |
| Accuracy | 0.970 (wins) | 0.812 |
| F1 | 0.983 (wins) | 0.880 |
Recall is good to use when the “cost” of a FN is high.
Precision is good to use when the “cost” of a FP is high
Specificity is good to use when the “cost” of a FP is high AND you want to capture all true negatives.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (9) | FN (0) |
| Actual No | FP (1) | TN (0) |
All positive examples are predicted as positive. But none of the negative examples is predicted as negative.
When to use classification accuracy over F1: when the classes are roughly balanced and false positives and false negatives carry similar costs.
When to use F1-score over accuracy: when the classes are unbalanced, or when performance on the positive class (avoiding both false positives and false negatives) is the primary concern.
Most classification models provide probabilistic responses for each class, which can be incorporated into useful metrics
Notation-wise, let’s use \(z_{ig}\) to denote the probability that observation \(i\) belongs to group/class \(g\).
So for each observation \(i\) we would have a vector
\[ z_i = (z_{i1}, z_{i2}, \dots, z_{iG}) \]
where \(G\) denotes the total number of groups.
If observation 1 belongs to class 3 we might have, for example:
Model 1: \(z_1 = (0.10, 0.30, 0.60)\)
Model 2: \(z_1 = (0.05, 0.05, 0.90)\)
Which model is better?
The classification accuracy would be the same but Model 2 might be deemed better since it is more certain about that correct classification.
One popular probability-based error measure, which is also easily defined in this multi-class scenario, is logloss.
Logloss is defined by \[-\frac{1}{n} \sum_{i=1}^{n} \sum_{g=1}^G I(y_i = g) \log z_{ig}\]
So for our previous example…where obs 1 belongs to class 3
| Model | \(z_1\) | logloss |
|---|---|---|
| 1 | (0.10, 0.30, 0.60) | \(-\log(0.60) = 0.511\) |
| 2 | (0.05, 0.05, 0.90) | \(-\log(0.90) = 0.105\) |
| 3 | (0.55, 0.44, 0.01) | \(-\log(0.01) = 4.605\) |
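A quick sketch (my own, not from the notes) verifying these single-observation contributions, with a pointer to scikit-learn's log_loss for full data sets:

```python
# Sketch: log loss contribution of observation 1 (true class = 3) for each model.
import numpy as np

z1 = {
    "Model 1": np.array([0.10, 0.30, 0.60]),
    "Model 2": np.array([0.05, 0.05, 0.90]),
    "Model 3": np.array([0.55, 0.44, 0.01]),
}
true_class = 2  # class 3, zero-indexed

for name, probs in z1.items():
    print(name, round(-np.log(probs[true_class]), 3))  # 0.511, 0.105, 4.605

# sklearn.metrics.log_loss(y_true, prob_matrix) averages this over all n observations.
```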
From the example it’s clear that logloss does not have a simple upper bound.
In fact, since \(-\log(0) = \infty\), it is technically unbounded above. The lower bound (a perfect probabilistic classifier) would be \(-\log(1) = 0\) for each observation.
This means that even just one highly confident misclassification is heavily penalized by this metric.
For measuring classification performance, it is generally good practice to report the classification table alongside any chosen metrics
This can help flag any strange results that might be missed by looking at just summary metrics.