Data 311: Machine Learning

Lecture 7: Assessing Classification Models

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • Today we will be looking at different ways of assessing models in the classification setting.

  • Let’s first consider a binary response in which two types of errors can be made. For ease of explanation, let’s say we are trying to predict whether or not someone has a disease.

  • This is very easily understood via a classification table, AKA a confusion matrix

Misclassification Rate

Predicted Yes Predicted No
Actual Yes \(a\) = True Positive \(b\) = False Negative
Actual No \(c\) = False Positive \(d\) = True Negative

\[ \text{Misclassification Rate} = \frac{b+c}{a + b + c+d} = \frac{b+c}{n} \]

Classification Accuracy

Predicted Yes Predicted No
Actual Yes \(a\) = True Positive \(b\) = False Negative
Actual No \(c\) = False Positive \(d\) = True Negative

\[ \text{Classification Accuracy} = \frac{a+d}{a + b + c+d} = \frac{a+d}{n} \]
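To make these definitions concrete, here is a minimal Python sketch (scikit-learn, with made-up toy labels) that builds the confusion matrix and computes both quantities:

```python
# Minimal sketch (made-up toy labels): build the confusion matrix and compute
# classification accuracy and the misclassification rate from it.
from sklearn.metrics import confusion_matrix, accuracy_score

y_actual = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"]
y_pred   = ["Yes", "No",  "Yes", "No", "Yes", "Yes", "No", "Yes"]

# With labels=["Yes", "No"], rows are actual Yes/No and columns are
# predicted Yes/No, matching the a, b, c, d layout of the tables above.
cm = confusion_matrix(y_actual, y_pred, labels=["Yes", "No"])
(a, b), (c, d) = cm

n = a + b + c + d
accuracy = (a + d) / n            # agrees with accuracy_score(y_actual, y_pred)
misclassification = (b + c) / n   # = 1 - accuracy

print(cm)
print(accuracy, accuracy_score(y_actual, y_pred), misclassification)
```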

Classification Performance

Binary

\[ \begin{gather} \text{Misclassification rate + Classification accuracy}\\ =\frac{b+c}{n} + \frac{a+d}{n} = \frac{n}{n} = 1 \end{gather} \]

While these can easily be extended to multi-class (>2 groups) scenarios, you may run into issues in terms of interpretability …

Unbalanced Example

Is a model that provides 0.9 classification accuracy good?

Predicted
Yes No
Actual Yes 87 1
Actual No 10 2

Classification accuracy =\(\frac{89}{100}\)

  • Only 2 of the 12 observations that actually belong to the “no” group were classified correctly
  • The error rate for the “no” class is 10/12, or about 83%

Null Classifier

Predicted
Yes No
Actual Yes 9 0
Actual No 1 0

Classification accuracy =\(\frac{9}{10}\)

  • Consider this null classifier that assigns everyone to the “Yes” group
  • While it has high classification accuracy, it would not be very helpful! (see the sketch below)
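A null classifier like this can be sketched in a few lines, e.g. with scikit-learn’s DummyClassifier (the 9-to-1 data below mirrors the toy table above):

```python
# Sketch of a null classifier: always predict the majority class ("Yes").
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y = np.array(["Yes"] * 9 + ["No"])   # 9 "Yes", 1 "No", as in the table above
X = np.zeros((10, 1))                # dummy feature matrix (features are ignored)

null_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = null_clf.predict(X)

print(accuracy_score(y, y_pred))     # 0.9 -- high accuracy, but the model is useless
```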

Problems with misclassification rate

  • For unbalanced (AKA imbalanced) classes, high classification accuracy (equivalently, low misclassification rates) can be deceiving.

  • While these measures are easily extended to multi-class scenarios, the problem with unbalanced classes remains.

Question: Is a 0.98 classification accuracy inherently “better” than 0.82?

Example 1

Classifier 1

Predicted
Yes No
Actual Yes 87 1
Actual No 1 11

Classification accuracy =\(\frac{98}{100}\)

Classifier 2

Predicted
Yes No
Actual Yes 79 9
Actual No 9 3

Classification accuracy =\(\frac{82}{100}\)

In this case yes! Classifier 1 \(>\) Classifier 2

Example 2

Classifier 1

Predicted
Yes No
Actual Yes 88 0
Actual No 2 10

Classification accuracy =\(\frac{98}{100}\)

Classifier 2

Predicted
Yes No
Actual Yes 70 18
Actual No 1 12

Classification accuracy =\(\frac{82}{101}\)

It depends…

Example 2 Preference

If it’s very important to correctly classify NOs, we may prefer Classifier 2 over Classifier 1

  • e.g. flagging email as spam

If it’s very important to correctly classify YESs, we may prefer Classifier 1 over Classifier 2

  • e.g. identifying people who have a deadly disease that requires treatment

Aside

These concerns occur for continuous responses as well!

  • By minimizing the MSE (or RSS) we attempt to avoid ANY large mistakes in prediction as best we can.

  • In some scenarios, one might prefer to minimize the mean absolute error (MAE), which could select a model whose errors are smaller most of the time but that tolerates the occasional large one!

  • That choice is, of course, application dependent and involves other (mathematical/statistical) considerations; the toy comparison below illustrates the difference.
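A tiny made-up numerical comparison illustrates the point: the two criteria can rank the same pair of models differently.

```python
# Toy comparison (made-up residuals): MSE and MAE can prefer different models.
import numpy as np

errors_A = np.array([2.0] * 10)           # Model A: ten moderate errors
errors_B = np.array([0.5] * 9 + [10.0])   # Model B: nine tiny errors, one big one

print(np.mean(errors_A**2), np.mean(errors_B**2))            # MSE: 4.0 vs 10.225 -> prefers A
print(np.mean(np.abs(errors_A)), np.mean(np.abs(errors_B)))  # MAE: 2.0 vs 1.45   -> prefers B
```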

Back to classification …

  • We may also consider weighted metrics wherein the classes are weighted differently in order to emphasize more “cost” to misclassification of one group over the other.

  • Some software may be adjusted to incorporate these weights during the actual model-fitting process as well (see the sketch after this list).

  • Another common approach is to consider metrics that may provide more insight than simply accuracy and misclassification rates alone.
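As a rough sketch of the weighting idea, scikit-learn exposes a class_weight argument in many classifiers; the data and the 1-to-10 cost ratio below are made-up assumptions:

```python
# Sketch: incorporating class weights during model fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up unbalanced data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Treat a misclassified "1" as ten times as costly as a misclassified "0";
# class_weight="balanced" would instead weight inversely to class frequency.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
```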

Primary Measures

Precision

Predicted
Yes No
Actual Yes a b
Actual No c d

\[ \text{Precision} = \frac{a}{a+c} \]

The proportion of predicted “Yes”s that are actually “Yes”.

Higher precision suggests that the model is good at avoiding false positives. It focuses on making accurate positive predictions, i.e. keeping the number (and cost) of false positives relatively low.

Recall/Sensitivity/TP Rate

Predicted
Yes No
Actual Yes a b
Actual No c d

\[ \text{Recall} = \frac{a}{a+b} \]

The proportion of actual “Yes”s that were predicted “Yes”.

Higher recall suggests that the model is good at avoiding false negatives.

Specificity (TN rate)

Predicted
Yes No
Actual Yes a b
Actual No c d

\[ \text{Specificity} = \frac{d}{c+d} \]

The proportion of actual “No”s that were predicted “No”

Specificity focuses on avoiding false positives, and is concerned with correctly identifying all negative instances.
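Continuing the toy-label sketch from earlier, the three primary measures can be computed as follows (scikit-learn has no dedicated specificity function, so it is obtained as the recall of the negative class):

```python
# Sketch: precision, recall, and specificity for a binary problem (toy labels).
from sklearn.metrics import precision_score, recall_score

y_actual = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"]
y_pred   = ["Yes", "No",  "Yes", "No", "Yes", "Yes", "No", "Yes"]

precision   = precision_score(y_actual, y_pred, pos_label="Yes")   # a / (a + c)
recall      = recall_score(y_actual, y_pred, pos_label="Yes")      # a / (a + b)
specificity = recall_score(y_actual, y_pred, pos_label="No")       # d / (c + d)

print(precision, recall, specificity)
```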

Which to prioritize?

  • So of the primary measures, if, say, it were costly to miss out on identifying a TRUE case, we might focus on maximizing Recall/Sensitivity while paying “less” attention to Specificity.

  • If it were costly to miss out on identifying FALSE cases, then we might focus on the opposite (maximizing Specificity).

  • In some cases, we might do exactly the above. But in most cases, it will still be useful to have a single measure that is not quite as flawed as (mis)classification rates.

Secondary Measures

  • Secondary Measures are ways to combine the information from multiple primary measures.

  • Let’s look at a motivating example …

Example 3

Consider a classification task with the following results:

Predicted
Yes No
Actual Yes 10 0
Actual No 90 0

\[ \begin{align} \text{Precision} &= \frac{10}{10+90} = 0.1\\ \text{Recall} &= \frac{10}{10} = 1.0 \end{align} \]

Example 3: Primary Measures

  • Let’s consider aggregating the information summarized in precision and recall by taking a simple average:
    \[ \begin{align} \frac{\text{Precision} + \text{Recall}}{2} = \frac{0.1+1}{2} = 0.55 \end{align} \]

  • 0.55 seems too good for a trivial classifier guessing only the minority class!

F1

A popular secondary measure is the F1 score. It is the harmonic mean between Precision and Recall.

\[ \text{F1} = \dfrac{2 \times \text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}} \]

Lots of work has shown that F1 is more reliable than mis/classification rates for summarizing performance on unbalanced data sets.

Example 3: F1 score

  • We can view F1 as a “goodness” measure for our classifier (higher is better).

  • Example 3’s score is \(\text{F1} = \dfrac{2\times 0.1 \times 1}{1.1} = 0.182\)

  • For multiclass problems, F1 (and other secondary measures) are usually computed for each class (and sometimes then averaged, or summarized via some other measure)
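A quick sketch of both cases with scikit-learn’s f1_score (the multiclass labels are made up):

```python
# Sketch: F1 in the binary case, and per-class / macro-averaged F1 in a multiclass case.
from sklearn.metrics import f1_score

# Binary (toy labels as before)
y_actual = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"]
y_pred   = ["Yes", "No",  "Yes", "No", "Yes", "Yes", "No", "Yes"]
print(f1_score(y_actual, y_pred, pos_label="Yes"))

# Multiclass (made-up labels): one F1 per class, then an unweighted (macro) average
y_true_mc = [1, 2, 3, 3, 2, 1, 3, 1, 2, 3]
y_pred_mc = [1, 2, 3, 1, 2, 1, 3, 2, 2, 3]
print(f1_score(y_true_mc, y_pred_mc, average=None))      # F1 for each class
print(f1_score(y_true_mc, y_pred_mc, average="macro"))   # mean of the per-class scores
```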

Formula Summary

Predicted
Yes No
Yes TP (87) FN (1)
No FP (1) TN (11)

rows indicate actual Y/N

\[\begin{align} \text{Precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\\text{Recall} & = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \text{Specificity} & = \frac{\text{TN}}{\text{FP} + \text{TN}} \end{align}\]
\[\begin{align} \text{Accuracy} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \\[0.5em] \text{F1} &= \frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision} + \text{Recall}} \end{align}\]
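The summary formulas translate directly into a small helper function; as a check, plugging in the counts from Example 1’s Classifier 1 (next slide) reproduces the values reported there:

```python
# Sketch: the summary formulas as a helper function for a 2x2 confusion table.
def binary_metrics(tp, fn, fp, tn):
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy    = (tp + tn) / (tp + fn + fp + tn)
    f1          = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}

# Example 1, Classifier 1: TP = 87, FN = 1, FP = 1, TN = 11
print(binary_metrics(tp=87, fn=1, fp=1, tn=11))
# precision ~ 0.989, recall ~ 0.989, specificity ~ 0.917, accuracy = 0.98, f1 ~ 0.989
```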

Example 1: Classifier 1

Predicted
Yes No
Yes TP (87) FN (1)
No FP (1) TN (11)
\[\begin{align} \text{Precision} & = \frac{87}{87+1} = 0.989\\ \text{Recall} &= \frac{87}{87+1} = 0.989 \\ \text{Specificity} &= \frac{11}{1+11} = 0.917 \\ \end{align}\]
\[\begin{align} \text{Accuracy} &= \frac{87+11}{87+1+1+11} = 0.98\\[0.5em] \text{F1} &= \frac{2(0.989)(0.989)}{0.989 + 0.989} = 0.989 \end{align}\]

Example 1: Classifier 2

Predicted
Yes No
Yes TP (79) FN (9)
No FP (9) TN (3)
\[\begin{align} \text{Precision} & = \frac{79}{79+9} = 0.898\\ \text{Recall} &= \frac{79}{79+9} = 0.898 \\ \text{Specificity} &= \frac{3}{9+3} = 0.25 \\ \end{align}\]
\[\begin{align} \text{Accuracy} &= \frac{79+3}{79+9+9+3} = 0.82\\[0.5em] \text{F1} &= \frac{2(0.898)(0.898)}{0.898 + 0.898} = 0.898 \end{align}\]

Example 2: Classifier 1

Predicted
Yes No
Yes TP (87) FN (1)
No FP (2) TN (10)
\[\begin{align} \text{Precision} & = \frac{87}{87+2} = 0.978\\ \text{Recall} &= \frac{87}{87+1} = 0.989 \\ \text{Specificity} &= \frac{10}{2+10} = 0.833 \end{align}\]
\[\begin{align} \text{Accuracy} &= \frac{87+10}{87+1+2+10} = 0.97\\[0.5em] \text{F1} &= \frac{2(0.978)(0.989)}{0.978 + 0.989} = 0.983 \end{align}\]

Example 2: Classifier 2

Predicted
Yes No
Yes TP (70) FN (18)
No FP (1) TN (12)
\[\begin{align} \text{Precision} & = \frac{70}{70+1} = 0.986\\ \text{Recall} &= \frac{70}{70+18} = 0.795 \\ \text{Specificity} &= \frac{12}{1+12} = 0.923 \end{align}\]
\[\begin{align} \text{Accuracy} &= \frac{70+12}{70+18+1+12} = 0.812\\[0.5em] \text{F1} &= \frac{2(0.986)(0.795)}{0.986 + 0.795} = 0.880 \end{align}\]

Example 2: Comparison of Classifiers

Metric Classifier 1 Classifier 2 Winner
Precision 0.978 0.986 Classifier 2
Recall 0.989 0.795 Classifier 1
Specificity 0.833 0.923 Classifier 2
Accuracy 0.970 0.812 Classifier 1
F1 0.983 0.880 Classifier 1

When to use what?

Recall is good to use when the “cost” of a FN is high.

  • e.g. “High priority” tickets - we’d rather get some extra false positives (false alarms) and move them to medium/low priority. Hence FPs are far more acceptable than FNs.

Precision is good to use when the “cost” of a FP is high

  • e.g. spam detection - we want to be very confident when labeling something as spam. We’d rather get a FN than a FP.

High Precision/Recall, Low Specificity

Specificity is good to use when the “cost” of a FP is high AND you want to capture all true negatives.

Null Classifier

Predicted
Yes No
Yes TP (9) FN (0)
No FP (1) TN (0)
\[\begin{align} \text{Precision} & = \frac{9}{9+1} = 0.9\\ \text{Recall} &= \frac{9}{9+0} = 1 \\ \text{Specificity} &= \frac{0}{1+0} = 0 \end{align}\]

All positive examples are predicted as positive. But none of the negative examples is predicted as negative.

Take home message

When to use classification accuracy over F1:

  • if True Positives and True Negatives are more important
  • if classes are balanced

When to use F1-score over accuracy:

  • if False Negatives and False Positives are crucial
  • if we seek a balance between Precision and Recall
  • if classes are unbalanced

Using probabilities

  • Most classification models provide probabilistic responses for each class, which can be incorporated into useful metrics

  • Notation-wise, let’s use \(z_{ig}\) to correspond to the probability that observation \(i\) belongs to group/class \(g\)

  • So for each observation \(i\) we would have a vector

    \[ z_i = (z_{i1}, z_{i2}, \dots, z_{iG}) \]

    where \(G\) denotes the total number of groups.
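In scikit-learn, for example, these are the vectors returned by predict_proba (the data below is made up):

```python
# Sketch: obtaining the z_i vectors of class probabilities from a fitted classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up 3-class data
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

Z = clf.predict_proba(X)      # row i is z_i = (z_i1, ..., z_iG)
print(Z[0], Z[0].sum())       # each row sums to 1
```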

Example Using probabilities

If observation 1 belongs to class 3 we might have:

  • Model 1: \(z_1\) = (0.10, 0.30, 0.60)
  • Model 2: \(z_1\) = (0.05, 0.05, 0.90)

Which model is better?

The classification accuracy would be the same but Model 2 might be deemed better since it is more certain about that correct classification.

Logloss

  • One popular loss-based measure, which is also easily defined in this multi-class scenario, is logloss.

  • Logloss is defined by \[-\frac{1}{n} \sum_{i=1}^{n} \sum_{g=1}^G I(y_i = g) \log z_{ig}\]

Logloss Example

So for our previous example…where obs 1 belongs to class 3

Model \(z_{ig}\) logloss
1 \(z_1\) = (0.10, 0.30, 0.60) \(- \log(0.60) = 0.511\)
2 \(z_1\) = (0.05, 0.05, 0.90) \(- \log(0.90) = 0.105\)
3 \(z_1\) = (0.55, 0.44, 0.01) \(- \log(0.01) = 4.605\)
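The formula is only a few lines of code; as a sanity check, the value for Model 1 below matches scikit-learn’s log_loss (which uses the same natural-log convention):

```python
# Sketch: logloss as defined above, checked against the Model 1 row of the table.
import numpy as np
from sklearn.metrics import log_loss

def logloss(y_true, Z, classes):
    """-(1/n) * sum_i sum_g I(y_i = g) * log(z_ig)"""
    Z = np.asarray(Z)
    cols = [classes.index(y) for y in y_true]                 # column of each true class
    return -np.mean(np.log(Z[np.arange(len(y_true)), cols]))

classes = [1, 2, 3]
Z_model1 = [[0.10, 0.30, 0.60]]   # observation 1 truly belongs to class 3

print(logloss([3], Z_model1, classes))           # 0.5108... = -log(0.60)
print(log_loss([3], Z_model1, labels=classes))   # same value from scikit-learn
```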

Logloss properties

  • From the example it’s clear that logloss does not have a simple upper bound.

  • In fact, since \(-\log(0) = \infty\), it is technically unbounded above. The lower bound (a perfect probabilistic classifier) would be \(-\log(1) = 0\) for each observation.

  • This means that even just one highly confident misclassification is heavily penalized by this metric.
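A quick numerical check of this, using made-up probabilities assigned to the true class:

```python
# One very confident mistake dominates the average logloss (made-up numbers).
import numpy as np

p_true = np.array([0.9, 0.95, 0.99, 1e-6])    # last value: a confident misclassification
print(-np.log(p_true))            # [0.105, 0.051, 0.010, 13.816]
print(np.mean(-np.log(p_true)))   # ~3.5, driven almost entirely by the last term
```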

Final note

  • For measuring classification performance, it is generally good practice to report the classification table alongside any chosen metrics

  • This can help flag any strange results that might be missed by looking at just summary metrics.