Lecture 7: Assessing Classification Models
University of British Columbia Okanagan
Today we will be looking at different ways of assessing models in the classification setting.
Let’s first consider a binary response in which two types of errors can be made. For ease of explanation, let’s say we are trying to predict whether or not someone has a disease.
This is very easily understood via a classification table, AKA a confusion matrix.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) = True Positive | \(b\) = False Negative |
| Actual No | \(c\) = False Positive | \(d\) = True Negative |
\[ \text{Misclassification Rate} = \frac{b+c}{a + b + c+d} = \frac{b+c}{n} \]
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) = True Positive | \(b\) = False Negative |
| Actual No | \(c\) = False Positive | \(d\) = True Negative |
\[ \text{Classification Accuracy} = \frac{a+d}{a + b + c+d} = \frac{a+d}{n} \]
\[ \begin{gather} \text{Misclassification rate + Classification accuracy}\\ =\frac{b+c}{n} + \frac{a+d}{n} = \frac{n}{n} = 1 \end{gather} \]
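As a quick illustration, here is a minimal Python sketch (my own, not part of these notes) computing both quantities from the four cells of a 2×2 confusion table, using the counts from the first example table below:

```python
# Minimal sketch (not from the notes): accuracy and misclassification rate
# computed from the four cells of a 2x2 confusion table.
a, b = 87, 1   # true positives, false negatives
c, d = 10, 2   # false positives, true negatives

n = a + b + c + d
accuracy = (a + d) / n           # 0.89
misclassification = (b + c) / n  # 0.11

print(accuracy, misclassification)
print((a + d) + (b + c) == n)    # True: the two rates always sum to 1
```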
While these can easily be extended to multi-class (>2 groups) scenarios, you may run into issues in terms of interpretability …
Is a model that provides 0.9 classification accuracy good?
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 87 | 1 |
| Actual No | 10 | 2 |
Classification accuracy =\(\frac{89}{100}\)
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 9 | 0 |
| Actual No | 1 | 0 |
Classification accuracy =\(\frac{9}{10}\)
For unbalanced (AKA imbalanced) classes, high classification accuracy (equivalently, a low misclassification rate) can be deceiving.
While these measures are easily extended to multi-class scenarios, the problem with unbalanced classes remains.
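To see this concretely, here is a short sketch (my own illustration, assuming scikit-learn is installed) in which a trivial classifier that always predicts the majority class already attains high accuracy on an unbalanced sample with class proportions like the example above (88 Yes, 12 No):

```python
# Sketch (not from the notes): on unbalanced data, always predicting the
# majority class already yields high accuracy, despite having no skill.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # features are irrelevant to this baseline
y = np.array([1] * 88 + [0] * 12)    # 88 "Yes", 12 "No": unbalanced classes

trivial = DummyClassifier(strategy="most_frequent").fit(X, y)
print(trivial.score(X, y))           # 0.88 -- high accuracy, zero insight
```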
Question: Is a 0.98 classification accuracy inherently “better” than 0.82?
Classifier 1
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 87 | 1 |
| Actual No | 1 | 11 |
Classification accuracy =\(\frac{98}{100}\)
Classifier 2
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 79 | 9 |
| Actual No | 9 | 3 |
Classification accuracy =\(\frac{82}{100}\)
In this case yes! Classifier 1 \(>\) Classifier 2
Classifier 1
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 88 | 0 |
| Actual No | 2 | 10 |
Classification accuracy =\(\frac{98}{100}\)
Classifier 2
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 70 | 18 |
| Actual No | 1 | 12 |
Classification accuracy =\(\frac{82}{101}\)
It depends…
If it's very important to correctly classify NOs, we may prefer Classifier 2 over Classifier 1.
If it's very important to correctly classify YESs, we may prefer Classifier 1 over Classifier 2.
These concerns occur for continuous responses as well!
By minimizing the MSE (or RSS) we attempt to avoid ANY large mistakes in prediction as best we can.
In some scenarios, one might prefer to minimize the mean absolute error (MAE), which penalizes large mistakes less heavily and so could select a model that makes a few more big mistakes in exchange for smaller errors elsewhere!
That choice is, of course, application dependent and involves other (mathematical/statistical) considerations.
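As a small illustration (with made-up error values, not from the notes), MSE and MAE can rank the same two models differently:

```python
# Illustrative sketch: MSE and MAE can disagree on which model is "better".
import numpy as np

errors_A = np.array([1.0] * 9 + [4.0])  # mostly small errors, one large one
errors_B = np.array([1.5] * 10)         # uniformly moderate errors

print(np.mean(errors_A**2), np.mean(errors_B**2))            # 2.5  2.25 -> MSE prefers B
print(np.mean(np.abs(errors_A)), np.mean(np.abs(errors_B)))  # 1.3  1.5  -> MAE prefers A
```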
We may also consider weighted metrics, wherein the classes are weighted differently in order to assign more “cost” to misclassifying one group than the other.
Some software may be adjusted to incorporate these weights during the actual model-fitting process as well.
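For example (an illustration of one possible tool, not something prescribed in these notes), scikit-learn's classifiers accept a class_weight argument that re-weights the classes during fitting:

```python
# Sketch (assuming scikit-learn): weight the classes so that misclassifying
# a "No" (class 0) costs five times as much as misclassifying a "Yes" (class 1).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight={0: 5.0, 1: 1.0})
# clf.fit(X_train, y_train)  # X_train, y_train are placeholders for your own data
```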
Another common approach is to consider metrics that may provide more insight than simply accuracy and misclassification rates alone.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Precision} = \frac{a}{a+c} \]
The proportion of predicted “Yes”s that are actually “Yes”.
Higher precision suggests that the model is good at avoiding false positives: it focuses on making accurate positive predictions (i.e., the costs incurred from false positives are kept relatively low or manageable).
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Recall} = \frac{a}{a+b} \]
The proportion of actual “Yes”s that were predicted “Yes” (recall is also known as sensitivity).
Higher recall suggests that the model is good at avoiding false negatives.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | \(a\) | \(b\) |
| Actual No | \(c\) | \(d\) |
\[ \text{Specificity} = \frac{d}{c+d} \]
The proportion of actual “No”s that were predicted “No”.
Specificity focuses on avoiding false positives and is concerned with correctly identifying all negative instances.
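Here is a minimal sketch (not from the notes) writing the three primary measures directly in terms of the cells \(a\) (TP), \(b\) (FN), \(c\) (FP), \(d\) (TN), evaluated on the classifier with TP = 87, FN = 1, FP = 2, TN = 10 that appears later in these slides:

```python
# Sketch: the three primary measures as functions of the confusion-matrix cells.
def primary_measures(a, b, c, d):
    precision = a / (a + c)      # of the predicted "Yes"s, how many are actually "Yes"
    recall = a / (a + b)         # of the actual "Yes"s, how many were predicted "Yes"
    specificity = d / (c + d)    # of the actual "No"s, how many were predicted "No"
    return precision, recall, specificity

print(primary_measures(87, 1, 2, 10))  # (0.978, 0.989, 0.833), rounded
```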
So of the primary measures, if, say, it were costly to miss out on identifying a TRUE case, we might focus on maximizing Recall/Sensitivity while paying “less” attention to Specificity.
If it were costly to miss out on identifying FALSE cases, then we might focus on the opposite (maximizing Specificity).
In some cases we might do exactly that, but in most cases it will still be useful to have a single measure that is not quite as flawed as the (mis)classification rates alone.
Secondary Measures are ways to combine the information from multiple primary measures.
Let’s look at a motivating example …
Consider a classification task with the following results:
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | 10 | 0 |
| Actual No | 90 | 0 |
\[ \begin{align} \text{Precision} &= \frac{10}{10+90} = 0.1\\ \text{Recall} &= \frac{10}{10} = 1.0 \end{align} \]
Let’s consider aggregating the information summarized in precision and recall by taking a simple average:
\[ \frac{\text{Precision} + \text{Recall}}{2} = \frac{0.1+1}{2} = 0.55 \]
0.55 seems too good for a trivial classifier guessing only the minority class!
A popular secondary measure is the F1 score. It is the harmonic mean of Precision and Recall.
\[ \text{F1} = \dfrac{2 \times \text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}} \]
Lots of work has shown that F1 is more reliable than mis/classification rates for summarizing performance on unbalanced data sets.
We can view F1 as a “goodness” measure for our classifier: it lies between 0 and 1, with higher values indicating better performance.
Example 3’s score is \(\text{F1} = \dfrac{2\times 0.1 \times 1}{1.1} = 0.182\)
For multi-class problems, F1 (and other secondary measures) is usually computed for each class, and the per-class values are then sometimes averaged or summarized via some other measure.
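The sketch below (my own, not from the notes) recomputes the motivating example: the harmonic mean (F1) is far less generous than the simple average for the trivial classifier with precision 0.1 and recall 1.0.

```python
# Sketch: F1 (harmonic mean) vs. the simple average of precision and recall
# for the trivial classifier from the motivating example.
precision, recall = 0.1, 1.0

simple_average = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(simple_average)  # 0.55   -- looks deceptively respectable
print(f1)              # 0.1818 -- much closer to how poor the classifier is

# For multi-class problems, scikit-learn's f1_score(y_true, y_pred, average="macro")
# computes a per-class F1 and averages the results.
```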
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (87) | FN (1) |
| Actual No | FP (1) | TN (11) |
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (79) | FN (9) |
| Actual No | FP (9) | TN (3) |
Classifier 1

|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (87) | FN (1) |
| Actual No | FP (2) | TN (10) |
Classifier 2

|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (70) | FN (18) |
| Actual No | FP (1) | TN (12) |
| Metric | Classifier 1 | Classifier 2 |
|---|---|---|
| Precision | 0.978 | 0.986 (wins) |
| Recall | 0.989 (wins) | 0.795 |
| Specificity | 0.833 | 0.923 (wins) |
| Accuracy | 0.970 (wins) | 0.812 |
| F1 | 0.983 (wins) | 0.880 |
Recall is good to use when the “cost” of a FN is high.
Precision is good to use when the “cost” of a FP is high
Specificity is good to use when the “cost” of a FP is high AND you want to capture all true negatives.
|  | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP (9) | FN (0) |
| Actual No | FP (1) | TN (0) |
All positive examples are predicted as positive. But none of the negative examples is predicted as negative.
When to use classification accuracy over F1: when the classes are roughly balanced and false positives and false negatives carry similar costs.
When to use F1-score over accuracy: when the classes are unbalanced, or when performance on the positive class (avoiding both false positives and false negatives) is the primary concern.
Most classification models provide probabilistic responses for each class, which can be incorporated into useful metrics
Notation-wise, let’s use \(z_{ig}\) to denote the probability that observation \(i\) belongs to group/class \(g\).
So for each observation \(i\) we would have a vector
\[ z_i = (z_{i1}, z_{i2}, \dots, z_{iG}) \]
where \(G\) denotes the total number of groups.
If observation 1 belongs to class 3 we might have, for example:
Model 1: \(z_1 = (0.10, 0.30, 0.60)\)
Model 2: \(z_1 = (0.05, 0.05, 0.90)\)
Which model is better?
The classification accuracy would be the same but Model 2 might be deemed better since it is more certain about that correct classification.
One popular probability-based error measure, which is also easily defined in this multi-class scenario, is logloss.
Logloss is defined by \[-\frac{1}{n} \sum_{i=1}^{n} \sum_{g=1}^G I(y_i = g) \log z_{ig}\]
So for our previous example…where obs 1 belongs to class 3
| Model | \(z_1\) | logloss |
|---|---|---|
| 1 | (0.10, 0.30, 0.60) | \(-\log(0.60) = 0.511\) |
| 2 | (0.05, 0.05, 0.90) | \(-\log(0.90) = 0.105\) |
| 3 | (0.55, 0.44, 0.01) | \(-\log(0.01) = 4.605\) |
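A quick sketch (my own, not from the notes) verifying these single-observation contributions, with a pointer to scikit-learn's log_loss for full data sets:

```python
# Sketch: log loss contribution of observation 1 (true class = 3) for each model.
import numpy as np

z1 = {
    "Model 1": np.array([0.10, 0.30, 0.60]),
    "Model 2": np.array([0.05, 0.05, 0.90]),
    "Model 3": np.array([0.55, 0.44, 0.01]),
}
true_class = 2  # class 3, zero-indexed

for name, probs in z1.items():
    print(name, round(-np.log(probs[true_class]), 3))  # 0.511, 0.105, 4.605

# sklearn.metrics.log_loss(y_true, prob_matrix) averages this over all n observations.
```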
From the example it’s clear that logloss does not have a simple upper bound.
In fact, since \(-\log(0) = \infty\), it is technically unbounded above. The lower bound (a perfect probabilistic classifier) would be \(-\log(1) = 0\) for each observation.
This means that even just one highly confident misclassification is heavily penalized by this metric.
For measuring classification performance, it is generally good practice to report the classification table alongside any chosen metrics
This can help flag any strange results that might be missed by looking at just summary metrics.