How and Why to Compute a Balanced Precision Score

This is a translation of my original article on habr.com.

Precision score is one of the metrics used to evaluate the performance of a binary classifier in machine learning. It has long proven its usefulness: it is a single, easy-to-compute, and interpretable number that characterizes the quality of a model.

A verbal definition of precision score is:

“Of all the examples predicted as positive, what fraction are actually positive?”

For example, if a disease test has a precision score of 0.95, then 95% of people who receive a positive test result are truly ill, while the remaining 5% are actually healthy.

Now to the main point of this article. In the previous paragraph, I deliberately omitted a critically important detail needed to assess the quality of the test: I did not specify the dataset on which the precision score was calculated.

The point is that precision score is not an inherent property of a classifier - it is measured on a dataset and can change when the dataset changes. Because of this, the statement “the precision score of this disease test is 0.95,” without any information about the data used to compute it, does not fully describe the property of the classifier that we are actually interested in. In some cases, it can even lead to incorrect conclusions.

How does that happen? And what do I propose instead? Let me explain.

The short answer: the culprit is class imbalance. By changing the class distribution, you can significantly increase or decrease the precision score without changing the classifier itself.

This problem can be addressed either by abandoning precision score in favor of the confusion matrix - which is not always convenient, since it consists of four numbers rather than one - or by computing a balanced precision score, which is what this article is about.

Precision Score and Class Balance

Precision score depends on the values of True Positive (TP) and False Positive (FP):

$$\text{Precision Score}=\frac{TP}{TP + FP}$$

These quantities are typically represented in a confusion matrix:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

As FP increases, precision score decreases, and vice versa.

It is easy to see that the absolute value of FP depends on the class distribution. If there are more negative examples, FP will generally be larger for the same classifier. By changing the class balance, you can drive the precision score all the way to 0 (by making $TP = 0$, i.e., removing all positive examples) or all the way to 1 (by making $FP = 0$, i.e., removing all negative examples).
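To make this dependence concrete, here is a minimal sketch. It assumes a hypothetical classifier whose per-class behavior is fixed (a true positive rate of 0.90 and a false positive rate of 0.05, numbers chosen only for illustration) and evaluates the expected precision at different shares of positive examples. The function name `precision_at_prevalence` is mine, not part of any library.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Expected precision of a classifier with fixed per-class rates.

    tpr: fraction of actual positives that are predicted positive.
    fpr: fraction of actual negatives that are predicted positive.
    prevalence: fraction of positive examples in the evaluation data.
    """
    tp_share = tpr * prevalence          # expected TP per evaluated example
    fp_share = fpr * (1.0 - prevalence)  # expected FP per evaluated example
    if tp_share + fp_share == 0.0:
        raise ValueError("no positive predictions: precision is undefined")
    return tp_share / (tp_share + fp_share)


# Hypothetical classifier: TPR = 0.90, FPR = 0.05 (illustrative values only).
for prevalence in (0.0, 0.01, 0.1, 0.5, 1.0):
    score = precision_at_prevalence(0.90, 0.05, prevalence)
    print(f"prevalence={prevalence:.2f}  precision={score:.3f}")
```

The classifier is the same in every row of the output; only the share of positive examples changes, and the precision score moves from 0 to 1 with it.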

When a model is evaluated on data drawn from the real-world distribution it will encounter in practice, and the goal is to estimate the actual fraction of false positives among positive predictions, the standard precision score works perfectly well.

However, there are situations where it can lead to misleading conclusions.

1. Precision score increased after training on data with positive-class upsampling or negative-class downsampling. Does that mean the final model should always be trained this way?

Not necessarily.

The metric may have improved simply because the data now contains a larger share of positive examples, even though the classifier itself did not become any better. Just look at the precision score formula again.

2. During training, one of the classes was upsampled or downsampled. The resulting metrics, including precision score, look good. Is the model ready for production?

The same issue applies here.

In production, the proportion of negative examples may be much higher. That alone can increase FP and, consequently, reduce the precision score.

3. “A low precision score means poor classification quality.”

Not always.

The actual false positive rate, $\frac{\text{False Positives}}{\text{Actual Negatives}}$, may be very low. Yet if the negative class is much larger than the positive class, the absolute number of false positives can still be high enough to substantially reduce the precision score.

Perhaps there is a way to reduce the number of negative examples reaching the classifier during inference. In that case, the exact same model may perform quite well.
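A small numeric sketch of this situation, with made-up numbers: the classifier flags only 1% of negatives, yet negatives outnumber positives 1000 to 1. The second call assumes a hypothetical upstream filter that removes 99% of the negatives without losing any positives; that is an idealization, used only to show the direction of the effect.

```python
# Hypothetical numbers, chosen only for illustration.
TPR = 0.90     # true positive rate of the classifier
FPR = 0.01     # false positive rate: only 1% of negatives are flagged
N_POS = 1_000  # positive examples reaching the classifier


def precision_for(n_neg: int) -> float:
    """Expected precision when n_neg negative examples reach the classifier."""
    tp = TPR * N_POS
    fp = FPR * n_neg
    return tp / (tp + fp)


# All 1,000,000 negatives reach the classifier.
precision_all = precision_for(1_000_000)

# Assumed pre-filter: only 10,000 negatives reach the same classifier.
precision_filtered = precision_for(10_000)

print(f"without filter: {precision_all:.3f}")
print(f"with filter:    {precision_filtered:.3f}")
```

The model is identical in both cases; only the number of negatives it sees differs.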

Balanced Precision Score

These kinds of mistakes are much less likely when you examine the confusion matrix.

A confusion matrix contains more information than a single precision score, making it easier to notice important characteristics of a classifier's behavior.

However, if using a confusion matrix is impractical and you need a quick way to compare models trained for the same task on datasets with different class distributions, then directly comparing their precision scores is not valid because those scores are affected by class balance.

Instead, precision score should be computed on data with the same class distribution. For example, you can assume a perfectly balanced dataset. The resulting metric is what I call the balanced precision score.

This approach removes the effect of unequal class frequencies by giving the positive and negative classes equal total weight when computing precision score.

Of course, the resulting metric no longer measures precision under the real-world class distribution.

What it does provide is a way to compare classifiers independently of class balance, as well as an additional perspective on classification quality. Recall the earlier example where precision score was low despite a low false positive rate, simply because the negative class dominated the dataset.

One way to compute balanced precision score would be to rebalance the dataset through upsampling or downsampling and then calculate the ordinary precision score.

But that is unnecessary.

Instead, use the following formula:

$$\text{Balanced Precision Score}=\frac{k_1 TP} {k_1 TP + k_2 FP}$$

Where:

$$k_1=\frac{0.5}{TP + FN}=\frac{0.5}{\text{Actual Positive}}$$

$$k_2=\frac{0.5}{TN + FP}=\frac{0.5}{\text{Actual Negative}}$$
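It may help to see what the formula reduces to. With $k_1$ equal to $0.5$ divided by the number of actual positives and $k_2$ equal to $0.5$ divided by the number of actual negatives, the factors of $0.5$ cancel and only per-class rates remain:

$$\text{Balanced Precision Score}=\frac{\frac{TP}{TP + FN}}{\frac{TP}{TP + FN}+\frac{FP}{TN + FP}}=\frac{TPR}{TPR + FPR}$$

Here $TPR=\frac{TP}{TP + FN}$ is the true positive rate (recall) and $FPR=\frac{FP}{TN + FP}$ is the false positive rate. Neither rate depends on how many examples each class contains, which is why the result does not change with class balance.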

If you use Python's sklearn library, the precision_score function already supports sample weighting through the sample_weight parameter:

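import numpy as np
from sklearn.metrics import precision_score

# y_true, y_pred, ...
# (assumed here to be NumPy arrays of 0/1 labels)

# class probabilities in y_true, indexed by label, e.g.
# np.array([0.9092, 0.0908])
true_class_dist = np.array([np.mean(y_true == 0), np.mean(y_true == 1)])

# per-class coefficient: each class ends up with the same total weight
sample_coeff = 0.5 / true_class_dist

# assign every sample the coefficient of its true class
weights = np.zeros(len(y_true))
weights[y_true == 1] = sample_coeff[1]
weights[y_true == 0] = sample_coeff[0]

precision_score(y_true, y_pred, sample_weight=weights)  # balanced precision score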

If you are not using sklearn, you can compute the balanced precision score directly from the formula above.
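A minimal dependency-free sketch of that direct computation, taking the four confusion-matrix counts as input. The function name `balanced_precision` and the counts in the usage lines are hypothetical, chosen only for illustration.

```python
def balanced_precision(tp: float, fp: float, fn: float, tn: float) -> float:
    """Balanced precision score from confusion-matrix counts.

    Implements k1*TP / (k1*TP + k2*FP) with
    k1 = 0.5 / (TP + FN) and k2 = 0.5 / (TN + FP).
    """
    actual_pos = tp + fn
    actual_neg = tn + fp
    if actual_pos == 0 or actual_neg == 0:
        raise ValueError("both classes must be present in the data")

    k1 = 0.5 / actual_pos  # weight of one positive example
    k2 = 0.5 / actual_neg  # weight of one negative example

    denominator = k1 * tp + k2 * fp
    if denominator == 0:
        raise ValueError("no positive predictions: precision is undefined")
    return k1 * tp / denominator


# Hypothetical counts, for illustration only.
score = balanced_precision(tp=80, fp=30, fn=20, tn=870)
print(f"balanced precision: {score:.3f}")
```

Raising an error when a class is missing or when nothing is predicted positive is a design choice of this sketch; returning 0 or NaN in those cases would be equally defensible.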

Example

Consider the same classifier evaluated on two datasets that differ only in class balance. Let us compare the ordinary precision score and the balanced precision score.

The first confusion matrix (highly imbalanced data):

	Predicted Positive	Predicted Negative	Total
Actual Positive	90 (TP)	10 (FN)	100
Actual Negative	50 (FP)	850 (TN)	900

The second confusion matrix (more balanced data):

	Predicted Positive	Predicted Negative	Total
Actual Positive	90 (TP)	10 (FN)	100
Actual Negative	10 (FP)	170 (TN)	180

Using the standard precision score formula, we obtain:

First dataset: $0.64$
Second dataset: $0.90$

Now compute the balanced precision score.

In both cases, it is equal to $0.94$.

As expected, the balanced metric produces the same result because the classifier's behavior is identical; only the class distribution differs.
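For reference, the arithmetic behind these numbers, using the counts from the two confusion matrices.

Standard precision score:

$$\frac{90}{90 + 50}\approx 0.64 \qquad \frac{90}{90 + 10}=0.90$$

Balanced precision score for the first dataset ($k_1=\frac{0.5}{100}$, $k_2=\frac{0.5}{900}$):

$$\frac{\frac{0.5}{100}\cdot 90}{\frac{0.5}{100}\cdot 90+\frac{0.5}{900}\cdot 50}\approx\frac{0.45}{0.45 + 0.0278}\approx 0.94$$

Balanced precision score for the second dataset ($k_1=\frac{0.5}{100}$, $k_2=\frac{0.5}{180}$):

$$\frac{\frac{0.5}{100}\cdot 90}{\frac{0.5}{100}\cdot 90+\frac{0.5}{180}\cdot 10}\approx\frac{0.45}{0.45 + 0.0278}\approx 0.94$$

The two results coincide because $\frac{50}{900}=\frac{10}{180}$: the false positive rate is the same in both datasets, and so is the true positive rate.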

I hope you find this approach useful - and perhaps identify some of its limitations as well. I would be happy to hear your thoughts on this!
