Making sense of confusion matrices: ROC vs PR (precision-recall) and other metrics

Confusion matrices are simple in principle: four numbers that describe the performance of a binary classifier. Yet a full understanding of the behavior and meaning of a confusion matrix is far more subtle than it would appear based on the existence of just four numbers.

This is a confusion matrix. Looks simple right?


The first metric we look at is accuracy. This is the fraction of all instances that are correctly labeled.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy seems like a simple metric, but it’s more subtle than it appears at first. For example, is a classifier with accuracy 95% good? It sounds good, but what if 95% of the instances are negative? Then an all-negative classifier achieves 95% accuracy. Maybe that’s good enough–but it’s something you have to think about with the accuracy metric.
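To make the imbalance trap concrete, here is a minimal sketch with made-up counts (1,000 instances, 95% of them negative). The `accuracy` helper and the numbers are illustrative assumptions, not taken from a real data set.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that are labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical data set: 1,000 instances, 95% of them negative.
n_pos, n_neg = 50, 950

# An all-negative classifier never predicts positive, so TP = FP = 0,
# every actual positive is a false negative, every actual negative a true negative.
baseline = accuracy(tp=0, fp=0, fn=n_pos, tn=n_neg)
print(f"all-negative accuracy: {baseline:.2f}")
```

Any real classifier on this data should be judged against that baseline rather than against 0%.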

An interesting thing about accuracy is that it is symmetric in the labels (positive vs. negative). You could switch the meaning of positive and negative and the accuracy would still be the same.
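The symmetry can be checked in a few lines. This is a sketch with made-up counts; `swap_labels` is a hypothetical helper that renames the classes, so true positives become true negatives and false positives become false negatives.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def swap_labels(tp, fp, fn, tn):
    """Rename the classes: positive becomes negative and vice versa."""
    return {"tp": tn, "fp": fn, "fn": fp, "tn": tp}

counts = {"tp": 30, "fp": 10, "fn": 20, "tn": 940}  # made-up counts
swapped = swap_labels(**counts)

print(accuracy(**counts), accuracy(**swapped))    # identical
print(precision(**counts), precision(**swapped))  # different
```

Precision is included only for contrast: unlike accuracy, it changes when the labels are swapped.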

Precision & Recall

Precision and recall are two metrics talked about frequently in the data science community.

Precision is the fraction of instances labeled positive that are actually positive.

Precision = TP / (TP + FP)

In general, a classifier can be pushed to higher precision by increasing the decision threshold, or making the classifier more conservative. In this case, the classifier only marks positive the instances on which it is most confident.

Recall is the fraction of all positive instances which are labeled positive.

Recall = TP / (TP + FN)

Recall can be pushed to 100% by decreasing the decision threshold. This will move all instances up into the TP and FP boxes.

A classifier makes a trade-off between precision and recall. A conservative classifier (one that marks fewer instances positive) will tend toward 100% precision, because it marks only the instances it is most confident about, though this is not guaranteed. A permissive classifier (one that marks more instances positive) will approach 100% recall as it moves all instances up into the TP and FP boxes.
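The trade-off is easy to see on a toy example. The scores and labels below are invented for illustration, and `precision_recall` is a hypothetical helper that marks an instance positive when its score is at or above the threshold.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are marked positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # With nothing marked positive, precision is undefined; report None.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn)
    return precision, recall

# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

conservative = precision_recall(scores, labels, threshold=0.85)
permissive = precision_recall(scores, labels, threshold=0.0)
print("conservative (precision, recall):", conservative)
print("permissive   (precision, recall):", permissive)
```

On this data the high threshold gives perfect precision at half the recall, and the low threshold gives perfect recall at half the precision.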

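Precision and recall are not symmetric in the labels the way accuracy is. Furthermore, they are not ‘reflections’ of each other. Therefore, distinct, ‘reflected’ metrics of negative precision and negative recall can be defined, although these also have other names: negative recall is better known as specificity, and negative precision as negative predictive value (NPV).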

Sensitivity & Specificity

Two metrics that are often used in medical settings are sensitivity and specificity.

Sensitivity is the same as recall, the fraction of all positive instances that are marked positive.

Sensitivity = TP / (TP + FN)

By decreasing the decision threshold, sensitivity can be increased to 1.

Specificity is the fraction of all negative instances which are marked negative.

Specificity = TN / (TN + FP)

Specificity is the reflection of sensitivity/recall, i.e. it is ‘negative recall’ or ‘negative sensitivity’. It can be increased to 1 by increasing the decision threshold.

If both sensitivity and specificity are 1, accuracy must also be 1 and vice versa. When precision and recall are 1, accuracy must also be 1 and vice versa.
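A quick sanity check of that claim, as a sketch with made-up counts (10 positives, 90 negatives). The `metrics` helper is illustrative only.

```python
def metrics(tp, fp, fn, tn):
    """The four metrics discussed so far, from the confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # same as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# No errors at all: FP = FN = 0, so every metric is 1.
perfect = metrics(tp=10, fp=0, fn=0, tn=90)

# A single missed positive: specificity and precision stay at 1,
# but sensitivity and accuracy both drop below 1.
one_miss = metrics(tp=9, fp=0, fn=1, tn=90)

print(perfect)
print(one_miss)
```

The second case shows why both metrics in a pair are needed: perfect specificity and perfect precision together still leave room for false negatives.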


True positive rate (TPR) is the same as recall and sensitivity:

TPR = TP / (TP + FN)

False positive rate (FPR) is the same as 1 – specificity, where specificity is the same as ‘negative recall’ (recall of the negative instances).

The receiver operating characteristic (ROC) curve is a plot of TPR vs. FPR across all decision thresholds. A random classifier achieves TPR = FPR on average at any decision threshold, while a classifier better than random has TPR > FPR.

TPR vs FPR for a real classifier (blue line) and a theoretical random classifier (green line).

An important metric known as AUC (area under curve) is a holistic metric of a classifier that assesses not one decision threshold, but all possible decision thresholds simultaneously. It is simply the area under the ROC curve. In the above case, it looks like AUC is about 0.7, versus 0.5 for the theoretical random classifier.
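For intuition about how the curve and its area are built, here is a minimal pure-Python sketch. The scores and labels are invented, the function names are hypothetical, and ties between scores are assumed not to occur; a library routine would normally be used instead.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, lowering the threshold one score at a time.

    Assumes no tied scores, to keep the sketch short.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing marked positive
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print("AUC:", auc(roc_points(scores, labels)))
print("diagonal (random) AUC:", auc([(0.0, 0.0), (1.0, 1.0)]))
```

The same number can be read as the fraction of (positive, negative) pairs in which the positive instance gets the higher score.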

What’s unusual about precision?

Among the metrics we’ve discussed so far, precision has a unique property.

The denominator (number marked positive, i.e. TP + FP) of the precision metric changes as the decision threshold is moved. The other metrics have either actual positive or actual negative as the denominator and these numbers do not change with a moving decision boundary.

For this reason, the other metrics change monotonically as the decision threshold increases or decreases. For precision, it tends to decrease as the classifier becomes more permissive (decision threshold decreases), but there can be some jaggedness.

As the decision threshold decreases, both TP and FP increase. In general FP will increase more quickly, so precision eventually falls. However, by random chance, at some points along the way TP can increase more quickly, as actual positives go from being marked negative to positive, and precision briefly rises.

The precision changes jaggedly while other metrics change monotonically with decision threshold.
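The jaggedness shows up even on a tiny example. In this sketch the labels are made up and already sorted by descending classifier score, so each loop step corresponds to lowering the threshold past one more instance.

```python
# True labels of eight instances, sorted by descending score (made-up data).
labels = [1, 1, 0, 1, 0, 1, 0, 0]
n_pos = sum(labels)

tp = fp = 0
precisions, recalls = [], []
for y in labels:          # lower the threshold past one instance at a time
    tp += y               # a positive instance becomes a true positive
    fp += 1 - y           # a negative instance becomes a false positive
    precisions.append(tp / (tp + fp))   # denominator grows at every step
    recalls.append(tp / n_pos)          # denominator is fixed

print("recall:   ", [round(r, 2) for r in recalls])
print("precision:", [round(p, 2) for p in precisions])
```

Recall never goes down as the threshold drops, while precision falls overall but ticks back up each time a newly marked instance happens to be a true positive.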


ROC vs Precision-Recall plot

There is a notion that in highly imbalanced data sets, the precision/recall curve gives more information than the TPR/FPR (ROC) curve. Why is this the case?

Precision penalizes false positives dynamically–in proportion to the number of total marked positive rather than total actual negative.

When total actual negative is very high, false positive rate increases very slowly relative to true positive rate, due to the much higher denominator. FPR also increases very slowly relative to the rate of decrease of precision for the same reason (denominators!).
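A small numeric sketch of the denominator effect. All counts are hypothetical: 100 actual positives, 100,000 actual negatives, and two operating points before and after lowering the threshold.

```python
n_pos, n_neg = 100, 100_000   # hypothetical, heavily imbalanced data set

# Hypothetical (TP, FP) counts before and after lowering the threshold.
before = (70, 100)
after = (80, 400)

def fpr(fp):
    """False positive rate: the denominator is the fixed count of negatives."""
    return fp / n_neg

def precision(tp, fp):
    """Precision: the denominator is the count of instances marked positive."""
    return tp / (tp + fp)

for name, (tp, fp) in [("before", before), ("after", after)]:
    print(f"{name}: TPR={tp / n_pos:.2f}  "
          f"FPR={fpr(fp):.3f}  precision={precision(tp, fp):.2f}")
```

Quadrupling the false positives moves FPR by only 0.003, which is invisible on a ROC plot, while precision drops from about 0.41 to about 0.17.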

Pay attention to recall and true positive rate: they are the same thing!

The image above shows ROC and PR curves with a classifier trained using different class weights, 1 through 100,000.

With a class weight of 1, even the ROC curve shows an obvious difference from the other versions. ROC shows little difference between any of the other curves.

The PR curve shows that class weight of 100,000 has a precision which decreases more quickly than the other class weights (5-100). However, on the ROC curve, the class weight of 100,000 is very similar to the other class weights. How do we reconcile these differences?

In this case, there is a huge class imbalance. This means that for most decision thresholds, TN is much greater than FP. Therefore, FPR indicates a much smaller deviation from “perfect” than precision.

Observe the two graphs above at around recall/TPR = 0.6. On the PR curve, the precision of the blue line (class weight = 1) decreases drastically between recall = 0.6 and 0.7. In the same span on the ROC curve, FPR barely increases. This shows the impact of the huge numbers of TN for these classifiers versus the smaller number of FP in that recall range. However, if we could zoom in on the ROC curve around 0.6, we could see the black line ticking slightly to the right of the other lines.

Another area to investigate is around TPR/recall = 0.95. In this case, the black line is a bit above the red and green lines (class weights 5 and 10) on the ROC curve. And on the PR curve, we do see a crossover in that region.


FPR = FP/(TN + FP)

Precision = TP/(TP + FP)

1 – Precision = FP/(TP + FP)


Intuition about moving the decision threshold

Let’s consider an example classifier. Starting at a very high decision boundary (threshold ~ 1.0), all instances will be marked negative.

As the threshold is lowered, instances become labeled positive. In this example, the first instance labeled positive is a true positive, although that is not guaranteed.

Recall = TPR = 0.1; FPR = 0; Precision = 1

The position on the ROC curve for this model is (FPR, TPR) = (0, 0.1), and on the PR curve it is (recall, precision) = (0.1, 1).

As the decision boundary decreases, even more instances will be labeled positive, including, inevitably, false positives.

Is this classifier any good?

Now we have more false positives (5) than true positives (2), so is this classifier any good? In order for a random classifier to achieve 2 true positives, it would need 18 false positives, on average. (A random classifier has on average TPR = FPR)

For the real classifier: TPR = 2/10 = 0.2; FPR = 5/90 = 0.06; Precision = 2/7 = 0.29

For a random classifier: TPR = 2/10 = 0.2; FPR = 18/90 = 0.2; Precision = 2/20 = 0.1

Thus, this classifier is much better than random, but is it good enough?
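The comparison can be reproduced in a few lines. The class counts (10 positives, 90 negatives) are the ones implied by the denominators in the example; the `rates` helper is a hypothetical name used for illustration.

```python
def rates(tp, fp, n_pos, n_neg):
    """TPR, FPR and precision from the counts of marked-positive instances."""
    return {"tpr": tp / n_pos, "fpr": fp / n_neg, "precision": tp / (tp + fp)}

n_pos, n_neg = 10, 90   # class counts implied by the example

real = rates(tp=2, fp=5, n_pos=n_pos, n_neg=n_neg)

# A random classifier has TPR = FPR on average, so reaching the same TPR
# costs it FP = TPR * (number of negatives) false positives.
random_fp = round(real["tpr"] * n_neg)
random_clf = rates(tp=2, fp=random_fp, n_pos=n_pos, n_neg=n_neg)

print("real:  ", {k: round(v, 2) for k, v in real.items()})
print("random:", {k: round(v, 2) for k, v in random_clf.items()})
```

At the same recall, the real classifier makes 5 false positives where a random one would be expected to make 18.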

(Financial) cost functions

We’ve talked about many statistical metrics so far. However, in business, we care much more about financial metrics, so how would we go about estimating those?

In most cases, we can estimate the financial value (positive or negative) associated with an instance in the four quadrants of the decision matrix. Let’s now assume we’re doing a direct mail marketing campaign and we’re trying to identify which customers will convert if marketed to.

TP — we send a mailer and the customer purchases = $10 (profit of purchase) – $1 (cost of mailer) = $9

FP — we send a mailer and the customer does not convert = -$1

FN — we don’t send a mailer but the customer would have converted = $0

TN — we don’t send a mailer and the customer would not have converted = $0

So the total expected financial gain is:

$( 9*TP – 1*FP )

So this is the quantity we should maximize using model and decision boundary selection.
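As a sketch of what that selection might look like, the snippet below scores a few operating points with the mailer payoffs above. The (threshold, TP, FP) triples are invented for illustration; in practice they would come from evaluating a model on held-out data.

```python
def mailer_profit(tp, fp):
    """Total gain in dollars: $9 per TP, -$1 per FP; FN and TN are worth $0."""
    return 9 * tp - 1 * fp

# Hypothetical (threshold, TP, FP) operating points for one model.
operating_points = [
    (0.9, 10, 5),
    (0.5, 40, 150),
    (0.1, 60, 900),
]

for threshold, tp, fp in operating_points:
    print(f"threshold={threshold}: profit=${mailer_profit(tp, fp)}")

# Pick the threshold with the highest total profit.
best = max(operating_points, key=lambda p: mailer_profit(p[1], p[2]))
print("best threshold:", best[0])
```

With these payoffs the campaign is profitable whenever FP < 9 × TP, which is the same as precision above 10%, so the most profitable threshold can sit far from the one that maximizes accuracy.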

The financial metrics can take different forms in different scenarios. For example, now assume we are predicting churn in a

TP — we offer a discount and the customer doesn’t churn (but would have without a discount) = $100 (expected profit of purchase keeping customer) – $30 (cost of discount) = $70

FP — we offer a discount and the customer doesn’t churn, and wouldn’t have even without discount = $70

FN — we don’t offer a discount and the customer churns = $0

TN — we don’t offer a discount and the customer doesn’t churn = $100

So the total expected financial gain is:

$( 70*TP + 70*FP + 100*TN )

Again, this is the quantity we maximize with model and decision boundary selection instead of statistical metrics.
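Here is a rough sketch that applies the churn payoffs to three policies on a hypothetical base of 1,000 customers, 100 of whom would churn without a discount. The customer counts and the model's confusion matrix are made up; the payoffs are the ones listed above, which assume the discount always retains the customer.

```python
def churn_gain(tp, fp, tn):
    """Total gain in dollars: $70 per TP, $70 per FP, $100 per TN, $0 per FN."""
    return 70 * tp + 70 * fp + 100 * tn

# Hypothetical customer base: 100 would churn, 900 would stay anyway.
policies = {
    "discount nobody":   {"tp": 0,   "fp": 0,   "tn": 900},  # FN = 100
    "discount everyone": {"tp": 100, "fp": 900, "tn": 0},    # FN = 0
    "use the model":     {"tp": 80,  "fp": 100, "tn": 800},  # FN = 20
}

for name, counts in policies.items():
    print(f"{name}: ${churn_gain(**counts):,}")
```

Here each false positive costs $30 of unnecessary discount relative to a true negative, so the model only pays off if it catches enough churners per wasted discount.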

These expected profit models can get much more sophisticated, as the expected profit can depend on the attributes of the instance instead of being a constant number. This introduces another model, or set of models, to estimate the value of a customer.

