Some observations, now I've had time to think about what I did.
While I want to evaluate the model itself, rather than the decisions it will be used for, at the time of decision-making there will be a chosen threshold. Calculating expected confusion counts in the manner described above is equivalent to placing a uniform belief distribution over $[0, 1]$ on the decision threshold that will be used.
One issue with using the mean confusion values to calculate the other metrics is that some of them become biased. Mean recall/sensitivity is calculated correctly, since its denominator is threshold-independent: $$\mathbb{E}(\textrm{sens}) = \mathbb{E}\left(\frac{TP}{TP + FN}\right) = \frac{\mathbb{E}(TP)}{\textrm{# actual positives}},$$ where the expectation ($\mathbb{E}$) is taken over the threshold value. The mean specificity is likewise calculated correctly, since its denominator, the number of actual negatives, is also fixed. However, for precision/PPV and NPV, the denominator depends on the threshold, so using the mean confusion values introduces a bias: $$\mathbb{E}(\textrm{prec}) = \mathbb{E}\left(\frac{TP}{TP + FP}\right) = \frac{\mathbb{E}(TP)}{\mathbb{E}(TP) + \mathbb{E}(FP)} + \mathcal{O}(\textrm{Cov}(TP, FP) + \textrm{Var}(FP)) \, \textrm{as # observations } \uparrow \infty.$$ At the moment, I'm not sure how I would easily calculate the correct mean values for these.
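To make the bias concrete, here is a small numerical sketch (the synthetic data and variable names are my own, purely illustrative): averaging recall over a grid of thresholds matches the recall computed from mean confusion counts, while the two versions of precision disagree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: labels plus noisy scores from a passable model.
y = rng.integers(0, 2, size=1000)
scores = np.clip(y * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)

# A fine grid approximating a uniform belief over thresholds in [0, 1].
thresholds = np.linspace(0, 1, 2001)

# Confusion counts at each threshold; rows index thresholds.
pred = scores[None, :] > thresholds[:, None]
tp = (pred & (y == 1)).sum(axis=1).astype(float)
fp = (pred & (y == 0)).sum(axis=1).astype(float)
fn = ((~pred) & (y == 1)).sum(axis=1).astype(float)

# Recall: mean of the ratio equals ratio of the means, since TP + FN is
# the fixed number of actual positives at every threshold.
mean_of_recall = np.mean(tp / (tp + fn))
recall_of_means = tp.mean() / (tp.mean() + fn.mean())

# Precision: the two disagree, because TP + FP varies with the threshold.
with np.errstate(invalid="ignore"):
    prec = tp / (tp + fp)          # NaN where nothing is predicted positive
mean_of_precision = np.nanmean(prec)
precision_of_means = tp.mean() / (tp.mean() + fp.mean())

print(mean_of_recall, recall_of_means)        # essentially equal
print(mean_of_precision, precision_of_means)  # noticeably different
```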
Setting aside those problems, here are some further comments on evaluating a model by applying a belief distribution to the thresholds that will be used for decision-making. This is not as unusual as it might seem: proper scoring rules can often be described in the same way.
For example, we can follow Rosen [1] or Merkle and Steyvers [2] in considering a classification loss function whose cost depends on the threshold value rather than the forecast value. The Brier score is then equivalent to using this loss under a uniform distribution for the threshold; the log score corresponds to a Haldane prior.
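As a sanity check on that construction (my own notation and code, not Rosen's): take the threshold-dependent loss that classifies positive iff $f > c$, charging $c$ for a false positive and $1 - c$ for a false negative; averaging it over a uniform threshold $c$ recovers half the Brier score $(f - y)^2$.

```python
import numpy as np

def threshold_loss(f, y, c):
    """Cost-weighted misclassification loss at threshold c:
    classify positive iff f > c; a false positive costs c,
    a false negative costs 1 - c."""
    pred = f > c
    fp = pred & (y == 0)      # predicted positive, actually negative
    fn = (~pred) & (y == 1)   # predicted negative, actually positive
    return np.where(fp, c, 0.0) + np.where(fn, 1.0 - c, 0.0)

# A fine grid standing in for a uniform belief over thresholds in [0, 1].
c_grid = np.linspace(0.0, 1.0, 100_001)

for f in (0.1, 0.4, 0.9):
    for y in (0, 1):
        mean_loss = threshold_loss(f, y, c_grid).mean()
        print(f"f={f}, y={y}: 2 * mean loss = {2 * mean_loss:.4f}, "
              f"Brier = {(f - y) ** 2:.4f}")
```

The factor of two is just normalisation; rescaling a loss doesn't change which forecasts it rewards.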
By comparison, let's look at the mean accuracy. This also uses a uniform distribution for the threshold, like the Brier score, but the loss function is based on the forecast value instead. Indeed, the mean accuracy turns out to be equivalent to the "linear score", with loss function $S(f, y) = f(1-y) + (1-f)y$ for forecast $f$ and observation $y$, the classical example of a reasonable-looking scoring rule which turns out to be improper.
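A matching sketch for this equivalence (again my own illustrative code): the 0/1 misclassification loss averaged over a uniform threshold reproduces the linear score, and the improperness shows up as the expected score being minimised at a hard 0 or 1 rather than at the true probability.

```python
import numpy as np

c_grid = np.linspace(0.0, 1.0, 100_001)

def mean_zero_one_loss(f, y):
    """0/1 misclassification loss (classify positive iff f > c),
    averaged over a uniform threshold c in [0, 1]."""
    pred = f > c_grid
    return np.mean(pred != y)

def linear_score(f, y):
    """The improper 'linear score' S(f, y) = f(1-y) + (1-f)y."""
    return f * (1 - y) + (1 - f) * y

# The threshold-averaged 0/1 loss agrees with the linear score.
for f in (0.2, 0.5, 0.8):
    for y in (0, 1):
        assert abs(mean_zero_one_loss(f, y) - linear_score(f, y)) < 1e-3

# Improperness: if the true probability of y = 1 is p = 0.7, the expected
# linear score p*S(f, 1) + (1-p)*S(f, 0) is minimised at f = 1, not f = p.
p = 0.7
f_grid = np.linspace(0.0, 1.0, 101)
expected = p * linear_score(f_grid, 1) + (1 - p) * linear_score(f_grid, 0)
print(f_grid[np.argmin(expected)])  # 1.0, not 0.7
```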
We can, therefore, say that accuracy is to the linear score as misclassification loss is to the Brier score.
[1] Rosen, D.B. (1996). How Good Were Those Probability Predictions? The Expected Recommendation Loss (ERL) Scoring Rule. In: Heidbreder, G.R. (ed.) Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics, vol. 62. Springer, Dordrecht. https://doi.org/10.1007/978-94-015-8729-7_33
[2] Merkle, E.C., Steyvers, M. (2013). Choosing a Strictly Proper Scoring Rule. Decision Analysis 10(4):292-304. https://doi.org/10.1287/deca.2013.0280