Does Multinomial Probability Calibration Consider the Probabilities of the Non-Dominant Classes?

Question

The gist behind Harrell's rms::calibrate function makes sense to me. While I have yet to understand the magic that lets us calculate the "true" probabilities, I get the idea of comparing the predicted probabilities to the true probabilities, particularly in a simulation setting, where we do know the true probabilities.

    set.seed(2021)
    N <- 1000
    x <- runif(N, -2, 2)
    z <- x
    pr <- 1/(1 + exp(-z))
    y <- rbinom(N, 1, pr)
    L <- glm(y ~ x, family = binomial)
    prob_preds <- 1/(1 + exp(-predict(L)))
    plot(pr, prob_preds)
    abline(a = 0, b = 1)

The points fall, more or less, on the line $y = x$, meaning that the predicted probabilities match the true probability of $y=1$ (given by pr) at each value of $x$.

In the binary setting, it makes sense to me that, if we have a bunch of probability predictions around $0.8$, then about $80\%$ of them should have $y=1$ and about $20\%$ of them should have label $y=0$.
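A minimal sketch of that check, assuming the same simulation design as above but with a larger sample so the window around $0.8$ is well populated. The window width (0.75 to 0.85) and the names `fit`, `p_hat`, and `near_08` are illustrative choices, not part of any package.

```r
set.seed(2021)
N <- 20000
x <- runif(N, -2, 2)
pr <- 1 / (1 + exp(-x))          # true probability of y = 1
y <- rbinom(N, 1, pr)

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")   # predicted probabilities

# Keep only the predictions close to 0.8 and compare them to the outcomes
near_08 <- p_hat > 0.75 & p_hat < 0.85
mean_pred <- mean(p_hat[near_08])
frac_ones <- mean(y[near_08])

c(n = sum(near_08), mean_pred = mean_pred, frac_ones = frac_ones)
```

If the model is calibrated in this region, `frac_ones` should be close to `mean_pred`, up to sampling noise.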

In the multiclass setting, something similar should apply. If we want to predict the probability that, mutually exclusively, a photograph contains a yellow lab, golden retriever, or an Irish setter (three dog breeds), then if we get a bunch of probability predictions like $(P(\text{Lab}) = 0.7, P(\text{Golden}) = 0.2, P(\text{Setter}) = 0.1)$, about $70\%$ should be yellow labs, $20\%$ golden retrievers, and $10\%$ Irish setters. If we have that $70\%$ are yellow labs but the rest are golden retrievers, then the model is miscalibrated.
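The three-breed example can be sketched as a small simulation. Everything here is hypothetical: a model that always outputs $(0.7, 0.2, 0.1)$, and two made-up scenarios for how the photos are actually distributed. The helper `class_freq` is defined only for this illustration.

```r
set.seed(2021)
N <- 10000
breeds <- c("Lab", "Golden", "Setter")
pred <- c(Lab = 0.7, Golden = 0.2, Setter = 0.1)   # the model's constant prediction

# Observed share of each breed, as a plain named numeric vector
class_freq <- function(y) {
  setNames(as.numeric(table(factor(y, levels = breeds))) / length(y), breeds)
}

# Scenario A: outcomes follow the predicted probabilities
y_a <- sample(breeds, N, replace = TRUE, prob = c(0.7, 0.2, 0.1))
# Scenario B: 70% labs, but all the rest are goldens
y_b <- sample(breeds, N, replace = TRUE, prob = c(0.7, 0.3, 0.0))

freq_a <- class_freq(y_a)
freq_b <- class_freq(y_b)
rbind(predicted = pred, scenario_A = freq_a, scenario_B = freq_b)
```

In both scenarios the lab column roughly matches the prediction; only the golden and setter columns distinguish the calibrated case from the miscalibrated one.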

Guo's "On Calibration of Modern Neural Networks" seems only to calibrate the dominant class. That is, the paper aims to make sure that the $P(\text{Lab}) = 0.7$ is calibrated, but it totally misses the golden retriever and Irish setter probabilities. This is not an issue in the binary setting. If we just want to tell labs from goldens, then having $P(\text{Lab})$ be calibrated means that $P(\text{Golden})$ is calibrated, too (assuming every photo has exactly one dog breed present), since the two probabilities must add to $1$. Once we introduce a third class, however, knowing the probability of one class does not give us the probability of any other class.

For instance, Guo's equation (1) makes mention of the predicted class. If the prediction (or at least dominant class) is a yellow lab, what about the probabilities of being a golden retriever or an Irish setter? Guo's section 4.2 reinforces my belief that the non-dominant classes are not calibrated, since it only considers the maximum probability returned by the softmax function (ergo the probability of the dominant class).

Am I missing a place where Guo calibrated the probabilities of the non-dominant classes?

Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.

(In my terminology, the golden retriever and Irish setter are called the non-dominant classes, since they do not have the highest probability of class membership, and the yellow lab is the dominant class, since it has the highest probability of class membership. Note that none of the classes need to exceed probabilities of $0.5$, so a dominant class could have a probability of $0.45$ if the other two have probabilities of $0.3$ and $0.25$, for instance.)

Stephan Kolassa · Accepted Answer · 2021-12-16 12:35:01Z

From what I understand in the paper, you are correct: it does not (necessarily) calibrate the non-majority class probabilities. (Actually, I am not completely confident that it correctly calibrates the majority class probabilities.)

Guo et al.'s equations (1) and (2) could in principle be used for multiclass situations, with some generosity in interpreting them - after all, they are "$\forall p\in[0,1]$". The problem crops up later, in section 4.2, where the output of an ostensibly multiclass classifier is defined to be a single class and its associated "confidence", via the argmax function: the predicted class for the $i$-th instance is $\hat{y}_i=\text{argmax}_k \,z_i^{(k)}$, where $k$ runs over the possible classes, and $z_i^{(k)}$ is the logit. Thus, they explicitly only consider the most likely class.
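To see what a top-label measure can miss, here is a sketch under invented assumptions: a simulated model that gets the dominant class probability exactly right but swaps the probabilities of the two other classes. The binned `ece` helper follows the usual expected calibration error recipe (bin weight times the gap between mean confidence and observed frequency); applying it per class and averaging is one per-class variant, often called classwise ECE, and is not taken from Guo et al.

```r
set.seed(2021)
N <- 20000; K <- 3

# Hypothetical truth: top class has probability t_top, the remaining mass
# is split 80/20 between the other two classes
t_top <- runif(N, 0.5, 0.9)
p_sorted <- cbind(t_top, 0.8 * (1 - t_top), 0.2 * (1 - t_top))
q_sorted <- p_sorted[, c(1, 3, 2)]   # model swaps the two non-dominant values

p_true <- p_hat <- matrix(0, N, K)
for (i in seq_len(N)) {
  ord <- sample.int(K)               # random assignment of roles to classes
  p_true[i, ord] <- p_sorted[i, ]
  p_hat[i, ord]  <- q_sorted[i, ]
}
y <- apply(p_true, 1, function(p) sample.int(K, 1, prob = p))

# Binned calibration error for a binary event `hit` with confidence `conf`
ece <- function(conf, hit, n_bins = 10) {
  bin <- cut(conf, seq(0, 1, length.out = n_bins + 1), include.lowest = TRUE)
  gap <- abs(as.numeric(tapply(hit, bin, mean)) -
             as.numeric(tapply(conf, bin, mean)))
  w <- as.numeric(table(bin)) / length(conf)
  sum(w * gap, na.rm = TRUE)         # empty bins contribute nothing
}

# Top-label version: only the argmax class and its probability are used
top <- max.col(p_hat, ties.method = "first")
conf <- p_hat[cbind(seq_len(N), top)]
ece_top <- ece(conf, y == top)

# Per-class version: every class's probability is checked, then averaged
ece_classwise <- mean(sapply(1:K, function(k) ece(p_hat[, k], y == k)))

c(top_label = ece_top, classwise = ece_classwise)
```

In this constructed case the top-label number should be near zero while the per-class number is clearly larger, because the errors sit entirely in the non-majority probabilities.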

Later, they use the standard method of reducing a $K$-class problem into $K$ "one against the rest" binary classification problems. They apply their recalibration approach to each one of these $K$ problems, yielding $K$ (hopefully) calibrated binary classifiers. However, these will likely not be consistent, i.e., the predicted probabilities will not sum to $1$. So Guo et al. normalize them. But I don't quite see how this preserves calibration - for the minority or the majority classes.
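A rough sketch of the one-against-the-rest step followed by normalization, with histogram binning used as a stand-in for the recalibration method. The simulated scores, the calibration/test split, and the bin count are all assumptions made for illustration. The sketch only shows that the per-class values do not sum to $1$ and that normalizing moves each of them; it does not establish whether calibration gets better or worse afterwards.

```r
set.seed(2021)
N <- 6000; K <- 3

# Simulated truth: class probabilities from a softmax over random logits
logits <- matrix(rnorm(N * K), N, K)
p_true <- exp(logits) / rowSums(exp(logits))
y <- apply(p_true, 1, function(p) sample.int(K, 1, prob = p))

# Overconfident scores (assumption): softmax of the logits scaled by 3
s <- exp(3 * logits) / rowSums(exp(3 * logits))

calib <- 1:3000        # used to fit the binning
test  <- 3001:6000     # used to inspect the result
breaks <- seq(0, 1, length.out = 11)

# One-vs-rest histogram binning: for class k, replace each score by the
# observed frequency of class k among calibration scores in the same bin
q <- sapply(1:K, function(k) {
  freq <- tapply(y[calib] == k,
                 cut(s[calib, k], breaks, include.lowest = TRUE), mean)
  bin_test <- cut(s[test, k], breaks, include.lowest = TRUE)
  out <- as.numeric(freq)[as.integer(bin_test)]
  ifelse(is.na(out), s[test, k], out)   # raw score if a bin was empty
})

row_sums <- rowSums(q)
q_norm <- q / row_sums   # rescale so each row sums to 1

range(row_sums)          # the K binary outputs need not sum to 1
max(abs(q_norm - q))     # how far normalization moves the binned values
```

Each column of `q` was fitted on its own binary problem, so any calibration it has refers to the unnormalized value; `q_norm` is a different number, and its calibration would have to be checked separately.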

pink lord · Accepted Answer · 2023-11-28 09:01:45Z

I found this paper useful: the pairwise ECE it proposes is a simple adaptation of the ECE that takes the non-majority class probabilities into consideration.

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Bot, Nov 28, 2023 at 9:06

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review – Shawn Hemelstrand, Nov 28, 2023 at 9:34
