The gist behind Harrell's rms::calibrate function makes sense to me. While I have yet to understand the magic that lets us calculate the "true" probabilities, I get the idea of comparing the predicted probabilities to the true probabilities, particularly in a simulation setting, where we do know the true probabilities.

    set.seed(2021)
    N <- 1000
    x <- runif(N, -2, 2)
    z <- x                                  # linear predictor
    pr <- 1/(1 + exp(-z))                   # true P(y = 1 | x)
    y <- rbinom(N, 1, pr)
    L <- glm(y ~ x, family = binomial)
    prob_preds <- 1/(1 + exp(-predict(L)))  # predicted P(y = 1 | x)
    plot(pr, prob_preds)
    abline(a = 0, b = 1)                    # reference line y = x

The plot is, more or less, the line $y = x$, meaning that the predicted probabilities agree with the true probability of $y=1$ (given by pr) at each value of $x$.
In the binary setting, it makes sense to me that, if we have a bunch of probability predictions around $0.8$, then about $80\%$ of them should have $y=1$ and about $20\%$ of them should have label $y=0$.
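To make that concrete, here is a minimal sketch of the binned check I have in mind. It reruns a larger version of the same kind of simulation; the sample size, the ten equal-width bins, and names such as `calib` are arbitrary choices of mine, not anything taken from rms::calibrate.

```r
# Binned calibration check for a binary model (illustrative sketch).
set.seed(2021)
N <- 10000
x <- runif(N, -2, 2)
pr <- 1 / (1 + exp(-x))                     # true P(y = 1 | x)
y <- rbinom(N, 1, pr)

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")    # predicted P(y = 1 | x)

# Equal-width bins on [0, 1]; drop bins that receive no predictions.
bins <- droplevels(cut(p_hat, breaks = seq(0, 1, by = 0.1),
                       include.lowest = TRUE))

# Per bin: how many predictions, their mean, and the observed share of y = 1.
calib <- data.frame(
  n         = as.vector(table(bins)),
  mean_pred = as.vector(tapply(p_hat, bins, mean)),
  frac_pos  = as.vector(tapply(y, bins, mean))
)
rownames(calib) <- levels(bins)
print(calib)
```

If the model is calibrated, `mean_pred` and `frac_pos` should be close in every row, up to sampling noise.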
In the multiclass setting, something similar should apply. If we want to predict the probability that, mutually exclusively, a photograph contains a yellow lab, golden retriever, or an Irish setter (three dog breeds), then if we get a bunch of probability predictions like $(P(\text{Lab}) = 0.7, P(\text{Golden}) = 0.2, P(\text{Setter}) = 0.1)$, about $70\%$ should be yellow labs, $20\%$ golden retrievers, and $10\%$ Irish setters. If we have that $70\%$ are yellow labs but the rest are golden retrievers, then the model is miscalibrated.
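A small simulation of that last scenario shows why checking only the top class is not enough. Everything here is hypothetical: the "model" outputs the same vector $(0.7, 0.2, 0.1)$ for every photo, and the labels are drawn so that $70\%$ are labs and the rest are goldens.

```r
# Top-label check versus class-wise check (illustrative sketch).
set.seed(2021)
n <- 20000
classes <- c("Lab", "Golden", "Setter")

# Hypothetical model output, identical for every photo.
pred <- c(Lab = 0.7, Golden = 0.2, Setter = 0.1)

# Hypothetical truth: labs 70% of the time, goldens otherwise, never setters.
truth <- c(Lab = 0.7, Golden = 0.3, Setter = 0.0)
y <- sample(classes, n, replace = TRUE, prob = truth)

# Top-label check: confidence in the predicted class versus its accuracy.
top_class <- names(pred)[which.max(pred)]
top_conf <- max(pred)
top_acc <- mean(y == top_class)

# Class-wise check: predicted probability versus observed frequency per class.
obs_freq <- sapply(classes, function(k) mean(y == k))
classwise_gap <- abs(pred[classes] - obs_freq)

print(c(confidence = top_conf, accuracy = top_acc))
print(round(rbind(predicted = pred[classes], observed = obs_freq), 3))
```

The top-label confidence and accuracy should roughly agree, while the golden retriever and Irish setter rows should each be off by about $0.1$.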
Guo's "On Calibration of Modern Neural Networks" seems only to calibrate the dominant class. That is, the paper aims to make sure that the $P(\text{Lab}) = 0.7$ is calibrated, but it totally misses the golden retriever and Irish setter probabilities. This is not an issue in the binary setting. If we just want to tell labs from goldens, then having $P(\text{Lab})$ be calibrated means that $P(\text{Golden})$ is calibrated, too (assuming every photo has exactly one dog breed present), since the two probabilities must add to $1$. Once we introduce a third class, however, knowing the probability of one class does not give us the probability of any other class.
For instance, Guo's equation (1) makes mention of the predicted class. If the prediction (or at least dominant class) is a yellow lab, what about the probabilities of being a golden retriever or an Irish setter? Guo's section 4.2 reinforces my belief that the non-dominant classes are not calibrated, since it only considers the maximum probability returned by the softmax function (ergo the probability of the dominant class).
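Written out, the contrast I have in mind is roughly the following. The first condition is my reading of the paper's notion of perfect calibration, with $\hat{Y} = \arg\max_k \hat{p}_k(X)$ the predicted class and $\hat{P} = \max_k \hat{p}_k(X)$ its confidence; the symbols $\hat{p}_k$ and $q$ are my own notation, introduced only for this comparison.

$$\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p \quad \text{for all } p \in [0, 1]$$

What I would expect from a model whose whole probability vector is calibrated is the stronger condition

$$\mathbb{P}\left(Y = k \mid \hat{p}(X) = q\right) = q_k \quad \text{for every class } k \text{ and every probability vector } q.$$

The first condition constrains only the largest entry of $\hat{p}(X)$, so it can hold even when the second fails for the other entries.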
Am I missing a place where Guo calibrated the probabilities of the non-dominant classes?
(In my terminology, the golden retriever and Irish setter are called the non-dominant classes, since they do not have the highest probability of class membership, and the yellow lab is the dominant class, since it has the highest probability of class membership. Note that no class needs to exceed a probability of $0.5$, so a dominant class could have a probability of $0.45$ if the other two have probabilities of $0.3$ and $0.25$, for instance.)
