
I was fitting a number of logistic regression models to a dataset in which the variables to predict were all binary. After fitting the models, I ran some simple code that uses the ROC curve to find the best threshold for each of them, and I noticed that every threshold I obtained coincided with the mean of the corresponding predicted variable (i.e. for a variable $y_i$ whose values were $16\%$ ones and the rest zeros, the ROC curve method suggested a threshold of $0.16$, and so on for the others).

I also fit other models in which class imbalance can be handled by weighting each class, such as random forests or XGBoost, and the thresholds suggested by the ROC curve for those were more diverse. So my questions are: is the ROC curve's optimal threshold always equal to the mean of the predicted variable in the case of logit models? If so, why does that happen? (Some intuition for why it works would be enough.) Lastly, are there any other models where this happens?
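A minimal numpy-only sketch of the setup described above (it does not use the actual models or code from the question, and the Beta-distributed scores are an assumption made for illustration): draw perfectly calibrated scores, in the sense that $y$ is Bernoulli with success probability equal to the score itself, trace the ROC curve over a threshold grid, and pick the point closest to the upper-left corner $(0, 1)$ by the $L^1$ distance. For calibrated scores, as a well-fit logistic regression tends to produce, this threshold lands near the prevalence of $y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
scores = rng.beta(2, 8, size=n)            # mean 0.2, so prevalence ~ 0.2
y = (rng.random(n) < scores).astype(int)   # calibrated by construction

# Sweep thresholds and compute the true/false positive rates at each one.
thresholds = np.linspace(0, 1, 501)
pos, neg = y == 1, y == 0
tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])

# L1 distance to the (0, 1) corner is fpr + (1 - tpr); minimizing it is the
# same as maximizing Youden's J = tpr - fpr.
best = thresholds[np.argmin(fpr + (1 - tpr))]
print(f"prevalence = {y.mean():.3f}, ROC-optimal threshold = {best:.3f}")
```

The intuition: maximizing Youden's $J$ picks the point where the ROC slope is $1$, i.e. where the score densities of the two classes are equal; for calibrated scores $s = P(y = 1 \mid s)$, that equal-density point occurs exactly at $s = $ prevalence.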

  • "Best" threshold according to what definition of "best"? – Commented Dec 3, 2024 at 19:54
  • The one that produces the point closest to the upper-left corner of the ROC curve; "closest" according to either the $L^1$ or the $L^2$ metric. The code I linked uses the $L^1$ distance. – Commented Dec 3, 2024 at 20:43
  • Divide all of your predictions by two (or ten). Does the ROC curve change? Does the optimal threshold change? – Commented Dec 3, 2024 at 21:07
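The last comment's point can be checked directly with a small numpy sketch (the synthetic scores here are an assumption for illustration, not the question's data): rescaling all predictions by a monotone map such as dividing by two leaves the ROC curve unchanged, so the "optimal" threshold simply rescales along with the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y = (rng.random(n) < 0.3).astype(int)
# Positives tend to score higher than negatives.
scores = np.clip(0.3 + 0.2 * y + 0.15 * rng.standard_normal(n), 0, 1)

def roc(y, s, grid):
    """True/false positive rates of the classifier 's >= t' over a grid of t."""
    tpr = np.array([(s[y == 1] >= t).mean() for t in grid])
    fpr = np.array([(s[y == 0] >= t).mean() for t in grid])
    return fpr, tpr

grid = np.linspace(0, 1, 201)
fpr1, tpr1 = roc(y, scores, grid)
fpr2, tpr2 = roc(y, scores / 2, grid / 2)   # halved scores, halved grid

# Same curve: (s/2 >= t/2) holds exactly when (s >= t).
assert np.allclose(fpr1, fpr2) and np.allclose(tpr1, tpr2)

best1 = grid[np.argmin(fpr1 + 1 - tpr1)]
best2 = (grid / 2)[np.argmin(fpr2 + 1 - tpr2)]
print(best1, best2)   # best2 is best1 / 2
```

This is why the threshold is only meaningful on the scale of the model's outputs: logistic regression emits (approximately calibrated) probabilities, so its ROC-optimal threshold tracks the prevalence, while models whose scores are shifted or rescaled by class weighting give more varied thresholds.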
