
I am currently training a random forest classifier (scikit-learn) on the Titanic dataset.

My question is related to this question on Stack Overflow: https://stackoverflow.com/questions/19984957/scikit-predict-default-threshold

I noticed that I didn't get the same values as scikit-learn for measures like precision, recall, and F1-score. After investigating, I found the reason: I was assigning individuals with a predicted probability of exactly 0.5 to class 1, while scikit-learn assigns them to class 0.

So here are my questions:

  • Is it better to assign individuals with a probability of exactly 0.5 to class 0 or to class 1? On the Titanic data, for example, this choice can significantly change the value of such measures.
  • Would it be legitimate simply to leave these individuals out? I do not think so, because that biases your results and may artificially improve them.
  • What about classification with more than two classes? If I get probabilities of 1/3, 1/3, 1/3 for one individual, what should I do?
  • Is there any performance measure free of this problem?
  • Does scikit-learn map 0.5 to class 0 every time, or can it be random / depend on the selected model?
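For what it's worth, the behaviour in question can be checked directly: for a RandomForestClassifier, predict() is the argmax over predict_proba(), and NumPy's argmax returns the first maximal index on a tie, so an exact 0.5/0.5 prediction lands deterministically in classes_[0] (class 0 here). A small sketch on made-up toy data (not the actual Titanic set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: two features, binary target (illustrative, not the Titanic data).
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, 2)

# predict() is equivalent to taking the argmax over predict_proba();
# np.argmax returns the *first* maximal index, so an exact 0.5/0.5 tie
# resolves to clf.classes_[0], i.e. class 0 here -- not randomly.
manual = clf.classes_[np.argmax(proba, axis=1)]
assert np.array_equal(manual, clf.predict(X))
```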
  • @FrankHarrell makes arguments in this thread that bear directly on this question, namely whether such cutoffs are desirable. stats.stackexchange.com/questions/65382/… (Commented Feb 4, 2014 at 14:52)
  • yes, my question is not that far from this link. However, I'm not limiting the context to highly unbalanced datasets. Imagine a balanced dataset where there are a lot of 0.5 probabilities for classification (in {0,1}). I need to know what to do with those. It is not that much about the tradeoff but rather how to derive a performance measure on the standard maximum likelihood prediction. (Commented Feb 4, 2014 at 15:10)
  • This may sound dumb, but you can altogether avoid this by generating an ensemble with an odd number of trees, e.g. ntree=1001. (Commented Feb 5, 2014 at 15:06)
  • This is not dumb, but I'm not sure this would be correct, as not all the individuals are in each tree because of the bootstrap part... (Commented Feb 5, 2014 at 16:10)

3 Answers


The two very standard things you can do are (i) assign one class if the probability is greater than or equal to 0.5 (or whatever threshold is appropriate for your task) and the other class if the probability is less than 0.5; and (ii) have some zone of probability where the uncertainty is too great to make a decision on that basis, i.e. a "reject" option (for multi-class problems, reject if the difference in probability between the most probable and the second most probable class is below some cut-off value).
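A minimal sketch of both decision rules (the function names and the margin values are illustrative choices, not a standard API):

```python
import numpy as np

def classify_with_reject(proba, threshold=0.5, reject_margin=0.1):
    """Binary decision rule with a reject option.

    proba: array of P(class 1) estimates.
    Returns 1, 0, or -1 (reject) per sample; -1 when the probability
    lies within `reject_margin` of the threshold.
    """
    proba = np.asarray(proba)
    labels = np.where(proba >= threshold, 1, 0)
    labels[np.abs(proba - threshold) < reject_margin] = -1
    return labels

def multiclass_with_reject(proba, margin=0.1):
    """Multi-class rule: reject when the gap between the most probable
    and the second most probable class is below `margin`."""
    proba = np.asarray(proba)
    top2 = np.sort(proba, axis=1)[:, -2:]  # two largest probabilities per row
    labels = np.argmax(proba, axis=1)
    labels[top2[:, 1] - top2[:, 0] < margin] = -1
    return labels
```

For example, `classify_with_reject([0.9, 0.55, 0.2])` yields `[1, -1, 0]`: the middle prediction is too close to the threshold to act on.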

I have to say I disagree with those who argue against having a threshold. It depends on the needs of the application; it isn't a statistical issue. In some applications you have to make a decision, and the quality of that decision may be something we need to measure. In some applications it is acceptable to have a reject option (for instance, it may be a screening test used to triage cases sent for a more expensive evaluation) and in some it isn't. In some applications the operational class frequencies or misclassification costs may be unknown or variable, in which case we are better off focussing on probability estimation in a way that is independent of the threshold (because we don't know its appropriate value). Unfortunately there are cases where probabilistic models give worse decisions (for a fixed threshold) than purely discriminative "hard" classifiers, such as the SVM, so we can't assume that probabilistic classifiers are a panacea - they aren't. To make the correct modelling and evaluation decisions, you need to think about the needs of the particular application and make the choices that meet its requirements.

Having said that, I am very much in favour of probabilistic models and proper scoring rules; it is just that they are not the (full) answer to every classification problem (and neither are SVMs or DNNs).

---

First, probably the best way to approach such predictions is as they are. That is, deal with the raw outputs of your model, and evaluate those predicted probabilities using proper scoring rules. See Why is accuracy not the best measure for assessing classification models? and Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity for details.

However, if you insist on using hard classifications (there can be legitimate reasons), the idea that makes the most sense to me is to randomize. If you get a prediction that is right on the nose of $0.5$, randomly assign a label according to a $\text{Bernoulli}(0.5)$ distribution, such as via numpy.random.binomial(1, 0.5, 1) in Python. You might have to set numpy.random.seed earlier in your script, but handling the predictions of $0.5$ this way avoids biasing your categorical predictions toward either category. Under most circumstances, a probability right on the nose of $0.5$ should be rather rare, so this should not make much of a difference, but perhaps you can feel comfortable having covered such a scenario.

EDIT

If you decide to use a threshold $p$ other than $0.5$ (which you are allowed to do), you can randomize according to $\text{Bernoulli}(p)$ distribution, such as via numpy.random.binomial(1, p, 1) in Python.
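A minimal sketch of this randomized tie-breaking, for a general threshold $p$ (the function name and defaults are my own choices):

```python
import numpy as np

def hard_labels(proba, threshold=0.5, rng=None):
    """Hard labels from P(class 1) estimates.

    Predictions strictly above `threshold` get class 1, strictly below get
    class 0. A prediction exactly equal to `threshold` is assigned class 1
    with probability `threshold`, i.e. drawn from Bernoulli(threshold),
    as suggested above.
    """
    rng = np.random.default_rng(rng)  # pass a seed for reproducibility
    proba = np.asarray(proba, dtype=float)
    labels = (proba > threshold).astype(int)
    ties = proba == threshold
    labels[ties] = rng.binomial(1, threshold, size=ties.sum())
    return labels
```

For example, `hard_labels([0.2, 0.5, 0.8])` always maps 0.2 to class 0 and 0.8 to class 1, while the exact 0.5 is resolved by a fair coin flip.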

  • As long as the probability isn't heavily quantised (e.g. probabilities from k-nn classifiers) probabilities of 0.5 are likely to be very rare. I would just use a threshold of >= 0.5 for the "positive class". It ought to make very little difference to the result and makes the experiments more repeatable. If it is highly quantised then randomising seems a good idea as a sort of "dithering" technique. (Commented Apr 24, 2023 at 16:00)

---

The question's focus on 0.5 conceals an important fact: any threshold applied to a continuous prediction implies some number of errors (false positives or false negatives). The question "How do I set a threshold?" is not answerable in a vacuum, but instead depends on the application and the cost of errors. It is important to consider the cost of an error alongside the probability of the error -- amputating a limb is dramatically different from administering an unnecessary dose of antibiotics.

Even if you are compelled to choose a cutoff for some reason, it is worthwhile to consider what error rates you can tolerate. Receiver Operating Characteristic (ROC) curves are a partial answer to that question, framing the choice of a cutoff as achieving a higher (lower) true positive rate at the cost of a higher (lower) false positive rate. That said, deciding on the appropriate TPR/FPR tradeoff is also contextual and depends on the goals of the model and how it is applied.
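As a sketch of how this might look in practice, the snippet below uses scikit-learn's roc_curve on made-up scores and picks the threshold minimising an assumed cost where a false negative is taken to be five times as costly as a false positive. The data and the cost ratio are illustrative only, not a recommendation:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy scores and labels (not from a real model).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Expected error counts at each candidate threshold:
# false positives = fpr * (# negatives), false negatives = (1 - tpr) * (# positives).
n_pos = (y_true == 1).sum()
n_neg = (y_true == 0).sum()
cost_fp, cost_fn = 1.0, 5.0  # assumed: a miss costs 5x a false alarm
cost = cost_fp * fpr * n_neg + cost_fn * (1 - tpr) * n_pos

best = thresholds[np.argmin(cost)]  # cutoff with the lowest assumed cost
```

With these toy numbers the asymmetric cost pushes the chosen cutoff below 0.5, trading extra false positives for fewer (expensive) false negatives.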
