
I am trying to predict the time it will take to complete some task, given some data. However, the important thing to me is that I would rather the model overestimate that time than underestimate it, even if the overall error would be smaller in the second case.

Which loss function and metrics should I use in such a situation?

  • You can write down a loss function and then write code to minimise it. In principle, that is the entire solution. Alternatively, you might find that working with the root or the log of time (the latter only if all times are positive) gives you an adequate approximation. Commented Jul 13, 2020 at 9:52
  • @NickCox What would you say to quantile regression at, say, quantile $0.25$? This would make the model prefer to miss low than to miss high. Commented Jul 13, 2020 at 9:56
  • How can I use it with library models, for example sklearn's random forest regressor? Commented Jul 13, 2020 at 10:00
  • @Dave That's changing the question, but the answer might be helpful. Commented Jul 13, 2020 at 10:59
  • How to do any of this with your preferred software is a different question, and in any event I couldn't offer advice on software I've never used. Commented Jul 13, 2020 at 11:00

1 Answer


You might be interested in quantile regression. When you run a quantile regression, you get to decide how heavily high misses and low misses are penalized, and the two penalties do not have to be equal. You could fit a high quantile (perhaps $0.75$) so that the model tends to aim high.

Quantile regression optimizes the following loss function $L_{\tau}$ (the "pinball" loss), where $\tau$ is the quantile you want to estimate.

$$ l_{\tau}(y_i, \hat y_i) = \begin{cases} \tau\vert y_i - \hat y_i\vert, & y_i - \hat y_i \ge 0 \\ (1 - \tau)\vert y_i - \hat y_i\vert, & y_i - \hat y_i < 0 \end{cases}\\ L_{\tau}(y, \hat y) = \sum_{i=1}^n l_{\tau}(y_i, \hat y_i) $$

[Figure: the pinball loss $l_{\tau}$ as a function of the residual $y - \hat y$, for several values of $\tau$.]

When $\tau=0.5$, low and high misses are penalized equally. If $\tau>0.5$, missing low incurs a more severe penalty than missing high, incentivizing your model to miss high rather than miss low.
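As a concrete check of this asymmetry, the loss above takes only a few lines of plain Python. The helper names and the toy data here are illustrative, not from any library; the example also shows a useful fact: the best *constant* prediction under $L_{\tau}$ is (an) empirical $\tau$-quantile of the data, so raising $\tau$ shifts the prediction upward.

```python
def pinball_loss(y, y_hat, tau):
    # tau * |error| when we underestimate (y >= y_hat),
    # (1 - tau) * |error| when we overestimate (y < y_hat).
    error = y - y_hat
    return tau * error if error >= 0 else (tau - 1) * error

def total_loss(ys, c, tau):
    """Total pinball loss of a constant prediction c over all observations."""
    return sum(pinball_loss(y, c, tau) for y in ys)

ys = list(range(1, 101))  # toy completion times 1..100

# With tau = 0.75, underestimating costs three times as much as
# overestimating, so the best constant prediction shifts upward:
best_50 = min(ys, key=lambda c: total_loss(ys, c, 0.5))   # ~ median
best_75 = min(ys, key=lambda c: total_loss(ys, c, 0.75))  # ~ 75th percentile
```

Searching only over the observed values keeps the sketch simple; the minimizer at $\tau=0.5$ lands at the median (50) and at $\tau=0.75$ near the 75th percentile (75).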

As far as Python goes, quantile random forests appear to be implemented in scikit-garden. More common (even if not what works for you) would be a linear quantile regression, which is implemented in sklearn and in statsmodels.

  • I'm not sure I say anything different in this answer than in this post, however. Commented Feb 20, 2023 at 16:01
  • Do you mind if I edit a picture of the hinge loss into this answer? Commented Feb 20, 2023 at 16:16
  • @JohnMadden Hinge or pinball? Commented Feb 20, 2023 at 16:19
  • Oh yes, pinball indeed (how did I get such descriptive names mixed up ;)) Commented Feb 20, 2023 at 16:20
  • @JohnMadden Sure, edit in a picture of the pinball loss. // A potentially related discussion about quantile regression. Commented Feb 20, 2023 at 16:21
