
Suppose we have an objective function with a fixed "threshold" $\delta > 0$, for example $$ L(y_i, \hat{y}_i) = \begin{cases} (y_i - \hat{y}_i)^2 & \text{if } |\epsilon_i| := |y_i - \hat{y}_i| \leq \delta \\ \delta^2 & \text{if } |\epsilon_i| > \delta \end{cases} $$ The intuition here is that if the prediction is far off (more than $\delta$ away), then it is a poor estimate and we are indifferent between a poor estimate and an extremely poor one. (This is not a great example, but let's pretend it is. One application might be evaluating predictions from different models and assigning each a score.)

When we are optimizing with gradient-based methods, we generally want a nice smooth objective, so we might use a differentiable proxy for the above example, such as standard MSE. Otherwise, whenever $|\epsilon_i| > \delta$ the gradient is zero, which may hamper training. The downside, however, is that we then penalize extremely poor estimates even though we don't care about them.

[Plot: smooth proxy (left) vs. thresholded objective (right)]

In the example plot above, $y_i = 2$ and $\delta = 2$. The true objective, with the threshold, is on the right; the smooth proxy is on the left.
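
For concreteness, here is a minimal R sketch (mine, not part of the original post; the function and variable names are arbitrary) that reproduces a plot along these lines:

# Thresholded squared-error loss from the question: quadratic inside
# [-delta, delta], constant (delta^2) outside, so the gradient vanishes there.
thresholded_loss <- function(y, y_hat, delta) {
  eps <- y - y_hat
  ifelse(abs(eps) <= delta, eps^2, delta^2)
}

y_true <- 2
delta  <- 2
y_hat  <- seq(-6, 10, length.out = 400)

par(mfrow = c(1, 2))
# Smooth proxy: plain squared error, keeps penalising arbitrarily bad predictions
plot(y_hat, (y_true - y_hat)^2, type = 'l',
     xlab = "Prediction", ylab = "Loss", main = "Smooth proxy (squared error)")
grid()
# True objective: flat (zero gradient) once the residual exceeds delta
plot(y_hat, thresholded_loss(y_true, y_hat, delta), type = 'l',
     xlab = "Prediction", ylab = "Loss", main = "Thresholded objective")
abline(v = y_true + c(-1, 1) * delta, col = 'red', lty = 2)
grid()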

What might be a better way of creating a "smooth threshold", so that $L(y_i, \hat{y}_i)$ is approximately constant when $|\epsilon_i| > \delta$ yet still depends on $\epsilon_i$, making it differentiable with non-zero gradients?


2 Answers


I suggest using the pseudo-Huber loss function. It is quadratic for residuals well inside $[-\delta, \delta]$ and behaves like $L_1$ for residuals beyond $\delta$, so we get a differentiable loss with non-zero gradients everywhere, as asked. We can also tweak it to make it flatter than $L_1$ if we want to (but I wouldn't recommend it, as even very bad predictions can contain "some" signal).
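
For reference (the answer assumes this; it is the standard definition rather than something stated in the post), the pseudo-Huber loss of a residual $\alpha = y - \hat{y}$ is $$ L_\delta(\alpha) = \delta^2\left(\sqrt{1 + \left(\tfrac{\alpha}{\delta}\right)^2} - 1\right), $$ which is approximately $\alpha^2/2$ when $|\alpha| \ll \delta$ and approximately $\delta(|\alpha| - \delta)$ when $|\alpha| \gg \delta$.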

EDIT: A quick-and-dirty way to change the behaviour of the pseudo-Huber loss so that its gradient tends to zero as the residual $\alpha$ gets larger is to manipulate the exponent of $\left(1 + (\tfrac{\alpha}{\delta})^2\right)$. The original exponent is $0.5$ (i.e. we take the square root), but lower values cause the loss outside $[-\delta,\delta]$ to flatten out and become concave. As a consequence, the derivative no longer approaches $\pm\delta$ but instead smoothly decays towards zero. I made a quick plot below:

[Plot: pseudo-Huber loss and its numerical derivative, for the original exponent (0.5) and the new exponent (0.123)]

The code to generate the plot is this:

library(numDeriv)

# Pseudo-Huber loss with an adjustable exponent ("tails");
# tails = 0.5 recovers the standard pseudo-Huber loss.
custom_pseudo_huber_loss <- function(res, delta, tails = 0.5) {
  loss <- delta^2 * ((1 + (res / delta)^2)^(tails) - 1)
  return(loss)
}

my_delta <- 2
set.seed(44)
res <- sort(runif(1234, -15, 45))

par(mfrow = c(2, 2), pty = "s")

# Loss with the original exponent (0.5)
plot(x = res, main = "Using original exponent (0.5)",
     y = custom_pseudo_huber_loss(res, delta = my_delta),
     type = 'l', xlab = 'Residual', ylab = "Pseudo Huber loss value")
abline(v = c(-1, 1) * my_delta, col = 'red')
grid()
legend("topleft", c("function vals", "delta vals"),
       col = c("black", "red"), lty = 1)

# Numerical derivative of the loss with the original exponent
plot(x = res, main = "Using original exponent (0.5)",
     y = grad(splinefun(x = res,
                        y = custom_pseudo_huber_loss(res, delta = my_delta),
                        "natural"),
              x = res),
     type = 'l', xlab = 'Residual', ylab = "Pseudo Huber loss derivative value",
     ylim = c(-1, 1) * my_delta)
abline(h = c(-1, 1) * my_delta, col = 'red')
grid()

tails <- 0.123

# Loss with the smaller exponent
plot(x = res, main = paste0("Using new exponent (", tails, ")"),
     y = custom_pseudo_huber_loss(res, delta = my_delta, tails = tails),
     type = 'l', xlab = 'Residual', ylab = "Pseudo Huber loss value")
abline(v = c(-1, 1) * my_delta, col = 'red')
grid()

# Numerical derivative of the loss with the smaller exponent
plot(x = res, main = paste0("Using new exponent (", tails, ")"),
     y = grad(splinefun(x = res,
                        y = custom_pseudo_huber_loss(res, delta = my_delta,
                                                     tails = tails),
                        "natural"),
              x = res),
     type = 'l', xlab = 'Residual', ylab = "Pseudo Huber loss derivative value",
     ylim = c(-1, 1) * my_delta)
abline(h = c(-1, 1) * my_delta, col = 'magenta')
abline(v = c(-1, 1) * my_delta, col = 'red')
grid()
legend("topleft", c("function vals", "delta vals", "original limits"),
       col = c("black", "red", "magenta"), lty = 1)
  • Thanks! I am familiar with the Huber loss, but didn't think about adjusting the parameters like that. However I'm looking for a more general solution for any loss $L(y, \hat{y})$ with a threshold, not necessarily MSE inside $[-\delta, \delta]$. Furthermore it seems that adjusting the parameters would also greatly affect the shape of the loss inside the interval, which is undesirable. Commented Oct 16, 2024 at 14:10
  • I was thinking along the lines of some sort of smooth extension of $L$ inside $[-\delta, \delta]$ to the constant function outside the interval. I think this should be possible, by Whitney's Extension Theorem for example. Perhaps there's a way to do this using smooth bump functions. Commented Oct 16, 2024 at 14:19
  • (Sorry, I forgot to answer at the time.) We can increase the loss "inside the threshold" too by amping up the other exponents; of course it takes a bit of playing around. I really think that the constant function is not going to be beneficial, because if, say, we have a loss of 100 arbitrary units and that is "bad", we want to know that if we move to 99 we are doing "better". If outside our interval $r_\delta = [-\delta, \delta]$ the loss is fixed, then our gradient methods are moot. That's why I mentioned an $L_1$ loss outside $r_\delta$; zero gradient indicates that we may have found an optimum... Commented Oct 22, 2024 at 9:46
  • Yes, agreed: zero gradients outside the interval are not ideal. Though the Huber's specific parametric form may be too restrictive for general loss functions, since it is really designed for $L_2$ inside and $L_1$ outside, even if we were to tune the parameters. However, I have taken inspiration from your ideas and figured it out. Commented Oct 22, 2024 at 15:05

I have figured out the solution. Let $L^{(1)}, L^{(2)}$ denote the "inner" and "outer" losses respectively. In the example given in the question, we have $$ L(y, \hat{y}) = \begin{cases} L^{(1)}(y, \hat{y}) = (y-\hat{y})^2 & \text{if } |y-\hat{y}| \leq \delta \\ L^{(2)}(y,\hat{y}) = \delta^2 & \text{otherwise} \end{cases} $$ The trick here is as follows. Instead of viewing this thresholded loss as a piecewise function, we may think of it as \begin{align*} L(y,\hat{y}) &= \min\left( L^{(1)}(y,\hat{y}), L^{(2)}(y,\hat{y}) \right) \\ &= -\max\left( -L^{(1)}(y,\hat{y}), -L^{(2)}(y,\hat{y}) \right). \end{align*} We can then use various "soft-max" operations from the ML literature, for example log-sum-exp: $$ L(y,\hat{y}) \approx -\log\left(\exp\left(-L^{(1)}(y,\hat{y})\right) + \exp\left(-L^{(2)}(y,\hat{y})\right)\right) $$ In the example above, a smooth proxy is then $$ L(y,\hat{y}) \approx -\log\left(e^{-(y-\hat{y})^2} + e^{-\delta^2}\right) $$
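
As a quick illustration (this sketch is mine, written in R to match the earlier answer's code; the function names are arbitrary), here is the smooth proxy plotted against the hard-thresholded loss for the $y = 2$, $\delta = 2$ example from the question:

# Smooth proxy via the log-sum-exp "soft-min" of the inner and outer losses.
soft_threshold_loss <- function(y, y_hat, delta) {
  L1 <- (y - y_hat)^2   # inner loss L^(1)
  L2 <- delta^2         # outer (constant) loss L^(2)
  -log(exp(-L1) + exp(-L2))
}

y_true <- 2
delta  <- 2
y_hat  <- seq(-6, 10, length.out = 400)

# Exact thresholded loss, written directly as the min of the two losses
hard_loss <- pmin((y_true - y_hat)^2, delta^2)

plot(y_hat, hard_loss, type = 'l',
     xlab = "Prediction", ylab = "Loss", main = "Hard vs. smooth threshold")
lines(y_hat, soft_threshold_loss(y_true, y_hat, delta), col = 'blue')
abline(v = y_true + c(-1, 1) * delta, col = 'red', lty = 2)
legend("topright", c("hard (min)", "log-sum-exp soft-min"),
       col = c("black", "blue"), lty = 1)

One caveat: log-sum-exp sits below the true minimum by up to $\log 2$ (the gap is largest where $L^{(1)} = L^{(2)}$), so the smooth curve dips slightly under the hard one near the threshold.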

[Plot: hard-thresholded loss and its smooth log-sum-exp approximation]

  • Note: whether this is a "good" loss is an entirely different story. As @usεr11852 points out, residuals outside the interval lead to zero gradients in the thresholded loss. With the smoothed version the gradients are non-zero but asymptotically approach zero, which only mitigates the problem to some extent. It would still create lots of local minima in the loss surface. In some applications a smooth thresholded loss can work great, but in others it may be better to just use the non-thresholded loss (only $L^{(1)}$), at least for training. Commented Oct 22, 2024 at 15:13
  • Thank you for coming back and writing this. It is definitely useful. (+1) Commented Oct 22, 2024 at 15:47
