Suppose we have an objective function with a fixed "threshold" $\delta > 0$, for example $$ L(y_i, \hat{y}_i) = \begin{cases} (y_i - \hat{y}_i)^2 & \text{if } |\epsilon_i| := |y_i - \hat{y}_i| \leq \delta \\ \delta^2 & \text{if } |\epsilon_i| > \delta \end{cases} $$ The intuition is that if the prediction is far off (more than $\delta$ away), it is a poor estimate, and we are indifferent between a poor estimate and an extremely poor estimate. (This is a somewhat contrived example, but let's pretend it's the case. One application might be evaluating predictions from different models and assigning each a score.)
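As a minimal sketch of this loss (the function names and NumPy implementation are my own, not from any particular library), here is the thresholded objective and its gradient with respect to the prediction; note that the gradient vanishes exactly once the error exceeds $\delta$:

```python
import numpy as np

def hard_threshold_loss(y, y_hat, delta=2.0):
    """Squared error, capped at delta**2 once |y - y_hat| exceeds delta."""
    eps = y - y_hat
    return np.where(np.abs(eps) <= delta, eps**2, delta**2)

def hard_threshold_grad(y, y_hat, delta=2.0):
    """Gradient w.r.t. y_hat: -2*eps inside the threshold, exactly zero outside."""
    eps = y - y_hat
    return np.where(np.abs(eps) <= delta, -2.0 * eps, 0.0)

# With y = 2 and delta = 2, any prediction more than 2 away from the target
# contributes the constant loss delta**2 = 4 and a zero gradient.
for y_hat in [1.5, 3.0, 6.0]:
    print(y_hat, hard_threshold_loss(2.0, y_hat), hard_threshold_grad(2.0, y_hat))
```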
When optimizing with gradient-based methods, we generally want a nice smooth function, so we might replace the objective above with a differentiable proxy such as standard MSE. Otherwise, whenever $|\epsilon_i| > \delta$, the gradient is zero, which can hamper training. The downside of the proxy, however, is that it penalizes extremely poor estimates even though we don't care about them.
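To make that trade-off concrete, here is a small comparison (again only a sketch in NumPy, using the same $\delta$ as above) of the capped loss against the plain-MSE proxy on a grid of errors:

```python
import numpy as np

delta = 2.0
eps = np.linspace(-6.0, 6.0, 7)  # a grid of errors y - y_hat

hard = np.where(np.abs(eps) <= delta, eps**2, delta**2)  # capped at delta**2 = 4
mse = eps**2                                             # grows without bound

# The proxy has non-zero gradients everywhere, but it keeps penalizing
# estimates that the true objective is indifferent about.
for e, h, m in zip(eps, hard, mse):
    print(f"eps={e:+.1f}  hard={h:4.1f}  mse={m:5.1f}")
```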
In the example plot above, $y_i = 2$ and $\delta = 2$: the true thresholded objective is shown on the right, and the smooth proxy is shown on the left.
What would be a better way of constructing a "smooth threshold", i.e. a loss $L(y_i, \hat{y}_i)$ that is approximately constant when $|\epsilon_i| > \delta$ but still depends on $\epsilon_i$, so that it is differentiable and has non-zero gradients?
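To make the desired shape concrete, one illustrative candidate (essentially a Welsch-style saturating loss, included only as an example of the properties above and not necessarily the best choice) is $$ L_{\text{smooth}}(y_i, \hat{y}_i) = \delta^2 \left(1 - e^{-\epsilon_i^2 / \delta^2}\right), $$ which behaves like $\epsilon_i^2$ for small errors, approaches the constant $\delta^2$ for large ones, and is everywhere differentiable with small but non-zero gradients. The question is whether there is a better or more standard construction of this kind.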


