I just noticed that when using ridge regression there is a small subtlety in which parameters get penalised: we don't penalise $\theta_0$. Can someone give me a simple and intuitive explanation of why it's important to keep the intercept out of the regularisation term?
I am assuming the following optimisation problem:
$$\hat{\theta}_{\textrm{ridge}} = \underset{\theta}{\operatorname{argmin}} \quad \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{j=1}^{d} \theta_j^2$$
where $n$ is the number of data points in our dataset and $d+1$ is the number of parameters $(\theta_0, \ldots, \theta_d)$. Note also that the intercept is implicitly included in my $f(x_i)$ function by augmenting each input as $x_i := [1 \quad x_i]^T$, so that $\theta_0$ is absorbed into the model.
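
To make the setup concrete, here is a minimal NumPy sketch of how I understand the closed-form solution of this objective (the synthetic data and the value of $\lambda$ are just made up for illustration); the only point that matters is the penalty matrix `P`, whose top-left entry is zero so that $\theta_0$ is not shrunk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.normal(size=(n, d))
y = 5.0 + X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# x_i := [1  x_i]^T, so the first column corresponds to the intercept theta_0
X = np.hstack([np.ones((n, 1)), X_raw])
lam = 1.0

# Penalty matrix: identity except for the (0, 0) entry, so theta_0 is NOT penalised
P = np.eye(d + 1)
P[0, 0] = 0.0

# Closed form: theta_hat = (X^T X + lambda * P)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(theta_hat)  # first entry is the unpenalised intercept
```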
Thanks!