I just noticed that when using ridge regression there is a small subtlety in which parameters get penalised: we don't penalise $\theta_0$. Can someone give me a simple and intuitive explanation of why it's important to keep the intercept out of the regularisation term?
I am assuming the following optimisation problem:
$$\hat{\theta}_{\textrm{ridge}} = \underset{\theta}{\operatorname{argmin}} \quad \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{j=1}^{d} \theta_j^2$$
where $n$ is the number of data points in our dataset and $d+1$ is the number of parameters $(\theta_0, \ldots, \theta_d)$. Note also that the intercept is implicitly included in my $f(x_i)$ function by augmenting each input as $x_i := [1 \quad x_i]^T$, so that $\theta_0$ is absorbed into the model.
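
To make the setup concrete, here is a minimal NumPy sketch of how I understand the closed-form solution of this objective (the synthetic data and the value of $\lambda$ are just made up for illustration); the only point that matters is the penalty matrix `P`, whose top-left entry is zero so that $\theta_0$ is not shrunk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_raw = rng.normal(size=(n, d))
y = 5.0 + X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# x_i := [1  x_i]^T, so the first column corresponds to the intercept theta_0
X = np.hstack([np.ones((n, 1)), X_raw])
lam = 1.0

# Penalty matrix: identity except for the (0, 0) entry, so theta_0 is NOT penalised
P = np.eye(d + 1)
P[0, 0] = 0.0

# Closed form: theta_hat = (X^T X + lambda * P)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(theta_hat)  # first entry is the unpenalised intercept
```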
Thanks!