For a linear model $y = \beta_0 + x^\top \beta + \varepsilon$, the shrinkage penalty is typically of the form $P(\beta)$, i.e. it involves only the slope coefficients.
Why do we not shrink the intercept (bias) term $\beta_0$? And by comparison, should we shrink the bias terms in a neural network?
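To make the question concrete, here is a small NumPy sketch (illustrative, not from any particular library) of ridge regression done the standard way: center $X$ and $y$, solve the penalized problem for $\beta$ only, and recover the unpenalized intercept from the means. The data, the true coefficients, and $\lambda$ are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
# Deliberately large intercept: shrinking it toward 0 would bias predictions.
y = 10.0 + X @ beta_true + rng.normal(scale=0.1, size=n)

lam = 1.0  # ridge penalty strength (arbitrary for the demo)

# Ridge with the intercept left unpenalized: center X and y,
# solve (Xc'Xc + lam*I) beta = Xc'yc, then recover beta_0.
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta0 = y_mean - X_mean @ beta  # intercept absorbs the means

print(beta0, beta)
```

Because $\beta_0$ is computed from the residual means rather than from the penalized system, the solution is invariant to adding a constant to $y$: only the intercept changes, not the slopes. If $\beta_0$ were included in $P(\beta)$, that invariance would be lost.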