
Reason for not shrinking the bias term

For a linear model, $y = \beta_0 + x^\top\beta + \varepsilon$, the shrinkage penalty typically takes the form $P(\beta)$.

Why do we not shrink the bias term $\beta_0$? By comparison, should we shrink the bias term in a neural network model?
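To make the setup concrete, here is a small NumPy sketch (my own illustration, not from the question) of ridge regression where the penalty $\lambda\lVert\beta\rVert^2$ is applied to the slopes only, with the bias $\beta_0$ left unpenalized. The data and the `lam` value are arbitrary choices for demonstration.

```python
import numpy as np

# Illustration: ridge regression with an UNPENALIZED intercept.
# The penalty matrix has a zero in the position corresponding to beta_0,
# so shrinkage acts on the slopes beta only.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -2.0, 0.5])
y = 10.0 + X @ true_beta + 0.1 * rng.normal(size=n)  # large true intercept

lam = 5.0
Xa = np.hstack([np.ones((n, 1)), X])  # prepend a column of ones for beta_0
P = lam * np.eye(p + 1)
P[0, 0] = 0.0                         # do NOT shrink the bias term
coef = np.linalg.solve(Xa.T @ Xa + P, Xa.T @ y)

print(coef)  # coef[0] stays near 10 even under heavy shrinkage of the slopes
```

If `P[0, 0]` were set to `lam` instead, the fitted intercept would be pulled toward zero, making predictions depend on an arbitrary shift of the response.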

yliueagle