Say I want to implement Conv2D in Keras. For each Conv2D layer, if I apply 20 filters of size [2, 3] to an input with depth 10, then there will be 20 * (2*3*10 + 1) = 1220 trainable weights.
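For reference, here is a minimal sketch that checks this count (the 28x28 spatial size is just a placeholder I picked):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 10)),              # input with depth 10
    tf.keras.layers.Conv2D(20, kernel_size=(2, 3)),  # 20 filters of size [2, 3]
])

# Each filter has 2*3*10 = 60 kernel weights plus 1 bias, times 20 filters:
print(model.count_params())  # 1220
```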
Since the L1 norm is a sum of |w| over every weight, its value will grow roughly in proportion to the number of trainable weights. The same holds for the L2 norm.
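A rough numerical illustration of what I mean, assuming weights drawn from the same distribution regardless of layer size (the 0.05 standard deviation is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10_000, 1_000_000):
    w = rng.normal(0.0, 0.05, size=n)  # same per-weight scale for both sizes
    print(n, np.abs(w).sum())          # L1 norm grows ~linearly with n
```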
So shouldn't lambda, as in kernel_regularizer=l1(lambda), be inversely proportional to the number of trainable weights?
Intuitively, if a lambda of 0.1 worked well for 10,000 weights, then applying the same or a larger lambda to 1 million weights doesn't make sense to me.
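To make the question concrete, here is a sketch of the scaling I'm imagining; l1_per_weight is my own hypothetical helper, not a Keras API:

```python
from tensorflow.keras import layers, regularizers

def l1_per_weight(base_lambda, num_weights):
    """Scale lambda so the total penalty is roughly independent of layer size."""
    return regularizers.l1(base_lambda / num_weights)

# For the layer above: kernel_regularizer only sees the 2*3*10*20 = 1200
# kernel weights; the 20 biases are excluded unless bias_regularizer is set.
conv = layers.Conv2D(
    20, kernel_size=(2, 3),
    kernel_regularizer=l1_per_weight(0.1, 1200),
)
```

Is something like this normalization standard practice, or is there a reason the same lambda is typically reused across layers of very different sizes?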