
I am training a deep learning model, the loss function of which is of the form

$$ \mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 $$

where $\mathcal{L}_1$ and $\mathcal{L}_2$ are of very different orders of magnitude. Without loss of generality, assume $\mathcal{L}_1$ is much larger than $\mathcal{L}_2$.
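For concreteness, here is a minimal sketch of such a loss in PyTorch. This is my own toy construction, not the actual model in question: I assume a large-scale data term (mean-squared error on large-valued targets) plus a much smaller weight-penalty term.

```python
import torch

# Toy illustration only (assumed loss terms, not the asker's actual losses):
# L1 is a mean-squared error on raw, large-valued targets,
# L2 is a small weight penalty; with typical inputs L1 dwarfs L2.
def combined_loss(pred, target, params):
    L1 = torch.mean((pred - target) ** 2)               # large-scale term
    L2 = 1e-4 * sum(p.pow(2).sum() for p in params)     # small-scale term
    return L1 + L2, L1.detach(), L2.detach()

# Example usage with dummy data:
w = torch.randn(5, requires_grad=True)
x, y = torch.randn(16, 5), 100.0 * torch.randn(16)      # large-valued targets
total, l1, l2 = combined_loss(x @ w, y, [w])
print(l1, l2)   # l1 is typically several orders of magnitude above l2
```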

During the first several epochs of training, the model will mostly be minimizing $\mathcal{L}_1$, since it dominates the total loss. However, after a certain number of epochs, the value of $\mathcal{L}_1$ will converge.

My question is, what will happen now? Specifically, I have three questions:

  • Does the convergence of $\mathcal{L}_1$ imply the convergence of $\mathcal{L}$, meaning that training is effectively over and the loss behaved as if it were essentially $\mathcal{L} = \mathcal{L}_1$?

  • Since $\mathcal{L}_1$ has now converged, does that imply $\frac{\partial \mathcal{L}_1}{\partial \theta} \approx 0$, where $\theta$ denotes the model parameters?

  • If the above is true, then, since the model parameters are updated based on $\frac{\partial \mathcal{L}}{\partial \theta}$, does that imply the model will now start minimizing $\mathcal{L}_2$ (since $\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\partial \mathcal{L}_2}{\partial \theta}$)?


1 Answer

  1. Not in general. Consider, for example, $\mathcal{L}_2 = 1/\mathcal{L}_1$: if $\mathcal{L}_1$ converges to $0$, then $\mathcal{L}_2$ (and hence $\mathcal{L}$) diverges.
  2. Yes; that is essentially what convergence means in this gradient-based setting, although the "$\approx 0$" is usually not defined precisely.
  3. Exactly. However, after one update driven mainly by $\frac{\partial \mathcal{L}_2}{\partial \theta}$, you will most likely find that $\frac{\partial \mathcal{L}_1}{\partial \theta}$ is no longer (near) zero, so the two terms keep interacting; the toy sketch below illustrates this.
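To illustrate points 2 and 3, here is a one-parameter toy sketch of my own construction (assumed shapes $\mathcal{L}_1 = 1000(\theta-1)^2$ and $\mathcal{L}_2 = 0.01\,\theta^2$, plain gradient descent): starting exactly at $\mathcal{L}_1$'s minimum, the first update is driven entirely by $\frac{\partial \mathcal{L}_2}{\partial \theta}$, and immediately afterwards $\frac{\partial \mathcal{L}_1}{\partial \theta}$ is nonzero again.

```python
import torch

# Toy construction (assumed shapes, not the asker's actual losses):
# L1(theta) = 1000 * (theta - 1)^2  -> large-scale term, minimum at theta = 1
# L2(theta) = 0.01 * theta^2        -> small-scale term, minimum at theta = 0
theta = torch.tensor(1.0, requires_grad=True)   # start exactly at L1's minimum
lr = 1e-3

for step in range(3):
    L1 = 1000.0 * (theta - 1.0) ** 2
    L2 = 0.01 * theta ** 2
    (L1 + L2).backward()
    with torch.no_grad():
        dL1 = 2000.0 * (theta - 1.0)    # analytic dL1/dtheta
        dL2 = 0.02 * theta              # analytic dL2/dtheta
        print(f"step {step}: dL/dtheta={theta.grad.item():+.4f}  "
              f"dL1/dtheta={dL1.item():+.4f}  dL2/dtheta={dL2.item():+.4f}")
        theta -= lr * theta.grad        # plain gradient-descent update
        theta.grad.zero_()
# Step 0: dL1/dtheta is exactly 0, so the update follows dL2/dtheta alone.
# Step 1 onwards: theta has moved off L1's minimum, so dL1/dtheta != 0 again.
```

In this toy setting, after a single step the re-emerging $\frac{\partial \mathcal{L}_1}{\partial \theta}$ is already larger in magnitude than $\frac{\partial \mathcal{L}_2}{\partial \theta}$, which is why the two terms keep trading off rather than training cleanly switching over to minimizing $\mathcal{L}_2$.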
