
If I were to do loss = loss/10 before calculating the gradient, would that change the amount of change applied to the model parameters during backpropagation?

Or does the amount of change depend only on the direction of the gradient and the learning rate?

I'm especially interested in how this works in PyTorch.


1 Answer


By the chain rule, scaling the loss by a scalar value c, i.e. loss = c*loss, causes all gradients computed via backprop to also be scaled by c: if loss -> c*loss, then grad -> c*grad.

Scaling the gradients by c changes the magnitude of the gradient vectors but not the direction.
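As a minimal PyTorch sketch you can check this directly (the linear model, random data, and c = 0.1 here are just for illustration):

    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(4, 1)
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    c = 0.1  # e.g. loss = loss / 10

    # gradients of the unscaled loss
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    grad_unscaled = model.weight.grad.clone()

    # gradients of the scaled loss
    model.zero_grad()
    loss = c * torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    grad_scaled = model.weight.grad.clone()

    print(torch.allclose(grad_scaled, c * grad_unscaled))  # True: grads are scaled by c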

In a gradient descent context, scaling the gradients by c is equivalent to scaling the learning rate by c:

    loss = ...
    w_new = w_old - lr * grad

becomes

    loss = c*loss
    w_new = w_old - lr * c * grad     # scaling loss by c -> scaling grad by c
    w_new = w_old - lr_scaled * grad  # lr_scaled = lr * c
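
A short PyTorch sketch of that equivalence, assuming plain SGD (no momentum or weight decay) and an illustrative model and data:

    import copy
    import torch

    torch.manual_seed(0)
    model_a = torch.nn.Linear(4, 1)
    model_b = copy.deepcopy(model_a)  # identical starting weights
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    lr, c = 0.1, 0.1

    # A: loss scaled by c, learning rate lr
    opt_a = torch.optim.SGD(model_a.parameters(), lr=lr)
    (c * torch.nn.functional.mse_loss(model_a(x), y)).backward()
    opt_a.step()

    # B: unscaled loss, learning rate lr * c
    opt_b = torch.optim.SGD(model_b.parameters(), lr=lr * c)
    torch.nn.functional.mse_loss(model_b(x), y).backward()
    opt_b.step()

    print(torch.allclose(model_a.weight, model_b.weight))  # True: same update

Note that this one-to-one correspondence holds for vanilla SGD; optimizers that normalize the gradient (e.g. Adam) do not preserve it exactly.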