
That is: because the error originates at the end of the neural network (i.e., at the output layer) and trickles back to the start via backpropagation, does that mean the weights near the end change more than the weights near the start? Is this true, and if so, how do I show it?

Put more generally: what is the distribution of the weight deltas across the layers of the network?

Intuitively I expected this to be true for gradient descent (left) but not for sign gradient descent (right):

[Figure: sketch of per-layer weight deltas, decaying toward the input for gradient descent (left) vs. uniform across layers for sign gradient descent (right)]

and then ended up discovering that nanoGPT trains faster with sign gradient descent than with AdamW: https://github.com/nullonesix/sign_nanoGPT

I'm wondering whether my intuition can be proven correct (if it is). Thanks in advance for any other intuition or resources on this topic.
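For a quick numerical check of the intuition, here is a minimal NumPy sketch (a hypothetical toy sigmoid MLP, not the nanoGPT setup): sigmoid activations stay O(1) at every depth, but sigmoid'(z) ≤ 0.25, so the backpropagated error shrinks layer by layer on its way to the input. Plain gradient descent then takes smaller steps in early layers, whereas sign gradient descent takes the same per-weight step size everywhere by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy deep sigmoid MLP (hypothetical setup, not nanoGPT).
layers, width, lr = 10, 32, 0.01
Ws = [rng.normal(0, 1.0 / np.sqrt(width), (width, width)) for _ in range(layers)]

x = rng.normal(size=(width, 1))
acts = [x]
for W in Ws:                                  # forward pass, caching activations
    acts.append(sigmoid(W @ acts[-1]))

# Backward pass for the simple loss L = sum(output).
grads = [None] * layers
delta = np.ones_like(acts[-1])
for i in reversed(range(layers)):
    delta = delta * acts[i + 1] * (1 - acts[i + 1])  # sigmoid'
    grads[i] = delta @ acts[i].T                     # dL/dW_i
    delta = Ws[i].T @ delta                          # propagate error backward

gd_step   = [np.abs(lr * g).mean() for g in grads]           # plain SGD step
sign_step = [np.abs(lr * np.sign(g)).mean() for g in grads]  # signSGD step

print("mean |dW| per layer, input -> output")
print("SGD:    ", np.round(gd_step, 8))
print("signSGD:", np.round(sign_step, 8))
```

Running this, the SGD step sizes decay sharply toward the input layer, while the signSGD step size is exactly `lr` in every layer.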

Comment (Jan 22 at 19:35): Someone mentioned that residual connections avoid the left-side issue.
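The residual-connection point can also be checked numerically. A minimal sketch (a hypothetical toy network, with a sigmoid branch chosen for illustration): with a skip connection, the backward pass adds the branch gradient to the unchanged incoming error, so the error norm cannot collapse toward the input the way it does in the plain network.

```python
import numpy as np

rng = np.random.default_rng(1)
layers, width = 10, 32
Ws = [rng.normal(0, 1.0 / np.sqrt(width), (width, width)) for _ in range(layers)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_norms(residual):
    """Norm of the backpropagated error at each layer (input -> output)."""
    h = rng.normal(size=(width, 1))
    hs, zs = [h], []
    for W in Ws:  # forward: h <- h + sigmoid(W h) with a skip, else sigmoid(W h)
        z = sigmoid(W @ hs[-1])
        zs.append(z)
        hs.append(hs[-1] + z if residual else z)
    delta, norms = np.ones((width, 1)), []
    for i in reversed(range(layers)):
        norms.insert(0, float(np.linalg.norm(delta)))
        branch = Ws[i].T @ (zs[i] * (1 - zs[i]) * delta)  # through the sigmoid branch
        delta = delta + branch if residual else branch    # skip passes delta unchanged
    return norms

plain, skip = error_norms(False), error_norms(True)
print("plain:", [round(n, 6) for n in plain])
print("skip: ", [round(n, 6) for n in skip])
```

In the plain network the error norm decays toward the input; with skips it stays the same order of magnitude at every depth, which is the usual explanation for why residual networks avoid this particular imbalance.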
