That is, since the error signal originates at the output layer and propagates back toward the input via backpropagation, does that mean the weights near the output change more than the weights near the input? Is this true, and if so, how do I show it?
Put more generally, what is the distribution of the weight deltas with respect to the network layer?
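One way to probe this empirically is to compute per-layer gradient norms in a small network. Here is a minimal sketch (the depth, width, sigmoid activation, and dummy sum-of-outputs loss are all my own illustrative assumptions, not anything from nanoGPT): with small weights and sigmoid activations, the backpropagated signal shrinks at every layer, so the early layers get much smaller gradients.

```python
import numpy as np

# Sketch: per-layer gradient magnitudes in a deep sigmoid MLP.
# Depth, width, weight scale, and loss are illustrative assumptions.
rng = np.random.default_rng(0)
depth, width = 6, 32
Ws = [rng.normal(0.0, 0.1, (width, width)) for _ in range(depth)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=(1, width))

# Forward pass, caching activations: acts[i] is the input to layer i.
acts = [x]
for W in Ws:
    acts.append(sigmoid(acts[-1] @ W))

# Backward pass from a dummy loss L = sum(output), so dL/d(output) = 1.
delta = np.ones_like(acts[-1])
grad_norms = []
for i in reversed(range(depth)):
    delta = delta * acts[i + 1] * (1 - acts[i + 1])     # through sigmoid'
    grad_norms.append(np.linalg.norm(acts[i].T @ delta))  # ||dL/dW_i||
    delta = delta @ Ws[i].T                              # to previous layer

grad_norms = grad_norms[::-1]  # index 0 = layer closest to the input
print(grad_norms)
```

In this setup the gradient norm grows with layer index, i.e. the layers near the output do receive larger gradients; whether the *weight deltas* follow the same pattern then depends on the optimizer, which is exactly the gradient descent vs. sign gradient descent distinction below.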
Intuitively I thought this would be true for gradient descent (left figure) but not for sign gradient descent (right figure),
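The reason the intuition should break for sign gradient descent: the update is -lr * sign(grad), so every weight moves by exactly the same magnitude lr no matter how small its backpropagated gradient is. A tiny sketch (the example gradient values are made up for illustration):

```python
import numpy as np

# Sketch: under sign gradient descent, update magnitude is lr for every
# weight, so early layers with tiny gradients move just as far as late ones.
lr = 0.01
grads = [np.array([1e-6, -3.0, 0.5]),   # e.g. an early layer (tiny grads)
         np.array([2.0, -0.1, 4.0])]    # e.g. a late layer (large grads)

deltas = [-lr * np.sign(g) for g in grads]  # every entry is exactly +/- lr
for d in deltas:
    print(d)
```

So under sign gradient descent the distribution of weight-delta magnitudes should be flat across layers, even when the gradient magnitudes themselves are not.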
and I then ended up discovering that nanoGPT trains faster with sign gradient descent than with AdamW: https://github.com/nullonesix/sign_nanoGPT
So I'm wondering whether my intuition can be proven true (if it is). Thanks in advance for any other intuition or resources people might have on this topic.
