That is, since the error signal originates at the output layer and propagates back toward the input via backpropagation, does that mean the weights near the output change more than the weights near the input? Is this true, and if so, how do I show it?
Put more generally, what is the distribution of the weight deltas with respect to the network layer?
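One way to probe this empirically is to compute per-layer gradient norms in a small network. Here is a minimal sketch (the depth, width, sigmoid activation, and dummy sum-of-outputs loss are all my own illustrative assumptions, not anything from nanoGPT): with small weights and sigmoid activations, the backpropagated signal shrinks at every layer, so the early layers get much smaller gradients.

```python
import numpy as np

# Sketch: per-layer gradient magnitudes in a deep sigmoid MLP.
# Depth, width, weight scale, and loss are illustrative assumptions.
rng = np.random.default_rng(0)
depth, width = 6, 32
Ws = [rng.normal(0.0, 0.1, (width, width)) for _ in range(depth)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=(1, width))

# Forward pass, caching activations: acts[i] is the input to layer i.
acts = [x]
for W in Ws:
    acts.append(sigmoid(acts[-1] @ W))

# Backward pass from a dummy loss L = sum(output), so dL/d(output) = 1.
delta = np.ones_like(acts[-1])
grad_norms = []
for i in reversed(range(depth)):
    delta = delta * acts[i + 1] * (1 - acts[i + 1])     # through sigmoid'
    grad_norms.append(np.linalg.norm(acts[i].T @ delta))  # ||dL/dW_i||
    delta = delta @ Ws[i].T                              # to previous layer

grad_norms = grad_norms[::-1]  # index 0 = layer closest to the input
print(grad_norms)
```

In this setup the gradient norm grows with layer index, i.e. the layers near the output do receive larger gradients; whether the *weight deltas* follow the same pattern then depends on the optimizer, which is exactly the gradient descent vs. sign gradient descent distinction below.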
Intuitively I thought this would be true for gradient descent (left figure) but not for sign gradient descent (right figure),
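The reason the intuition should break for sign gradient descent: the update is -lr * sign(grad), so every weight moves by exactly the same magnitude lr no matter how small its backpropagated gradient is. A tiny sketch (the example gradient values are made up for illustration):

```python
import numpy as np

# Sketch: under sign gradient descent, update magnitude is lr for every
# weight, so early layers with tiny gradients move just as far as late ones.
lr = 0.01
grads = [np.array([1e-6, -3.0, 0.5]),   # e.g. an early layer (tiny grads)
         np.array([2.0, -0.1, 4.0])]    # e.g. a late layer (large grads)

deltas = [-lr * np.sign(g) for g in grads]  # every entry is exactly +/- lr
for d in deltas:
    print(d)
```

So under sign gradient descent the distribution of weight-delta magnitudes should be flat across layers, even when the gradient magnitudes themselves are not.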
and I then ended up discovering that nanoGPT trains faster with sign gradient descent than with AdamW: https://github.com/nullonesix/sign_nanoGPT
So I'm wondering whether my intuition can be proven true (if it is). Thanks in advance for any other intuition or resources people might have on this topic.
