Second Moment (Uncentered Variance) Estimate of Gradient

Question

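I am reading Kingma and Ba's paper introducing the Adam optimizer. I was looking over the derivation for the second moment estimate:

$$ \mathbb{E}[v_t]=\mathbb{E}\left[(1-\beta_2)\sum_{i=1}^t\beta_2^{t-i}\,g_i^2\right]=\mathbb{E}[g_t^2]\,(1-\beta_2)\sum_{i=1}^t\beta_2^{t-i}+\zeta=\mathbb{E}[g_t^2]\,(1-\beta_2^t)+\zeta $$

I noticed that they sum a finite geometric series in going from the second to the third equality. The formula for the sum is:

$$ \sum_{k=0}^{n-1}ar^k=a\,\frac{1-r^n}{1-r} $$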

But they don't seem to multiply by the first term of the series. What am I missing? If this is some kind of approximation, why is it allowed/favorable?

Davide Giraudo · Accepted Answer · 2025-09-01 09:11:18Z

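With the change of index $j=t-i$, we get $$ (1-\beta_2)\sum_{i=1}^t\beta_2^{t-i}=(1-\beta_2)\sum_{j=0}^{t-1}\beta_2^j= \sum_{j=0}^{t-1}\left(\beta_2^j-\beta_2^{j+1}\right)=1-\beta_2^t, $$ where the last equality follows because the sum telescopes. In particular, nothing is being approximated: after reindexing, the series has first term $\beta_2^0=1$ and ratio $\beta_2$, so multiplying by the first term changes nothing, and the factor $1-\beta_2$ in front cancels the denominator of the geometric-sum formula.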
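The identity can also be checked numerically. The following is a minimal sketch, not code from the paper: the function names `weight_sum` and `closed_form` and the value `beta2 = 0.999` are assumptions chosen for illustration. It compares the term-by-term sum with the closed form $1-\beta_2^t$ for a few values of $t$.

```python
def weight_sum(beta2: float, t: int) -> float:
    """Compute (1 - beta2) * sum_{i=1}^{t} beta2**(t - i) term by term."""
    return (1.0 - beta2) * sum(beta2 ** (t - i) for i in range(1, t + 1))


def closed_form(beta2: float, t: int) -> float:
    """Closed form 1 - beta2**t obtained from the telescoping sum."""
    return 1.0 - beta2 ** t


if __name__ == "__main__":
    beta2 = 0.999  # illustrative value (assumption), not taken from the question
    for t in (1, 2, 10, 1000):
        # The two columns should agree up to floating-point rounding.
        print(t, weight_sum(beta2, t), closed_form(beta2, t))
```

The two values agree up to rounding for every $t$, which is consistent with the equality being exact rather than an approximation.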
