On a very informal level, if we were to compare the (classical) Gradient Descent Algorithm to the Stochastic Gradient Descent Algorithm, the first thing that comes to mind is:
- Gradient Descent can be considered a slower algorithm than Stochastic Gradient Descent, since each iteration requires computing the gradient over the entire data set. Stochastic Gradient Descent, on the other hand, can be considered faster, since it approximates the gradient using "mini batches" of observations from the data set. Logically, this makes sense: Stochastic Gradient Descent performs fewer calculations per iteration than Gradient Descent, so the same computer should take less time per iteration (a sketch of this cost difference is shown below).
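
To make the per-iteration cost difference concrete, here is a minimal numpy sketch of one Gradient Descent step versus one mini-batch Stochastic Gradient Descent step on a least-squares objective. The data, batch size, and learning rate are made up purely for illustration:

```python
import numpy as np

# Illustrative only: least-squares loss f(w) = (1/2n) * ||Xw - y||^2.
# All quantities below (n, d, batch_size, lr) are made-up example values.
rng = np.random.default_rng(0)
n, d, batch_size, lr = 100_000, 50, 64, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

# Full-batch Gradient Descent: one step touches all n rows.
full_grad = X.T @ (X @ w - y) / n
w_gd = w - lr * full_grad

# Mini-batch Stochastic Gradient Descent: one step touches only batch_size rows,
# so each iteration is roughly n / batch_size times cheaper, but the gradient is noisy.
idx = rng.choice(n, size=batch_size, replace=False)
Xb, yb = X[idx], y[idx]
mini_grad = Xb.T @ (Xb @ w - yb) / batch_size
w_sgd = w - lr * mini_grad
```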
However, (in the spirit of tradeoffs) the somewhat logical counterpoint to the above argument is that:
- Since Stochastic Gradient Descent only approximates the gradient of the function, whereas Gradient Descent evaluates the full gradient, one would imagine that Gradient Descent might be better at finding the true minimum of the function than Stochastic Gradient Descent. However, I am not sure about this (a sketch of the noise in the mini-batch gradient estimate is shown below).
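
To illustrate what "approximating the gradient" means here, below is a minimal sketch (same made-up least-squares setup as above) comparing many mini-batch gradients to the full gradient at a fixed point: on average they agree, but any single mini-batch gradient can deviate noticeably.

```python
import numpy as np

# Illustrative only: how noisy is a mini-batch gradient relative to the full gradient?
rng = np.random.default_rng(1)
n, d, batch_size = 10_000, 20, 32
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)

full_grad = X.T @ (X @ w - y) / n

# Draw many mini-batch gradients at the same point w.
estimates = []
for _ in range(2_000):
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    estimates.append(Xb.T @ (Xb @ w - yb) / batch_size)
estimates = np.array(estimates)

# On average the mini-batch gradient matches the full gradient (it is unbiased),
# but any single estimate can be far from it (the variance shrinks as batch_size grows).
print(np.linalg.norm(estimates.mean(axis=0) - full_grad))    # small
print(np.linalg.norm(estimates - full_grad, axis=1).mean())  # noticeably larger
```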
My Question: In the past few years, a lot of research has been done on the behavior (e.g. theoretical convergence properties) of Stochastic Gradient Descent (e.g. https://arxiv.org/abs/2103.14350 , https://raghumeka.github.io/CS289ML/gdnotes.pdf), and this work suggests that Stochastic Gradient Descent can converge comparably to Gradient Descent.
But are there any theoretical results which expose the above (speculated) tradeoff? At the end of the day, if Stochastic Gradient Descent requires less computing power than Gradient Descent, are there any theoretical results suggesting that Gradient Descent has an inherent ability to better find the minimum of the function (e.g. perhaps it is less likely to get "stuck" at saddle points), since it does not rely on an approximation of the gradient? Or, if Stochastic Gradient Descent is equal to Gradient Descent in this regard, could Stochastic Gradient Descent then be considered superior to Gradient Descent?
Can someone please comment on this?
Thanks!

