
On a very informal level, if we were to compare the (classical) Gradient Descent Algorithm to the Stochastic Gradient Descent Algorithm, the first thing that comes to mind is:

  • Gradient Descent can be considered a slower algorithm than Stochastic Gradient Descent, since every step requires performing mathematical operations (e.g. computing derivatives) on the entire data set. Stochastic Gradient Descent, on the other hand, can be considered a faster algorithm, since it approximates the gradient using "mini-batches" of observations from the data set. Logically, this makes sense: Stochastic Gradient Descent requires fewer calculations per step than Gradient Descent, so the same computer should take less time to perform them (a minimal sketch of the two update rules follows this point).
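
For concreteness, here is a minimal sketch of the two update rules on a least-squares problem. The data, learning rate, and function names are illustrative choices, not taken from any particular reference:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))                      # full data set: n = 10,000 observations
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)

    def full_gradient(w, X, y):
        # Gradient of the mean squared error over ALL n observations:
        # O(n * d) work on every single step.
        return 2 * X.T @ (X @ w - y) / len(y)

    def minibatch_gradient(w, X, y, batch_size=32):
        # Unbiased estimate of the same gradient from a random mini-batch:
        # only O(batch_size * d) work per step.
        idx = rng.integers(0, len(y), size=batch_size)
        Xb, yb = X[idx], y[idx]
        return 2 * Xb.T @ (Xb @ w - yb) / batch_size

    w = np.zeros(20)
    lr = 0.1
    w_gd = w - lr * full_gradient(w, X, y)                 # one (expensive, exact) GD step
    w_sgd = w - lr * minibatch_gradient(w, X, y)           # one (cheap, noisy) SGD step

With batch_size = 32 against n = 10,000, each SGD step touches roughly 0.3% of the data, which is the whole computational argument in this bullet.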

However, (in the spirit of tradeoffs) the somewhat logical counter point to the above argument is that:

  • Since Stochastic Gradient Descent only approximates the gradient of the function, whereas Gradient Descent evaluates the full gradient, one would imagine that Gradient Descent might have a better ability to find the true minimum of the function compared to Stochastic Gradient Descent. However, I am not sure about this.

My Question: In the past few years, a lot of research has studied the behavior (e.g. theoretical convergence properties) of Stochastic Gradient Descent (e.g. https://arxiv.org/abs/2103.14350 , https://raghumeka.github.io/CS289ML/gdnotes.pdf), and it has demonstrated that Stochastic Gradient Descent has convergence abilities comparable to those of Gradient Descent.

But are there any theoretical results that expose the above (speculated) tradeoff? At the end of the day, if Stochastic Gradient Descent requires less computing power than Gradient Descent, are there any theoretical results suggesting that Gradient Descent has an inherent ability to better find the minimum of the function (e.g. that it is less likely to get "stuck" in saddle points), since it does not rely on an approximation of the gradient? Or, if Stochastic Gradient Descent is equal to Gradient Descent in this regard, could Stochastic Gradient Descent then be considered superior to Gradient Descent?

Can someone please comment on this?

Thanks!

  • Don't ask a question that will lead to opinions. Asking "Is X superior to Y?" will most likely lead to opinions. Rephrase your question so that it leads to objective answers. – Commented Feb 15, 2022 at 9:38

1 Answer


Stochastic Gradient Descent has empirically been shown to lead to better results than classic Gradient Descent ever since its formulation, and today we are getting closer to understanding that this is not just luck, but the result of better mathematical properties as well. So the answer to your question is no: the theoretical results also show that SGD is better than classic GD.

These papers offer different explanations of why SGD tends to perform better than classic GD:

  1. Train faster, generalize better: Stability of stochastic gradient descent
  2. Escaping From Saddle Points – Online Stochastic Gradient for Tensor Decomposition
  3. An Alternative View: When Does SGD Escape Local Minima?

Quick summaries:

  1. SGD works better because it is less prone to generalization error; or, to put it the other way around, it is a more stable learner compared to GD, i.e. the solutions learned with SGD are less affected by small perturbations in the training data.
  2. SGD works better because its randomness makes it less prone to getting stuck in saddle points (a toy sketch of this effect follows the image below).
  3. SGD works better because it is literally GD applied to a smoother version (convolved with noise) of the objective function we want to minimize, so again, there is less risk of getting stuck not only in saddle points, but in bad local minima as well (see the image below).

[Figure: illustration of the loss function being smoothed by noise convolution]
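To make the saddle-point intuition in point 2 concrete, here is a toy sketch (my own construction, not from the papers above), where plain GD stalls at a saddle while adding isotropic Gaussian gradient noise, used here as a stand-in for mini-batch noise, escapes it:

    import numpy as np

    rng = np.random.default_rng(1)

    def grad(w):
        # Gradient of f(w) = w[0]**2 - w[1]**2, which has a saddle point at the origin.
        return np.array([2.0 * w[0], -2.0 * w[1]])

    lr, steps = 0.1, 50
    w_gd = np.array([1.0, 0.0])    # initialized exactly on the saddle's attracting direction
    w_sgd = np.array([1.0, 0.0])

    for _ in range(steps):
        w_gd -= lr * grad(w_gd)                                   # deterministic: never leaves the axis
        w_sgd -= lr * (grad(w_sgd) + 0.01 * rng.normal(size=2))   # noisy gradient estimate

    print("GD ends near the saddle:", w_gd)       # ~ (0, 0): stuck
    print("SGD has escaped along w[1]:", w_sgd)   # |w[1]| grows once noise kicks it off the axis

Because the full gradient at w = (x, 0) has no component in the second coordinate, GD can only slide into the saddle; the noise in SGD breaks that symmetry and the unstable direction then pushes the iterate away.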

Extra note: it is not hard to understand conceptually how SGD can help escape saddle points and local minima. When we compute a single step using the gradients of all training instances, as GD does, we leave little room for exploration of the loss surface: if the training instances point toward a local minimum, that will be the solution GD finds. SGD, on the other hand, computes gradients only on a batch of instances, so each update is a suboptimal step in the negative direction of the gradient. Being suboptimal, each step leaves room for exploration: if one update points toward a local minimum, the next step might point in a different direction, something that cannot happen in GD, where every step follows the single, deterministic full gradient.

The same conclusion, under a different conceptualization, comes from the interpretation in the last paper. The authors show that SGD can be seen as GD (so, again, deterministic update steps) performed on a loss function convolved with noise. Convolving with noise has the effect of smoothing the function (as a metaphor, think of combining random frequencies: they are more likely to cancel each other out than to amplify each other). Smoothing the loss function means precisely removing small local minima, as you can see in the last picture. The consequence is the same as in the first explanation, i.e. less chance of getting stuck in a local minimum.

[Figure: a wiggly loss function and its noise-smoothed version, in which the small local minima have been removed]
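As a small numerical sketch of this smoothing argument (the loss f and the noise scale sigma below are illustrative choices, not taken from the paper), one can estimate the noise-convolved loss E[f(x + z)] by Monte Carlo and see the high-frequency wiggles vanish while the broad basin survives:

    import numpy as np

    def f(x):
        # A loss with one broad basin plus small, high-frequency local minima.
        return x**2 + 0.5 * np.sin(20 * x)

    def f_smoothed(x, sigma=0.3, n_samples=100_000, seed=2):
        # Monte Carlo estimate of E[f(x + z)] with z ~ N(0, sigma^2):
        # the convolution of f with a Gaussian, which washes out the wiggles.
        z = np.random.default_rng(seed).normal(scale=sigma, size=n_samples)
        return f(x + z).mean()

    for x in np.linspace(-1.0, 1.0, 9):
        print(f"x={x:+.2f}  f(x)={f(x):+.3f}  smoothed={f_smoothed(x):+.3f}")

The printed smoothed values follow x**2 plus a constant: the sin term is averaged away, so the only remaining minimum is the broad one.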

  • Thank you for your answer! I would be very interested in understanding why the added randomness and smoothness make it less likely to get stuck in saddle points! – Commented Feb 15, 2022 at 14:55
  • I've updated the question – Commented Feb 15, 2022 at 15:47
  • Thank you so much for your answer! If you have time, can you please take a look at this question? ai.stackexchange.com/questions/34389/… thank you so much for all your help! – Commented Feb 15, 2022 at 15:56
