Suppose we wish to minimize a function $f(\vec{x})$ via the gradient descent algorithm \begin{equation} \vec{x}_{n+1} = \vec{x}_n - \eta \vec{\nabla}f(\vec{x}_n) \end{equation} starting from some initial guess $\vec{x}_0$. This is a well-studied problem.
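For concreteness, here is a minimal sketch of the plain iteration together with the kind of stopping criterion I describe below (the function name, step size, and tolerance are just illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-8, max_iter=100_000):
    """Plain gradient descent: x_{n+1} = x_n - eta * grad_f(x_n).

    Stops once the step ||x_{n+1} - x_n|| falls below `tol`.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - eta * grad_f(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```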

Now suppose that near its minimum the "landscape" represented by the function $f(\vec{x})$ has directions with a very high gradient as well as directions with a very low gradient. That is, $f(\vec{x})$ is nearly flat along certain directions near its minimum. As a toy example, consider minimizing the function \begin{equation} f(x,y) = \frac{1}{2}\left(x^2 + \epsilon y^2\right) \tag{1} \end{equation} for some very small value of $\epsilon$, such as $10^{-20}$. Although this has a unique minimum at $(x,y) = (0,0)$, it is nearly flat in the $y$ direction. Most gradient descent implementations use a stopping criterion that terminates the iteration when $\vec{x}$ or $f$ changes by less than some set threshold during a step. Under such a criterion, minimizing (1) from an arbitrary initial guess makes $x$ quickly converge to $0$, while $y$ barely changes from step to step. The algorithm would most likely trigger its stopping condition long before $y$ approaches $0$, leaving an essentially random final value for $y$.
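A quick numerical illustration of this failure mode (the starting point, step size, and tolerance below are arbitrary choices):

```python
import numpy as np

eps = 1e-20

def grad_f(p):
    # Gradient of f(x, y) = (x^2 + eps * y^2) / 2
    return np.array([p[0], eps * p[1]])

x = np.array([1.0, 1.0])            # arbitrary initial guess
eta, tol = 0.5, 1e-8
for n in range(100_000):
    step = eta * grad_f(x)
    if np.linalg.norm(step) < tol:  # stop when the step is tiny
        break
    x = x - step

print(n, x)  # stops after ~27 iterations with x ~ 1e-8 but y still ~ 1
```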

Are there variants of the gradient descent algorithm designed to handle the case where there is large variation in the magnitude of the gradient in different directions near a minimum? Or am I overthinking this, and the solution is to simply choose a very, very small value as a threshold in the stopping criterion?

  • Look into the literature about the Rosenbrock function.
  • Are you looking for formal convergence proofs, or a practical alternative algorithm? If it is the latter, I suggest looking into the conjugate gradient method.
  • @TroyVanVoorhis Definitely a practical alternative algorithm. Can the conjugate gradient method be used despite the fact that my original problem is neither linear nor sparse?
  • @JohnBarber Yes, it works for non-linear and non-sparse problems. It only requires that you know both the current gradient and the previous step (or previous gradient). See for example this Wikipedia page: en.m.wikipedia.org/wiki/Nonlinear_conjugate_gradient_method. The basic idea is to assume the problem is locally a set of (nearly) linear equations at each iteration. One updates based on that linear optimization and then repeats, linearising around the new point. (A rough sketch follows after these comments.)
  • You want to look into preconditioned gradient descent or a preconditioned quasi-Newton/Newton method. (See the sketch after these comments.)
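To make the conjugate-gradient suggestion in the comments concrete, here is a rough sketch of a nonlinear CG iteration (Polak–Ribière+ direction update with a crude backtracking line search; the function name and parameter values are illustrative, not taken from any particular library):

```python
import numpy as np

def nonlinear_cg(f, grad_f, x0, tol=1e-10, max_iter=1000):
    """Nonlinear conjugate gradient (Polak-Ribiere+) with a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g                                    # first search direction: steepest descent
    for _ in range(max_iter):
        # Backtracking (Armijo) line search along d.
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d) and alpha > 1e-16:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        if np.linalg.norm(g_new) < tol:       # gradient-norm stopping test
            return x_new
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))  # Polak-Ribiere+ coefficient
        d = -g_new + beta * d                 # new conjugate direction
        x, g = x_new, g_new
    return x
```

With exact line searches, CG minimizes a two-dimensional quadratic such as (1) in at most two iterations regardless of $\epsilon$, although the stopping test still has to tolerate the tiny gradient along the flat direction.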
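And here is a minimal sketch of the preconditioning suggestion applied to the toy problem; the diagonal preconditioner is chosen by hand only because the Hessian of (1) is known exactly, whereas in a real problem it would have to be approximated:

```python
import numpy as np

eps = 1e-20

def grad_f(p):
    # Gradient of f(x, y) = (x^2 + eps * y^2) / 2
    return np.array([p[0], eps * p[1]])

P = np.diag([1.0, 1.0 / eps])       # (approximate) inverse Hessian used as preconditioner

x = np.array([1.0, 1.0])
eta, tol = 1.0, 1e-8
for n in range(1000):
    step = eta * (P @ grad_f(x))    # x_{n+1} = x_n - eta * P * grad f(x_n)
    if np.linalg.norm(step) < tol:
        break
    x = x - step

print(n, x)  # reaches (0, 0) to machine precision after the first step
```

Building up such an approximate inverse Hessian from successive gradients is exactly what quasi-Newton methods like BFGS and L-BFGS do.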
