
Assume $f(\vec x)$ is Lipschitz continuous and has two local minima $\vec x_1^*$ and $\vec x_2^*$, where $\vec x_1^*$ is the global minimum. We start the gradient descent algorithm from $\vec x_0$ with $\left \|\vec x_0 - \vec x_1^* \right \| \leq \left \|\vec x_0 - \vec x_2^* \right \|$. Additionally, we know that $f(\vec x)$ is convex between $\vec x_0$ and $\vec x_1^*$, but not convex between $\vec x_1^*$ and $\vec x_2^*$.

Can we say that gradient descent initialized at $\vec x_0$ always converges to $\vec x_1^*$ rather than $\vec x_2^*$? Is there a theorem to support this? Is there any other way to establish it?

  • You have tagged this as convex-optimization, which I suspect affects the answer. Your question doesn't mention whether or not the function must be convex, though. – Commented Apr 26, 2014 at 22:06
  • @DavisYoshida I have modified my question :) – Commented Apr 26, 2014 at 22:18
  • It depends on the choice of step size. If the step size is very large you might end up at $\vec x^*_2$ on the first step. – Commented Apr 26, 2014 at 22:21
  • @Rahul Makes sense. Thanks :) – Commented Apr 26, 2014 at 22:25
  • Seeing as you've started a bounty, can you elaborate on what you're looking for? – Commented Jul 4, 2014 at 7:30

2 Answers


The distance from $x_0$ to the two points has nothing to do with the direction of the gradient at $x_0$. You could have $||x_0 - x_1|| < ||x_0 - x_2||$ and yet the local gradient at $x_0$ points toward $x_2$ instead of $x_1$. Convexity (or the lack of it) between $x_0$ and $x_1, x_2$ does not settle the matter either, because gradient descent is a local method that relies only on gradients, not on second derivatives. You could very well converge to $x_2$ with gradient descent, though perhaps not with more sophisticated methods.
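As the comments note, the step size alone can decide the outcome. Here is a small numerical sketch of that sensitivity; the function $f(x) = 1 - \cos x + 0.1x$ and all numbers are my own illustration, not taken from the answers. It has its global minimum near $x_1 \approx -0.1$, another local minimum near $x_2 \approx 6.18$, and the starting point $x_0 = 2$ is closer to $x_1$:

```python
import math

# Illustrative function (my own choice, not from the answer):
# f(x) = 1 - cos(x) + 0.1*x, so f'(x) = sin(x) + 0.1.
# Global minimum near x1 ≈ -0.1, next local minimum near x2 ≈ 6.18.
def f_prime(x):
    return math.sin(x) + 0.1

def descend(x, gamma, steps):
    for _ in range(steps):
        x -= gamma * f_prime(x)
    return x

x0 = 2.0                           # closer to x1 than to x2
print(descend(x0, 0.1, 500))       # small step: settles near x1 ≈ -0.1
print(x0 - 5.0 * f_prime(x0))      # one large step: lands near -3.05,
                                   # far past x1, outside the convex region
```

With $\gamma = 0.1$ the iterates walk monotonically into the nearby convex basin; with $\gamma = 5$ the very first step overshoots $x_1$ entirely, which is exactly the failure mode raised in the comments.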

  • But if I use gradient descent from $x_0$ with a very small step size, then because of the convexity between $x_0$ and $x_1$, it must converge to $x_1$, right? – Commented Apr 26, 2014 at 22:35
  • @user137273 No, I don't believe that's true. The directional derivative could locally be greater heading toward $x_2$ instead of $x_1$, even though the function fails to be convex everywhere between $x_0$ and $x_2$. – Commented Apr 26, 2014 at 22:36
  • @2566092 What if $x_1$ is the global optimum? In your opinion, my claim is correct if and only if the directional derivative toward $x_1$ is greater than the one toward $x_2$, right? – Commented Apr 26, 2014 at 22:44

No, there is no guarantee that you will converge to $x_1$; it also depends on the choice of your step size. Recall that the gradient descent algorithm is defined by $$ \vec x^{k+1} = \vec x^k -\gamma_k \nabla f(\vec x^k)$$ where $\gamma_k$ is the step size. First let us show that if you choose a constant step size, then one can always find a counterexample. So suppose $\gamma_k = \gamma$ for every $k$. Now consider the function $$f(x)= \left\{ \begin{array}{ll} \frac{2}{\gamma}x^2 & \text{if }x \leq 1\\  \frac{2}{\gamma} &\text{else } \end{array}\right. \quad \text{ then }\quad f'(x)= \left\{ \begin{array}{ll} \frac{4}{\gamma}x & \text{if }x \leq 1 \\ 0 &\text{else } \end{array}\right.$$ and the points $x_0=-\frac{2}{3}$, $x_1=0$, $x_2 = 2$ (so that $|x_0-x_1| < |x_0-x_2|$). The first step of gradient descent is then $$ x_0-\gamma f'(x_0) = -\frac{2}{3} + \frac{4}{\gamma}\gamma \frac{2}{3} = 2 = x_2, $$ and the iteration gets stuck at $x_2$: since $f'(x_2)=0$, the next step is $x_2 -\gamma f'(x_2) = x_2$, and so on. In the picture below you can find the plot of $f$ for $\gamma = 2$.
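Under these definitions, the constant-step counterexample is easy to verify numerically; a minimal sketch with $\gamma = 2$:

```python
def f_prime(x, gamma):
    # Derivative of the piecewise counterexample above:
    # f(x) = (2/gamma) x^2 for x <= 1, and f(x) = 2/gamma otherwise.
    return (4.0 / gamma) * x if x <= 1 else 0.0

gamma = 2.0
x = -2.0 / 3.0                 # x0
for _ in range(5):
    x -= gamma * f_prime(x, gamma)
    print(x)                   # jumps to x2 = 2 on the first step and stays
```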

[Plot of $f$ with $\gamma = 2$.]

Now one could argue that a constant step size is a bad choice, and that setting $$\gamma_k = \arg\min_{\gamma \geq 0} f(\vec x^k-\gamma \nabla f(\vec x^k)),\qquad (*)$$ would be better (i.e., we choose the step that optimizes the descent). However, this is not always enough! It is shown in the lecture notes on Convex Optimization of M. Hein that the function $$ f(x,y) = \left\{ \begin{array}{ll} 5\sqrt{9x^2+16y^2} & \text{if }x >|y| \\ 9x+16|y| &\text{else } \end{array}\right.$$ is pathological enough that if you start gradient descent with exact line search (i.e., $\gamma_k$ fulfills $(*)$ at each step) from $(1,\left(\frac{10}{16}\right)^2)$, then the method converges to $(0,0)$, which is not even a local minimum. In contrast, a constant step size $\gamma$ small enough will (slowly) drive the iterates toward $(-\infty, 0)$.
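The small-constant-step behaviour of Hein's example can also be watched numerically. The sketch below is my own; in particular, the subgradient convention $\operatorname{sign}(0) = 0$ on the kink $y = 0$ is an assumption, not part of the original construction:

```python
import math

def grad(x, y):
    # (Sub)gradient of f(x,y) = 5*sqrt(9x^2+16y^2) if x > |y|, else 9x + 16|y|.
    # Taking sign(0) = 0 on the kink y = 0 is my own convention.
    if x > abs(y):
        r = math.sqrt(9 * x * x + 16 * y * y)
        return 45 * x / r, 80 * y / r
    gy = 16.0 * math.copysign(1.0, y) if y != 0 else 0.0
    return 9.0, gy

x, y = 1.0, (10 / 16) ** 2
gamma = 0.01                   # small constant step size
for _ in range(10_000):
    gx, gy = grad(x, y)
    x, y = x - gamma * gx, y - gamma * gy
print(x, y)                    # x drifts steadily toward -infinity, y chatters near 0
```

After a few steps the iterates leave the cone region $x > |y|$ and then follow the linear piece, where the $x$-component decreases by $9\gamma$ per step, consistent with the claim that they tend toward $(-\infty, 0)$.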

Note: A very complete account of the gradient descent algorithm (and the limits of its applicability) can be found in the book Convex Optimization by S. Boyd and L. Vandenberghe.

  • So your suggestion is to choose a small step size. But even then, there is still no theoretical guarantee of converging to $x_1$, right? – Commented Jul 10, 2014 at 14:49
  • Hi @Surb, could you please have a look at my related question here? – Commented Feb 29, 2020 at 16:49
