$\begingroup$

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a coercive and strictly convex function.

I know that if $f \in \mathcal{C}^1$ (i.e., the first derivatives of $f$ are continuous), then for any initial guess $x_0 \in \mathbb{R}^n$, gradient descent with a suitably chosen step size converges to the unique global minimizer $x^*$ of $f$. However, it is possible for (sub)gradient descent to converge to the global minimizer of a function which is not $\mathcal{C}^1$ (e.g., $f = \|\cdot\|_1$).

Are there more general conditions on $f$ that guarantee convergence of gradient descent to the unique minimizer when the first derivatives of $f$ are discontinuous on at most a set of measure zero (so that $f$ still admits subgradients at these points)?

$\endgroup$
  • $\begingroup$ What step size are you using in your gradient descent method? Are you assuming that f is strictly or strongly convex? $\endgroup$ Commented Sep 27, 2017 at 18:16
  • $\begingroup$ AFAIK, step size does not matter in the first case I described. It would be interesting to know if that changes if $f$ only belongs to an appropriate Sobolev space rather than $\mathcal{C}^1$. It's not my intent to assume strict or strong convexity, as I'm not sure if those are necessary to guarantee convergence [to the unique minimizer] (though I'd fathom that strict convexity is sufficient). $\endgroup$ Commented Sep 27, 2017 at 18:21
  • $\begingroup$ Your understanding is incorrect. For example, $f(x) = \max(0,\, x^2 - 1)$ is convex and coercive, but the minimizer isn't unique, and you can't control which of the minimum points in $[-1,1]$ you'll end up at. Also, if you try to minimize $f(x) = x^2$ starting at $x^{(0)} = 1$ and using $x^{(n+1)} = x^{(n)} - f'(x^{(n)})$, you'll get $x^{(1)} = -1$, $x^{(2)} = 1$, $x^{(3)} = -1, \dots$ $\endgroup$ Commented Sep 27, 2017 at 20:50
  • $\begingroup$ Another example is $f(x) = |x|$. If you use the subgradient $g(x) = 1$ (if $x > 0$) and $g(x) = -1$ (if $x < 0$), and iterate using $x^{(n+1)} = x^{(n)} - 2g(x^{(n)})$ starting at $x^{(1)} = 1$, you'll get $x^{(2)} = -1$, $x^{(3)} = 1, \dots$ If you switch to $x^{(n+1)} = x^{(n)} - \epsilon\, g(x^{(n)})$ and start with $x^{(0)} = \epsilon/2$, you'll get a similar failure to converge. $\endgroup$ Commented Sep 27, 2017 at 20:53
  • $\begingroup$ Would it be correct to say the following? Let $f$ be strictly convex, coercive and $L$-Lipschitz, $L > 0$ some constant. Then for any $x^0 \in \mathbb{R}^n$ there exists a sequence $(\alpha_k)_{k\geq 0}$ with $\alpha_k \geq 0$ for all $k$, such that the gradient descent algorithm $x^{k+1} = x^k - \alpha_k g_k$ for $g_k \in \partial f(x^k)$ converges to the unique minimizer $x^*$ of $f$. $\endgroup$ Commented Sep 30, 2017 at 22:02
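The step-size failures described in the comments above are easy to reproduce numerically. Here is a minimal Python sketch (the helper names are my own, not from the thread) showing that a fixed-step subgradient iteration on $f(x) = |x|$ oscillates instead of converging:

```python
# Minimal sketch (hypothetical helper names): a fixed step size makes the
# subgradient iteration on f(x) = |x| oscillate instead of converging.

def subgrad_abs(x):
    """Return a subgradient of f(x) = |x| (any value in [-1, 1] is valid at 0)."""
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def fixed_step_iterates(x0, step, n):
    """Run n fixed-step subgradient iterations x <- x - step * subgrad_abs(x)."""
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] - step * subgrad_abs(xs[-1]))
    return xs

# Starting at x0 = step / 2, the iterates bounce between +step/2 and -step/2
# forever, so the method never reaches the minimizer x* = 0.
print(fixed_step_iterates(0.5, 1.0, 4))  # [0.5, -0.5, 0.5, -0.5, 0.5]
```

This matches the comments: with any constant step the iterates eventually overshoot the kink at $0$ and cycle, which is why the answer below requires diminishing step sizes.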

1 Answer

$\begingroup$

Theorem:
Let $f$ be a coercive, strictly convex and $L$-Lipschitz function, $L > 0$. Denote by $\partial f(x)$ the subdifferential of $f$ at the point $x \in \mathbb R^n$. For all $x^0 \in \mathbb R^n$, there exists $(\alpha_k)_{k\geq 0}\subseteq \mathbb R$ with $\alpha_k \geq 0$ for all $k$; $\sum_k \alpha_k^2 < \infty$; and $\sum_k \alpha_k = \infty$, such that the subgradient descent algorithm, $$ x^{k+1} := x^k - \alpha_k g^k, \qquad g^k \in \partial f(x^k) $$ converges to the optimal solution $x^*$ in the sense that $x^k_\mathrm{best} \xrightarrow{k\to \infty} x^*$, where $$ x^k_\mathrm{best} := \arg\min \{ f(w) : w = x^j, \ 0\leq j \leq k \}. $$

Proof.
The proof follows the standard subgradient-method analysis, similar to material accompanying Boyd and Vandenberghe's Convex Optimization book. By the definition of the iteration and the subgradient inequality $f(x^*) \geq f(x^k) + \langle g^k, x^* - x^k\rangle$, \begin{align*} \|x^{k+1} - x^*\|_2^2 &= \|x^k - x^*\|_2^2 - 2\alpha_k \langle g^k , x^k - x^*\rangle + \alpha_k^2 \|g^k\|_2^2 \\ &\leq \|x^k - x^*\|_2^2 - 2\alpha_k (f(x^k) - f(x^*)) + \alpha_k^2 \|g^k\|_2^2. \end{align*} Applying this bound recursively, \begin{align*} \|x^{k+1} - x^*\|_2^2 \leq \|x^0 - x^*\|_2^2 - 2\sum_{j\leq k} \alpha_j (f(x^j) - f(x^*)) + \sum_{j \leq k} \alpha_j^2 \|g^j\|_2^2. \end{align*} Now, $\|x^{k+1} - x^*\|_2^2 \geq 0$ and $\|x^0 - x^*\|_2 = R$ for some $R \geq 0$. Moreover, $f(x^k_\mathrm{best}) \leq f(x^j)$ for all $j \leq k$ by definition, so \begin{align*} 2 (f(x^k_\mathrm{best}) - f(x^*)) \sum_{j\leq k} \alpha_j \leq 2 \sum_{j\leq k} \alpha_j (f(x^j) - f(x^*)) \leq R^2 + \sum_{j\leq k} \alpha_j^2 \|g^j\|_2^2. \end{align*} Next, the subgradients are bounded: for $g \in \partial f(x)$ and any $y \in \mathbb R^n$, the subgradient inequality and $L$-Lipschitz continuity of $f$ give $\langle g, x-y\rangle \leq f(x) - f(y) \leq L \|x-y\|_2$; choosing $y = x - g$ yields $\|g\|_2^2 \leq L\|g\|_2$, hence $\|g\|_2 \leq L$. Therefore, the above expression can be rearranged as \begin{align*} f(x^k_\mathrm{best}) - f(x^*) \leq \frac{R^2 + L^2 \sum_{j\leq k} \alpha_j^2}{2\sum_{j\leq k} \alpha_j} \xrightarrow{k\to\infty} 0, \end{align*} since the numerator stays bounded ($\sum_k \alpha_k^2 < \infty$) while the denominator diverges ($\sum_k \alpha_k = \infty$). Finally, coercivity of $f$ makes its sublevel sets compact, so $(x^k_\mathrm{best})_k$ is bounded; every limit point $\bar x$ of this sequence satisfies $f(\bar x) = f(x^*)$ by continuity of $f$, and strict convexity makes the minimizer unique, so $\bar x = x^*$. Hence $x^k_\mathrm{best} \xrightarrow{k\to\infty} x^*$. $\square$
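As a numerical sanity check on the theorem, here is a small Python sketch (my own illustration, not part of the original answer) of the subgradient method with the diminishing step sizes $\alpha_k = 1/(k+1)$, which satisfy $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$, applied to $f(x) = \|x\|_1$. Note that $\|\cdot\|_1$ is not strictly convex, but it is coercive and $1$-Lipschitz with the unique minimizer $0$, so it makes a simple test case:

```python
# Illustrative sketch (helper names are my own): subgradient method with
# diminishing steps alpha_k = 1/(k+1) on f(x) = ||x||_1, tracking x_best.

def f(x):
    """f(x) = ||x||_1: convex and coercive, with unique minimizer at the origin."""
    return sum(abs(xi) for xi in x)

def subgrad(x):
    """A subgradient of the l1 norm: componentwise sign (0 chosen at 0)."""
    return [1.0 if xi > 0 else (-1.0 if xi < 0 else 0.0) for xi in x]

def subgradient_method(x0, n_iters):
    """Return (best iterate, best value) after n_iters diminishing-step updates."""
    x, best_x, best_f = list(x0), list(x0), f(x0)
    for k in range(n_iters):
        alpha = 1.0 / (k + 1)  # sum alpha_k diverges, sum alpha_k^2 converges
        x = [xi - alpha * gi for xi, gi in zip(x, subgrad(x))]
        if f(x) < best_f:
            best_f, best_x = f(x), list(x)
    return best_x, best_f

best_x, best_f = subgradient_method([3.0, -2.0], 2000)
print(best_f)  # a small value: f(x_best) approaches the optimum f(0) = 0
```

The iterates themselves keep oscillating around the kink at $0$, but with shrinking amplitude, which is exactly why the theorem is stated in terms of $x^k_\mathrm{best}$ rather than $x^k$.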

$\endgroup$
