$\begingroup$

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a coercive and strictly convex function.

I know that if $f \in \mathcal{C}^1$ (i.e., the first derivatives of $f$ are continuous), then for any initial guess $x_0 \in \mathbb{R}^n$, gradient descent with a suitably chosen step size converges to the unique global minimizer $x^*$ of $f$. However, it is possible for (sub)gradient descent to converge to the global minimizer of a function which is not $\mathcal{C}^1$ (e.g., $f = \|\cdot\|_1$).

Are there more general conditions on $f$ that guarantee convergence of gradient descent to the unique minimizer when the first derivatives of $f$ are discontinuous on at most a set of measure zero (so that $f$ still admits subgradients at these points)?

$\endgroup$
  • $\begingroup$ What step size are you using in your gradient descent method? Are you assuming that f is strictly or strongly convex? $\endgroup$ Commented Sep 27, 2017 at 18:16
  • $\begingroup$ AFAIK, step size does not matter in the first case I described. It would be interesting to know if that changes if $f$ only belongs to an appropriate Sobolev space rather than $\mathcal{C}^1$. It's not my intent to assume strict or strong convexity, as I'm not sure if those are necessary to guarantee convergence [to the unique minimizer] (though I'd fathom that strict convexity is sufficient). $\endgroup$ Commented Sep 27, 2017 at 18:21
  • $\begingroup$ Your understanding is incorrect. For example, $f(x) = \max(0,\, x^2 - 1)$ is convex and coercive, but the minimizer isn't unique, and you can't control which of the minimum points in $[-1,1]$ you'll end up at. Also, if you try to minimize $f(x) = x^2$ starting at $x^{(0)} = 1$ and using $x^{(n+1)} = x^{(n)} - f'(x^{(n)})$, you'll get $x^{(1)} = -1$, $x^{(2)} = 1$, $x^{(3)} = -1, \dots$ $\endgroup$ Commented Sep 27, 2017 at 20:50
  • $\begingroup$ Another example is $f(x) = |x|$. If you use the subgradient $g(x) = 1$ (if $x > 0$) and $g(x) = -1$ (if $x < 0$), and iterate using $x^{(n+1)} = x^{(n)} - 2g(x^{(n)})$ starting at $x^{(1)} = 1$, you'll get $x^{(2)} = -1$, $x^{(3)} = 1, \dots$ If you switch to $x^{(n+1)} = x^{(n)} - \epsilon\, g(x^{(n)})$ and start with $x^{(0)} = \epsilon/2$, you'll get a similar failure to converge. $\endgroup$ Commented Sep 27, 2017 at 20:53
  • $\begingroup$ Would it be correct to say the following? Let $f$ be strictly convex, coercive and $L$-Lipschitz, $L > 0$ some constant. Then for any $x^0 \in \mathbb{R}^n$ there exists a sequence $(\alpha_k)_{k\geq 0}$ with $\alpha_k \geq 0$ for all $k$, such that the gradient descent algorithm $x^{k+1} = x^k - \alpha_k g_k$ for $g_k \in \partial f(x^k)$ converges to the unique minimizer $x^*$ of $f$. $\endgroup$ Commented Sep 30, 2017 at 22:02
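The step-size failures described in the comments above are easy to reproduce numerically. Here is a minimal Python sketch (the helper names are my own, not from the thread) showing that a fixed-step subgradient iteration on $f(x) = |x|$ oscillates instead of converging:

```python
# Minimal sketch (hypothetical helper names): a fixed step size makes the
# subgradient iteration on f(x) = |x| oscillate instead of converging.

def subgrad_abs(x):
    """Return a subgradient of f(x) = |x| (any value in [-1, 1] is valid at 0)."""
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def fixed_step_iterates(x0, step, n):
    """Run n fixed-step subgradient iterations x <- x - step * subgrad_abs(x)."""
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] - step * subgrad_abs(xs[-1]))
    return xs

# Starting at x0 = step / 2, the iterates bounce between +step/2 and -step/2
# forever, so the method never reaches the minimizer x* = 0.
print(fixed_step_iterates(0.5, 1.0, 4))  # [0.5, -0.5, 0.5, -0.5, 0.5]
```

This matches the comments: with any constant step the iterates eventually overshoot the kink at $0$ and cycle, which is why the answer below requires diminishing step sizes.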

1 Answer

$\begingroup$

Theorem:
Let $f$ be a coercive, strictly convex and $L$-Lipschitz function, $L > 0$. Denote by $\partial f(x)$ the subdifferential of $f$ at the point $x \in \mathbb R^n$. For all $x^0 \in \mathbb R^n$, there exists $(\alpha_k)_{k\geq 0}\subseteq \mathbb R$ with $\alpha_k \geq 0$ for all $k$; $\sum_k \alpha_k^2 < \infty$; and $\sum_k \alpha_k = \infty$, such that the subgradient descent algorithm, $$ x^{k+1} := x^k - \alpha_k g^k, \qquad g^k \in \partial f(x^k) $$ converges to the optimal solution $x^*$ in the sense that $x^k_\mathrm{best} \xrightarrow{k\to \infty} x^*$, where $$ x^k_\mathrm{best} := \arg\min \{ f(w) : w = x^j, \ 0\leq j \leq k \}. $$

Proof.
The proof follows the standard subgradient-method analysis, similar to material accompanying Boyd and Vandenberghe's Convex Optimization book. By the definition of the iteration and the subgradient inequality $f(x^*) \geq f(x^k) + \langle g^k, x^* - x^k\rangle$, \begin{align*} \|x^{k+1} - x^*\|_2^2 &= \|x^k - x^*\|_2^2 - 2\alpha_k \langle g^k , x^k - x^*\rangle + \alpha_k^2 \|g^k\|_2^2 \\ &\leq \|x^k - x^*\|_2^2 - 2\alpha_k (f(x^k) - f(x^*)) + \alpha_k^2 \|g^k\|_2^2. \end{align*} Applying this bound recursively, \begin{align*} \|x^{k+1} - x^*\|_2^2 \leq \|x^0 - x^*\|_2^2 - 2\sum_{j\leq k} \alpha_j (f(x^j) - f(x^*)) + \sum_{j \leq k} \alpha_j^2 \|g^j\|_2^2. \end{align*} Now, $\|x^{k+1} - x^*\|_2^2 \geq 0$ and $\|x^0 - x^*\|_2 = R$ for some $R \geq 0$. Moreover, $f(x^k_\mathrm{best}) \leq f(x^j)$ for all $j \leq k$ by definition, so \begin{align*} 2 (f(x^k_\mathrm{best}) - f(x^*)) \sum_{j\leq k} \alpha_j \leq 2 \sum_{j\leq k} \alpha_j (f(x^j) - f(x^*)) \leq R^2 + \sum_{j\leq k} \alpha_j^2 \|g^j\|_2^2. \end{align*} Next, the subgradients are bounded: for $g \in \partial f(x)$ and any $y \in \mathbb R^n$, the subgradient inequality and $L$-Lipschitz continuity of $f$ give $\langle g, x-y\rangle \leq f(x) - f(y) \leq L \|x-y\|_2$; choosing $y = x - g$ yields $\|g\|_2^2 \leq L\|g\|_2$, hence $\|g\|_2 \leq L$. Therefore, the above expression can be rearranged as \begin{align*} f(x^k_\mathrm{best}) - f(x^*) \leq \frac{R^2 + L^2 \sum_{j\leq k} \alpha_j^2}{2\sum_{j\leq k} \alpha_j} \xrightarrow{k\to\infty} 0, \end{align*} since the numerator stays bounded ($\sum_k \alpha_k^2 < \infty$) while the denominator diverges ($\sum_k \alpha_k = \infty$). Finally, coercivity of $f$ makes its sublevel sets compact, so $(x^k_\mathrm{best})_k$ is bounded; every limit point $\bar x$ of this sequence satisfies $f(\bar x) = f(x^*)$ by continuity of $f$, and strict convexity makes the minimizer unique, so $\bar x = x^*$. Hence $x^k_\mathrm{best} \xrightarrow{k\to\infty} x^*$. $\square$
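As a numerical sanity check on the theorem, here is a small Python sketch (my own illustration, not part of the original answer) of the subgradient method with the diminishing step sizes $\alpha_k = 1/(k+1)$, which satisfy $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$, applied to $f(x) = \|x\|_1$. Note that $\|\cdot\|_1$ is not strictly convex, but it is coercive and $1$-Lipschitz with the unique minimizer $0$, so it makes a simple test case:

```python
# Illustrative sketch (helper names are my own): subgradient method with
# diminishing steps alpha_k = 1/(k+1) on f(x) = ||x||_1, tracking x_best.

def f(x):
    """f(x) = ||x||_1: convex and coercive, with unique minimizer at the origin."""
    return sum(abs(xi) for xi in x)

def subgrad(x):
    """A subgradient of the l1 norm: componentwise sign (0 chosen at 0)."""
    return [1.0 if xi > 0 else (-1.0 if xi < 0 else 0.0) for xi in x]

def subgradient_method(x0, n_iters):
    """Return (best iterate, best value) after n_iters diminishing-step updates."""
    x, best_x, best_f = list(x0), list(x0), f(x0)
    for k in range(n_iters):
        alpha = 1.0 / (k + 1)  # sum alpha_k diverges, sum alpha_k^2 converges
        x = [xi - alpha * gi for xi, gi in zip(x, subgrad(x))]
        if f(x) < best_f:
            best_f, best_x = f(x), list(x)
    return best_x, best_f

best_x, best_f = subgradient_method([3.0, -2.0], 2000)
print(best_f)  # a small value: f(x_best) approaches the optimum f(0) = 0
```

The iterates themselves keep oscillating around the kink at $0$, but with shrinking amplitude, which is exactly why the theorem is stated in terms of $x^k_\mathrm{best}$ rather than $x^k$.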

$\endgroup$
