$\begingroup$

In gradient descent, I know that local minima occur where the derivative of a function is zero, but with this loss function the derivative is zero only when the predicted output equals the true output (according to the update equation below).

So, when the predicted output equals the true output, the global minimum is reached! My question is: how can a local minimum occur, if a zero gradient happens only at the "perfect" fit?

$$\theta_j := \theta_j - {\alpha \over m} \sum_{i=1}^m (\hat y^i-y^i)x_j^i$$
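For concreteness, the update rule above can be sketched in plain Python (a minimal sketch; the data, learning rate, and function names are hypothetical, with $\theta_0$ as the intercept):

```python
# Minimal batch gradient descent for linear regression, following
# theta_j := theta_j - (alpha/m) * sum((y_hat - y) * x_j).
def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # intercept and slope
    for _ in range(steps):
        preds = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(preds, ys)) / m    # x_0 = 1
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical data lying exactly on y = 2x + 1:
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
# converges to intercept 1.0, slope 2.0
```

Here a perfect fit exists, so the gradient vanishes exactly at that fit.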

$\endgroup$

2 Answers

$\begingroup$

The equation you used for gradient descent isn't general; it's specific to linear regression.
In linear regression there is indeed only a single global minimum and no other local minima, but for more complex models the loss surface is more complicated, and local minima are possible.
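As a sketch of how a more complex model breaks this, consider a hypothetical one-parameter nonlinear model $\hat y = \sin(\theta x)$ with squared-error loss; a simple grid scan over $\theta$ finds several distinct local minima:

```python
import math

# Hypothetical data generated from the model with theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [math.sin(2.0 * x) for x in xs]

def loss(theta):
    # Squared-error loss of the nonlinear model y_hat = sin(theta * x).
    return sum((y - math.sin(theta * x)) ** 2 for x, y in zip(xs, ys))

# Scan theta over [0, 10] and collect interior local minima of the loss.
grid = [i * 0.01 for i in range(1001)]
vals = [loss(t) for t in grid]
minima = [grid[i] for i in range(1, len(vals) - 1)
          if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
# len(minima) > 1: the gradient vanishes at several values of theta,
# not only at the perfect fit theta = 2.
```

Even with a single parameter, the nonlinearity is enough to create multiple stationary points.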

$\endgroup$
  • $\begingroup$ But in backpropagation (neural networks), the derivative also equals zero when expected output = target, as in this equation (nonlinear sigmoid function): visualstudiomagazine.com/articles/2017/06/01/~/media/ECG/… Does this case also have a single global minimum? $\endgroup$ Commented Jul 27, 2020 at 10:07
  • $\begingroup$ output=target for each term in the sum is not the only way that the sum can equal zero. $\endgroup$ Commented Jul 27, 2020 at 10:14
  • $\begingroup$ Suppose training used a single example: that would give a single minimum, so in stochastic gradient descent there would be no local minima! $\endgroup$ Commented Jul 27, 2020 at 10:42
  • $\begingroup$ A single example can't be used for training if there's more than one parameter in the model... $\endgroup$ Commented Jul 29, 2020 at 5:56
$\begingroup$

The premise of “no minimum without a perfect fit” is incorrect.

Let's look at a simple example with square loss.

$$L(\hat{y}, y) = \sum_i (y_i-\hat{y}_i)^2$$

$$ (x_1, y_1) = (0,1)$$ $$ (x_2, y_2) = (1,2)$$ $$ (x_3, y_3) = (3,3)$$

We decide to model this with a line: $\hat{y}_i = \beta_0 + \beta_1 x_i$.

Let's optimize the parameters according to the loss function.

$$L(\hat{y}, y) = (1-(\beta_0 + \beta_1(0)))^2 + (2-(\beta_0 + \beta_1(1)))^2 + (3-(\beta_0 + \beta_1(3)))^2$$

Now we take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$ and do the usual calculus of minimization.

So we minimize the loss function, but we certainly do not have a perfect fit with a line.
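To make this concrete, here is the fit computed via the normal equations (a minimal sketch in Python; the variable names are my own):

```python
# Least-squares line for the three points above, via the normal equations.
xs = [0.0, 1.0, 3.0]
ys = [1.0, 2.0, 3.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

beta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope = 9/14
beta0 = (sy - beta1 * sx) / n                      # intercept = 8/7

# The residuals are nonzero (no perfect fit), yet the gradient of the
# loss vanishes there: the residuals sum to zero and are orthogonal to x.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
```

The minimizer is $(\beta_0, \beta_1) = (8/7, 9/14)$ with nonzero residuals, so a zero gradient does not require a perfect fit.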

$\endgroup$
  • $\begingroup$ This loss function does not have any local minima though... $\endgroup$ Commented Jul 28, 2020 at 18:14
  • $\begingroup$ The critical point is $(\hat{\beta}_0, \hat{\beta}_1) = (\frac{8}{7}, \frac{9}{14})$. The eigenvalues of the Hessian matrix at $(\frac{8}{7}, \frac{9}{14})$ are $13\pm\sqrt{113}$, both of which are $>0$, making the critical point a minimum. $\endgroup$ Commented Jul 28, 2020 at 22:31
  • $\begingroup$ Yes, of course it has a minimum - but it's a global minimum. OP asked about local minima other than the global minimum. $\endgroup$ Commented Jul 29, 2020 at 5:55
  • $\begingroup$ The idea that the gradient is zero only for the perfect fit is incorrect. $\endgroup$ Commented Jul 29, 2020 at 10:26
