$\begingroup$

In gradient descent, I know that local minima occur where the derivative of a function is zero, but with this loss function the derivative is zero only when the predicted output equals the true output (according to the update equation below).

So, when the predicted output equals the true output, the global minimum is reached! My question is: how can a local minimum occur, if a zero gradient happens only at the "perfect" fit?

$$\theta_j := \theta_j - {\alpha \over m} \sum_{i=1}^m (\hat y^i-y^i)x_j^i$$
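For concreteness, the update rule above can be sketched in plain Python (a minimal sketch; the data, learning rate, and function names are hypothetical, with $\theta_0$ as the intercept):

```python
# Minimal batch gradient descent for linear regression, following
# theta_j := theta_j - (alpha/m) * sum((y_hat - y) * x_j).
def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # intercept and slope
    for _ in range(steps):
        preds = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(preds, ys)) / m    # x_0 = 1
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical data lying exactly on y = 2x + 1:
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
# converges to intercept 1.0, slope 2.0
```

Here a perfect fit exists, so the gradient vanishes exactly at that fit.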

$\endgroup$

2 Answers

$\begingroup$

The equation you used for gradient descent isn't general; it's specific to linear regression.
In linear regression there is indeed only a single global minimum and no other local minima, but for more complex models the loss surface is more complicated, and local minima are possible.
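As a sketch of how a more complex model breaks this, consider a hypothetical one-parameter nonlinear model $\hat y = \sin(\theta x)$ with squared-error loss; a simple grid scan over $\theta$ finds several distinct local minima:

```python
import math

# Hypothetical data generated from the model with theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [math.sin(2.0 * x) for x in xs]

def loss(theta):
    # Squared-error loss of the nonlinear model y_hat = sin(theta * x).
    return sum((y - math.sin(theta * x)) ** 2 for x, y in zip(xs, ys))

# Scan theta over [0, 10] and collect interior local minima of the loss.
grid = [i * 0.01 for i in range(1001)]
vals = [loss(t) for t in grid]
minima = [grid[i] for i in range(1, len(vals) - 1)
          if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
# len(minima) > 1: the gradient vanishes at several values of theta,
# not only at the perfect fit theta = 2.
```

Even with a single parameter, the nonlinearity is enough to create multiple stationary points.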

$\endgroup$
  • $\begingroup$ But in backpropagation (neural networks), the derivative also equals zero when expected output = target, as in this equation (nonlinear sigmoid function): visualstudiomagazine.com/articles/2017/06/01/~/media/ECG/… Does this case also have a single global minimum? $\endgroup$ Commented Jul 27, 2020 at 10:07
  • $\begingroup$ output=target for each term in the sum is not the only way that the sum can equal zero. $\endgroup$ Commented Jul 27, 2020 at 10:14
  • $\begingroup$ Suppose training used a single example: that would give a single minimum, so in stochastic gradient descent there would be no local minima! $\endgroup$ Commented Jul 27, 2020 at 10:42
  • $\begingroup$ A single example can't be used for training if there's more than one parameter in the model... $\endgroup$ Commented Jul 29, 2020 at 5:56
$\begingroup$

The premise of “no minimum without a perfect fit” is incorrect.

Let's look at a simple example with square loss.

$$L(\hat{y}, y) = \sum_i (y_i-\hat{y}_i)^2$$

$$ (x_1, y_1) = (0,1)$$ $$ (x_2, y_2) = (1,2)$$ $$ (x_3, y_3) = (3,3)$$

We decide to model this with a line: $\hat{y}_i = \beta_0 + \beta_1 x_i$.

Let's optimize the parameters according to the loss function.

$$L(\hat{y}, y) = (1-(\beta_0 + \beta_1(0)))^2 + (2-(\beta_0 + \beta_1(1)))^2 + (3-(\beta_0 + \beta_1(3)))^2$$

Now we take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$ and do the usual calculus of minimization.

So we minimize the loss function, but we certainly do not have a perfect fit with a line.
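To make this concrete, here is the fit computed via the normal equations (a minimal sketch in Python; the variable names are my own):

```python
# Least-squares line for the three points above, via the normal equations.
xs = [0.0, 1.0, 3.0]
ys = [1.0, 2.0, 3.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

beta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope = 9/14
beta0 = (sy - beta1 * sx) / n                      # intercept = 8/7

# The residuals are nonzero (no perfect fit), yet the gradient of the
# loss vanishes there: the residuals sum to zero and are orthogonal to x.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
```

The minimizer is $(\beta_0, \beta_1) = (8/7, 9/14)$ with nonzero residuals, so a zero gradient does not require a perfect fit.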

$\endgroup$
  • $\begingroup$ This loss function does not have any local minima though... $\endgroup$ Commented Jul 28, 2020 at 18:14
  • $\begingroup$ The critical point is $(\hat{\beta}_0, \hat{\beta}_1) = (\frac{8}{7}, \frac{9}{14})$. The eigenvalues of the Hessian matrix at $(\frac{8}{7}, \frac{9}{14})$ are $13\pm\sqrt{113}$, both of which are $>0$, making the critical point a minimum. $\endgroup$ Commented Jul 28, 2020 at 22:31
  • $\begingroup$ Yes, of course it has a minimum - but it's a global minimum. OP asked about local minima other than the global minimum. $\endgroup$ Commented Jul 29, 2020 at 5:55
  • $\begingroup$ The idea that the gradient is zero only for the perfect fit is incorrect. $\endgroup$ Commented Jul 29, 2020 at 10:26
