
It is said that backpropagation, with Gradient Descent, seeks to minimize a cost function using the formula:

$$ W_{new} = W_{old} - \text{learningRate} \cdot \frac{\partial E}{\partial W} $$

My question is: if the derivative indicates in which direction the function (the graph of the error with respect to the weights) is decreasing, then why subtract an already negative gradient?

Why not let the current direction of the gradient (negative, let's say) be the driving factor for updating the weights:

$$ W_{new} = W_{old} + \text{learningRate} \cdot (-\text{gradient}) $$


4 Answers


Consider a simple example where the cost function is a parabola, $y=x^2$, which is convex (the ideal case) with a single global minimum at $x=0$.

Here $x$ is the independent variable and $y$ is the dependent variable; $x$ is analogous to the weights of the model that you are trying to learn.

This is what it looks like:

[Plot of the parabola $y = x^2$]

Let's apply gradient descent to this particular cost function (the parabola) to find its minimum.

From calculus it is clear that $dy/dx = 2x$. That means the gradient is positive in the $1^{st}$ quadrant and negative in the $2^{nd}$. So for every small positive step in $x$ that we take, we move away from the origin in the $1^{st}$ quadrant and towards the origin in the $2^{nd}$ quadrant (the step is still positive).

In the update rule of gradient descent, the negative sign negates the gradient and hence always moves the weights towards the local minimum.

  • $1^{st}$ quadrant -> the gradient is positive, but if you use it as it is you move away from the origin, i.e. the minimum. So the negative sign helps here.
  • $2^{nd}$ quadrant -> the gradient is negative, but if you use it as it is you move away from the origin, i.e. the minimum (you add two negative values). So the negative sign helps here as well, as the worked example below shows.
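
To make both cases concrete, here is a short worked example with a learning rate of $0.1$ (an arbitrary illustrative value):

$$ x = 2: \quad \frac{dy}{dx} = 4, \quad x_{new} = 2 - 0.1 \cdot 4 = 1.6 \quad (\text{towards } 0) $$

$$ x = -2: \quad \frac{dy}{dx} = -4, \quad x_{new} = -2 - 0.1 \cdot (-4) = -1.6 \quad (\text{towards } 0) $$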

Here is a small Python snippet to make things clearer:

    import numpy as np
    import matplotlib.pyplot as plt

    # plot the cost function y = x^2
    x = np.linspace(-4, 4, 200)
    y = x**2
    plt.xlabel('x')
    plt.ylabel('y = x^2')
    plt.plot(x, y)

    lr = 0.1                              # learning rate
    np.random.seed(20)
    x_start = np.random.normal(0, 2, 1)   # random starting point

    dy_dx_new = 2 * x_start               # gradient at the current point (dy/dx = 2x)
    dy_dx_old = np.inf                    # dummy previous gradient so the loop runs at least once
    tolerance = 1e-2

    # stop once the gradient has (almost) stopped changing
    while abs(dy_dx_new - dy_dx_old) > tolerance:
        dy_dx_old = dy_dx_new
        x_start = x_start - lr * dy_dx_old    # step *against* the gradient
        dy_dx_new = 2 * x_start
        plt.scatter(x_start, x_start**2)      # mark each iterate on the parabola
        plt.pause(0.5)

    plt.show()

[Plot of the gradient descent iterates stepping down the parabola towards $x = 0$]

  • Thank you for the kind answer. While I do see the point you're making with this practical example, in theory it just seemed hard for me to extrapolate the idea to other functions. But I think this example suffices. Commented Oct 1, 2020 at 11:35
  • This may also help: the gradient $g=\partial f/\partial x$ accounts for the magnitude of change in $f$ given a small change in $x$. Thereby, if we call $\delta x$ the update that we decide to use, the change in $f$ is given by $g^T\delta x$. The value that maximizes it is $\delta x=\alpha g$, so in order to decrease our function as much as possible, we use $\delta x = -\alpha g$. Commented Oct 1, 2020 at 20:14

Let $F : \mathbb{R}^{n} \rightarrow \mathbb{R}$ be a continuously differentiable function and $d \in \mathbb{R}^{n}$. Then $d$ is called a descent direction at position $p \in \mathbb{R}^{n}$ if there is an $R > 0$ such that $F(p+rd) < F(p)$ for all $r \in (0,R)$.

In simple terms: if we move $p$ in the direction of $d$, we can reduce the value of $F$.

Now $d$ is a descent direction at $p$ if $\nabla F(p)^T d < 0$: to first order, $F(p+rd) - F(p) \approx r \, \nabla F(p)^T d$, so in order to reduce the function value this quantity needs to be negative. More precisely:

For $f(r) := F(p+rd)$ we have $f'(r) = \nabla F(p+rd)^T d$ by the chain rule, so the assumption $\nabla F(p)^T d < 0$ means $f'(0) < 0$.

Since $f'(0) = \lim_{h \rightarrow 0} \frac{f(h)-f(0)}{h} < 0$, we have $f(h) < f(0)$, i.e. $F(p+hd) < F(p)$, for all sufficiently small $h > 0$, so $d$ is indeed a descent direction.

Therefore, setting $d := -\nabla F(p)$, we have $\nabla F(p)^T (-\nabla F(p)) = - ||\nabla F(p)||_{2}^{2} < 0 $, if $p$ is not a stationary point.

In particular, we can choose a $p' = p + r'd$ with $F(p') < F(p)$. This shows that using the negative gradient makes sense.
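
As a quick numerical sanity check (not part of the argument above; the function and point below are arbitrary illustrative choices), one can verify in Python that a small step along $d = -\nabla F(p)$ does reduce $F$:

    import numpy as np

    # illustrative choice: F(x) = x1^2 + 3*x2^2, a smooth convex function
    def F(x):
        return x[0]**2 + 3 * x[1]**2

    def grad_F(x):
        return np.array([2 * x[0], 6 * x[1]])

    p = np.array([1.0, -2.0])    # an arbitrary non-stationary point
    d = -grad_F(p)               # the descent direction d = -grad F(p)

    # F(p + r*d) should be smaller than F(p) for small step sizes r
    for r in [0.1, 0.05, 0.01]:
        print(r, F(p), F(p + r * d), F(p + r * d) < F(p))

For each of these step sizes the last column prints True, in line with $\nabla F(p)^T d = -||\nabla F(p)||_{2}^{2} < 0$.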


The gradient indicates in which direction the function is increasing, not decreasing, as demonstrated by @sai.

Let's consider a loss function

$$ L(x) = x^2 $$

[Plot of the loss function $L(x) = x^2$]

At the point $(1, 1)$ the derivative (slope) is positive, meaning that the function is increasing there, at a rate given by the slope.

If you want to minimize the loss, you need to control the independent variable $x$ (representing the model parameters) so as to reduce the loss. Control here means "to increase" or "to decrease" the parameter value. So how can you know the best choice? The trick is to look at the gradient, that is, the derivative of the loss function with respect to the parameters.

  • If the gradient is positive, you need to decrease the parameter value in order to reduce the loss.
  • If the gradient is negative, you need to increase the parameter value in order to reduce the loss.

This means that when updating the parameters you always go the opposite way of the gradient (i.e. attach a negative sign to the derivative term). A common parameter update formula is

$$ w_i = w_i - \alpha \frac{\partial L}{\partial w_i}$$

The $\alpha$ factor, called the learning rate, just takes a fraction of this gradient when updating the parameter.
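
A minimal sketch of this sign rule in Python (the loss $L(w) = w^2$, the starting points, and the learning rate are purely illustrative choices):

    # minimal sketch: L(w) = w^2, dL/dw = 2w, alpha = 0.1 (illustrative values)
    alpha = 0.1

    def dL_dw(w):
        return 2 * w

    for w in (1.0, -1.0):                 # start on either side of the minimum
        print(f"start w = {w:+.1f}")
        for _ in range(3):
            grad = dL_dw(w)
            w = w - alpha * grad          # positive grad -> w decreases, negative grad -> w increases
            print(f"  w = {w:+.3f}, L(w) = {w*w:.3f}")

In both cases the loss shrinks, because the update always moves $w$ opposite to the sign of the gradient.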


Computing the gradient gives you the direction in which the function increases the most. Consider $f : x \mapsto x^2$: the gradient at $x=1$ is $2$, so if you want to minimize the function you need to move in the direction of $-2$; likewise, at $x=-1$ the gradient is $-2$, so you move in the direction of $+2$.

And since gradients are usually vectors, it is not clear what a "positive" or "negative" gradient would even mean when the gradient is something like $(-1, 1)$.
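
To make that last point concrete, here is a small sketch (the function $f(x, y) = y^2/2 - x$ and the point are arbitrary choices that happen to produce the gradient $(-1, 1)$):

    import numpy as np

    # arbitrary example whose gradient has mixed signs: f(x, y) = y^2 / 2 - x
    def f(v):
        return v[1]**2 / 2 - v[0]

    def grad_f(v):
        return np.array([-1.0, v[1]])

    p = np.array([0.0, 1.0])
    g = grad_f(p)                    # gradient is (-1, 1): neither "positive" nor "negative"
    print(f(p), f(p - 0.1 * g))      # stepping along -g still decreases f (0.5 -> 0.305)

The sign of each component does not matter on its own; what matters is that $-\nabla f$ as a whole points downhill.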

https://builtin.com/data-science/gradient-descent

  • Same example, easy to understand! Why do you think this might not be the right place? Commented Oct 1, 2020 at 10:05
  • The question is highly related to data science, as gradient descent/ascent is such a widely used tool, but considering this page datascience.stackexchange.com/help/on-topic I think it is more related to Math Stack Exchange (minimization of a function) or AI Stack Exchange. Commented Oct 1, 2020 at 10:17
  • That's really just a minor remark; I could/should have put it at the end of my answer, I guess. Commented Oct 1, 2020 at 10:18
  • I understand... Commented Oct 1, 2020 at 10:48
  • No need to apologize, I might be wrong and it is always better to ask a question and be redirected if needed rather than not asking it in the first place ;) Commented Oct 1, 2020 at 11:44
