
It is said that backpropagation, with Gradient Descent, seeks to minimize a cost function using the formula:

$$ W_{new} = W_{old} - \text{learningRate} \cdot \frac{\partial E}{\partial W} $$

My question is: if the derivative indicates in which direction the function (the graph of the error with respect to the weights) is decreasing, then why subtract an already negative gradient?

Why not let the current direction of the gradient (negative, let's say) be the driving factor for updating the weights:

$$ W_{new} = W_{old} + \text{learningRate} \cdot (-\text{gradient}) $$


4 Answers


Consider a simple example where the cost function is a parabola, $y=x^2$, which is convex (the ideal case) with a single global minimum at $x=0$.

Here $x$ is the independent variable and $y$ is the dependent variable; $x$ is analogous to the weights of the model that you are trying to learn.

This is what it looks like:

[Plot of the parabola $y = x^2$]

Let's apply gradient descent to this particular cost function (the parabola) to find its minimum.

From calculus it is clear that $dy/dx = 2x$. That means the gradient is positive in the $1^{st}$ quadrant and negative in the $2^{nd}$. So for every small positive step in $x$ that we take, we move away from the origin in the $1^{st}$ quadrant and towards the origin in the $2^{nd}$ quadrant (the step is still positive).

In the update rule of gradient descent, the negative sign negates the gradient and hence always moves the weights towards the local minimum.

  • $1^{st}$ quadrant -> the gradient is positive, but if you use it as it is you move away from the origin, i.e. the minimum. So the negative sign helps here.
  • $2^{nd}$ quadrant -> the gradient is negative, but if you use it as it is you move away from the origin, i.e. the minimum (you add two negative values). So the negative sign helps here as well, as the worked example below shows.
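
To make both cases concrete, here is a short worked example with a learning rate of $0.1$ (an arbitrary illustrative value):

$$ x = 2: \quad \frac{dy}{dx} = 4, \quad x_{new} = 2 - 0.1 \cdot 4 = 1.6 \quad (\text{towards } 0) $$

$$ x = -2: \quad \frac{dy}{dx} = -4, \quad x_{new} = -2 - 0.1 \cdot (-4) = -1.6 \quad (\text{towards } 0) $$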

Here is a small Python snippet to make things clearer:

    import numpy as np
    import matplotlib.pyplot as plt

    # plot the cost function y = x^2
    x = np.linspace(-4, 4, 200)
    y = x**2
    plt.xlabel('x')
    plt.ylabel('y = x^2')
    plt.plot(x, y)

    lr = 0.1                              # learning rate
    np.random.seed(20)
    x_start = np.random.normal(0, 2, 1)   # random starting point

    dy_dx_new = 2 * x_start               # gradient at the current point (dy/dx = 2x)
    dy_dx_old = np.inf                    # dummy previous gradient so the loop runs at least once
    tolerance = 1e-2

    # stop once the gradient has (almost) stopped changing
    while abs(dy_dx_new - dy_dx_old) > tolerance:
        dy_dx_old = dy_dx_new
        x_start = x_start - lr * dy_dx_old    # step *against* the gradient
        dy_dx_new = 2 * x_start
        plt.scatter(x_start, x_start**2)      # mark each iterate on the parabola
        plt.pause(0.5)

    plt.show()

[Plot of the gradient descent iterates stepping down the parabola towards $x = 0$]

  • Thank you for the kind answer. While I do see the point you're making with this practical example, in theory it just seemed hard for me to extrapolate the idea to other functions. But I think this example suffices. Commented Oct 1, 2020 at 11:35
  • This may also help: the gradient $g=\partial f/\partial x$ accounts for the magnitude of change in $f$ given a small change in $x$. Thereby, if we call $\delta x$ the update that we decide to use, the change in $f$ is given by $g^T\delta x$. The value that maximizes it is $\delta x=\alpha g$, so in order to decrease our function as much as possible, we use $\delta x = -\alpha g$. Commented Oct 1, 2020 at 20:14

Let $F : \mathbb{R}^{n} \rightarrow \mathbb{R}$ be a continuously differentiable function and $d \in \mathbb{R}^{n}$. Then $d$ is called a descent direction at position $p \in \mathbb{R}^{n}$ if there is an $R > 0$ such that $F(p+rd) < F(p)$ for all $r \in (0,R)$.

In simple terms: if we move $p$ in the direction of $d$, we can reduce the value of $F$.

Now $d$ is a descent direction at $p$ if $\nabla F(p)^T d < 0$: to first order, $F(p+rd) - F(p) \approx r \, \nabla F(p)^T d$, so in order to reduce the function value this quantity needs to be negative. More precisely:

For $f(r) := F(p+rd)$ we have $f'(r) = \nabla F(p+rd)^T d$ by the chain rule, so the assumption $\nabla F(p)^T d < 0$ means $f'(0) < 0$.

Since $f'(0) = \lim_{h \rightarrow 0} \frac{f(h)-f(0)}{h} < 0$, we have $f(h) < f(0)$, i.e. $F(p+hd) < F(p)$, for all sufficiently small $h > 0$, so $d$ is indeed a descent direction.

Therefore, setting $d := -\nabla F(p)$, we have $\nabla F(p)^T (-\nabla F(p)) = - ||\nabla F(p)||_{2}^{2} < 0 $, if $p$ is not a stationary point.

In particular, we can choose a $p' = p + r'd$ with $F(p') < F(p)$. This shows that using the negative gradient makes sense.
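
As a quick numerical sanity check (not part of the argument above; the function and point below are arbitrary illustrative choices), one can verify in Python that a small step along $d = -\nabla F(p)$ does reduce $F$:

    import numpy as np

    # illustrative choice: F(x) = x1^2 + 3*x2^2, a smooth convex function
    def F(x):
        return x[0]**2 + 3 * x[1]**2

    def grad_F(x):
        return np.array([2 * x[0], 6 * x[1]])

    p = np.array([1.0, -2.0])    # an arbitrary non-stationary point
    d = -grad_F(p)               # the descent direction d = -grad F(p)

    # F(p + r*d) should be smaller than F(p) for small step sizes r
    for r in [0.1, 0.05, 0.01]:
        print(r, F(p), F(p + r * d), F(p + r * d) < F(p))

For each of these step sizes the last column prints True, in line with $\nabla F(p)^T d = -||\nabla F(p)||_{2}^{2} < 0$.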


The gradient indicates in which direction the function is increasing, not decreasing, as demonstrated by @sai.

Let's consider a loss function

$$ L(x) = x^2 $$

[Plot of the loss function $L(x) = x^2$]

At the point $(1, 1)$ the derivative (slope) is positive, meaning that the function is increasing there, at a rate given by the slope.

If you want to minimize the loss, you need to control the independent variable $x$ (representing the model parameters) so as to reduce the loss. Control here means "to increase" or "to decrease" the parameter value. So how can you know the best choice? The trick is to look at the gradient, that is, the derivative of the loss function with respect to the parameters.

  • If the gradient is positive, you need to decrease the parameter value in order to reduce the loss.
  • If the gradient is negative, you need to increase the parameter value in order to reduce the loss.

This means that when updating the parameters you always go the opposite way of the gradient (i.e. attach a negative sign to the derivative term). A common parameter update formula is

$$ w_i = w_i - \alpha \frac{\partial L}{\partial w_i}$$

The $\alpha$ factor, called the learning rate, just takes a fraction of this gradient when updating the parameter.
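
A minimal sketch of this sign rule in Python (the loss $L(w) = w^2$, the starting points, and the learning rate are purely illustrative choices):

    # minimal sketch: L(w) = w^2, dL/dw = 2w, alpha = 0.1 (illustrative values)
    alpha = 0.1

    def dL_dw(w):
        return 2 * w

    for w in (1.0, -1.0):                 # start on either side of the minimum
        print(f"start w = {w:+.1f}")
        for _ in range(3):
            grad = dL_dw(w)
            w = w - alpha * grad          # positive grad -> w decreases, negative grad -> w increases
            print(f"  w = {w:+.3f}, L(w) = {w*w:.3f}")

In both cases the loss shrinks, because the update always moves $w$ opposite to the sign of the gradient.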


Computing the gradient gives you the direction in which the function increases the most. Consider $f : x \mapsto x^2$: the gradient at $x=1$ is $2$, so if you want to minimize the function you need to move in the direction of $-2$; likewise, at $x=-1$ the gradient is $-2$, so you move in the direction of $+2$.

And since gradients are usually vectors, it is not clear what a "positive" or "negative" gradient would even mean when the gradient is something like $(-1, 1)$.
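
To make that last point concrete, here is a small sketch (the function $f(x, y) = y^2/2 - x$ and the point are arbitrary choices that happen to produce the gradient $(-1, 1)$):

    import numpy as np

    # arbitrary example whose gradient has mixed signs: f(x, y) = y^2 / 2 - x
    def f(v):
        return v[1]**2 / 2 - v[0]

    def grad_f(v):
        return np.array([-1.0, v[1]])

    p = np.array([0.0, 1.0])
    g = grad_f(p)                    # gradient is (-1, 1): neither "positive" nor "negative"
    print(f(p), f(p - 0.1 * g))      # stepping along -g still decreases f (0.5 -> 0.305)

The sign of each component does not matter on its own; what matters is that $-\nabla f$ as a whole points downhill.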

https://builtin.com/data-science/gradient-descent

  • Same example, easy to understand! Why do you think this might not be the right place? Commented Oct 1, 2020 at 10:05
  • The question is highly related to data science, as gradient descent/ascent is such a widely used tool, but considering this page datascience.stackexchange.com/help/on-topic I think it is more related to Math Stack Exchange (minimization of a function) or AI Stack Exchange. Commented Oct 1, 2020 at 10:17
  • That's really just a minor remark; I could/should have put it at the end of my answer, I guess. Commented Oct 1, 2020 at 10:18
  • I understand... Commented Oct 1, 2020 at 10:48
  • No need to apologize, I might be wrong and it is always better to ask a question and be redirected if needed rather than not asking it in the first place ;) Commented Oct 1, 2020 at 11:44
