Why is gradient the direction of steepest ascent?

Question

$$f(x_1,x_2,\dots, x_n):\mathbb{R}^n \to \mathbb{R}$$ The definition of the gradient is $$ \frac{\partial f}{\partial x_1}\hat{e}_1 +\ \cdots +\frac{\partial f}{\partial x_n}\hat{e}_n$$

which is a vector.

Reading this definition makes me consider that each component of the gradient corresponds to the rate of change with respect to my objective function if I go along with the direction $\hat{e}_i$.

But I can't see why this vector (defined by the definition of the gradient) has anything to do with the steepest ascent.

Why do I get maximal value again if I move along with the direction of gradient?

AsinglePANCAKE · Accepted Answer · 2017-12-24 02:45:10Z

Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched then to wonder, how fast the function might be changing with respect to some arbitrary direction? Letting $\vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $\text{grad}( f(a))\cdot \vec v$. This is a fairly common definition of the directional derivative.

We can then ask in what direction is this quantity maximal? You'll recall that $$\text{grad}( f(a))\cdot \vec v = |\text{grad}( f(a))|| \vec v|\text{cos}(\theta)$$

Since $\vec v$ is unit, we have $|\text{grad}( f)|\text{cos}(\theta)$, which is maximal when $\cos(\theta)=1$, in particular when $\vec v$ points in the same direction as $\text{grad}(f(a))$.

I was wondering, but how to you that grad(f(a)) gives the steepest change? How do you know there is not other vector that moving in its direction might lead to a steeper change? — Charlie Parker
– Charlie Parker, Commented Feb 25, 2014 at 1:51
so we multiply the gradient with an arbitrary vector and then say that the product is maximum when the vector points in the same direction as the gradient? How does that answer the question? — novice
– novice, Commented Sep 3, 2016 at 0:30
@novice It answers the question since the scalar product EQUALS the rate of change of $f$ along the direction of the unit vector. So, in maximizing the product, the rate of increase in $f$ is likewise maximized. — Mark Viola
– Mark Viola, Commented Sep 3, 2016 at 1:16
Using the 'geometric definition' of the dot product was a boss move — Prince M
– Prince M, Commented May 24, 2019 at 21:05

MGA · Accepted Answer · 2015-05-06 15:46:08Z

Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).

Let $f(\mathbf{x}):\mathbb{R}^n \rightarrow \mathbb{R}$. The partial derivatives of $f$ are the rates of change along the basis vectors of $\mathbf{x}$:

$\textrm{rate of change along }\mathbf{e}_i = \lim_{h\rightarrow 0} \frac{f(\mathbf{x} + h\mathbf{e}_i)- f(\mathbf{x})}{h} = \frac{\partial f}{\partial x_i}$

Each partial derivative is a scalar. It is simply a rate of change.

The gradient of $f$ is then defined as the vector:

$\nabla f = \sum_{i} \frac{\partial f}{\partial x_i} \mathbf{e}_i$

We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $\mathbf{v}$ be such a vector, i.e., $\mathbf{v} = \sum_{i} \alpha_i \mathbf{e}_i$ where $\sum_{i} \alpha_i^2 = 1$. Then:

$\textrm{rate of change along }\mathbf{v} = \lim_{h\rightarrow 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h}$

Again, this quantity is a scalar.

Now, it can be proven that if $f$ is differentiable at $\mathbf{x}$, the limit above evaluates to: $(\nabla f) \cdot \mathbf{v}$. This is a dot product of two vectors, which returns a scalar.

We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $\mathbf{v}$ is maximized when $\mathbf{v}$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.

Are you missing a square root around $\sum_i \alpha_i^2$? The way I understand it, what you're trying to say is that the magnitude of $v$ should be 1, which only happens if the square root of the sum of alphas squared is 1 (I could easily be completely wrong, my apologies if that's the case). I like your explanation overall regardless. — jeremy radcliff
– jeremy radcliff, Commented Sep 24, 2016 at 23:08
@jeremyradcliff Yes exactly, I'm saying the magnitude should be 1. I left out the square root precisely because $1^2 =1$. — MGA
– MGA, Commented Sep 26, 2016 at 16:23
I like this anwer a lot, and my intuition also was, that the gradient points in the direction of greatest change. But that is not the same as ascent. E.g. in the gradient descent algorithm one always uses the negative gradient, suggesting ascent but not descent. Can you explain why the gradient always points to the greatest ascent? — lpnorm
– lpnorm, Commented Oct 21, 2021 at 15:46

Jonathan · Accepted Answer · 2012-10-29 04:12:55Z

45

Consider a Taylor expansion of this function, $$f({\bf r}+{\bf\delta r})=f({\bf r})+(\nabla f)\cdot{\bf\delta r}+\ldots$$ The linear correction term $(\nabla f)\cdot{\bf\delta r}$ is maximized when ${\bf\delta r}$ is in the direction of $\nabla f$.

answered Oct 29, 2012 at 4:12

Jonathan

8,6182 gold badges26 silver badges40 bronze badges

5

$\begingroup$ Hi! I'm sorry, but could you please clarify why maximizing the linear term ammounts to finding the direction of greatest increase? Is this an informal argument, along the lines of "the linear term of the Taylor series is the most dominant, so if we want to maximize $f(r+\delta r),$ we should maximize the linear term"? Thanks $\endgroup$

Ovi
– Ovi

2020-05-10 05:34:11 +00:00
Commented May 10, 2020 at 5:34
$\begingroup$ Agree with Ovi. This gives some intuition but doesn't prove anything $\endgroup$

JobHunter69
– JobHunter69

2023-07-16 19:55:48 +00:00
Commented Jul 16, 2023 at 19:55
2

$\begingroup$ underrated. + 1 $\endgroup$

mick
– mick

2023-11-10 12:10:48 +00:00
Commented Nov 10, 2023 at 12:10
1

$\begingroup$ This should be the correct answer. $\endgroup$

Moher
– Moher

2024-04-16 20:27:15 +00:00
Commented Apr 16, 2024 at 20:27

Add a comment |

BobaFret · Accepted Answer · 2012-10-29 04:22:16Z

The question you're asking can be rephrased as "In which direction is the directional derivative $\nabla_{\hat{u}}f$ a maximum?".

Assuming differentiability, $\nabla_{\hat{u}}f$ can be written as:

$$\nabla_{\hat{u}}f = \nabla f(\textbf{x}) \cdot \hat{u} =|\nabla f(\textbf{x})||\hat{u}|\cos \theta = |\nabla f(\textbf{x})|\cos \theta$$

which is a maximum when $\theta =0$: when $\nabla f(\textbf{x})$ and $\hat{u}$ are parallel.

Shai · Accepted Answer · 2020-05-16 20:12:33Z

I know this is an old question, and it already has many great answers, but I still think there is more geometric intuition that can be added.

In this answer, we consider for simplicity the surface $z = f(x,y)$ and imagine taking the gradient of $z$ at the origin. Let the $xy$-plane be $\Pi$ and let the tangent plane to the surface at the origin by $\Pi'$.

Now, let $$ \vec{D_x} = \left( \begin{array}{c} 1 \\ 0 \\ \partial z / \partial x \end{array} \right), \quad \vec{D_y} = \left( \begin{array}{c} 0 \\ 1 \\ \partial z / \partial y \end{array} \right) $$ be the tangent vectors in the $x$ and $y$ directions (i.e. the basis of $\Pi'$). Then the normal to $\Pi'$ by the cross product is $$ \vec{n} = \left( \begin{array}{c} - \partial z / \partial x \\ - \partial z / \partial y \\ 1 \end{array} \right) $$ How did the $ \partial z / \partial x $ from $\vec{Dx}$ get into the first component of $\vec{n}$? That becomes clear when you look at this picture, and imagine $\Pi$ rotating to become $\Pi'$ Note that I have drawn a surface with $\partial z / \partial y = 0$ just for simplicity. You will notice that the normal vector contains $ - \partial z / \partial x $ because $\vec{k}$ 'rotates' by that much in the $x$ direction to point along $\vec{n}$, a bit like turning a joystick to rotate $\Pi$ onto $\Pi'$. Notice also that this means the $y$-axis is the axis of rotation. With this simplified geometry, you can imagine why moving through the tangent plane in the direction of the $x$ axis gives the greatest change in $z$ (rotate $\vec{D_x}$ in a circle: the tip can only lose altitude).

If we nudge the curve up a bit with respect to $y$ (add some $\partial y / \partial z$) then $\vec{n}$ would be nudged away in the $y$ direction and the ideal direction would correspondingly get nudged towards us in the $y$ direction, as below.

And here is the picture from a different perspective with a unit circle in the tangent plane drawn, which hopefully helps further elucidate the relationship between the ideal direction and the values of $\partial z / \partial x$ and $\partial z / \partial y$ (i.e. $\nabla z$). I have removed the surface entirely.

The intuitions obviously break down in higher dimensions and we must finally surrender to analysis (Cauchy Schwarz or Taylor expansions) but in 3D at least we can get a sense of what the analysis is telling us.

I really want to like your answer because it is beautifully made, but I find it hard to understand what you are even doing. Can you add a short explication in the beginning as to what you actually want to show? — lpnorm
– lpnorm, Commented Oct 21, 2021 at 15:50

whateverguy · Accepted Answer · 2013-03-18 14:43:46Z

Each component of the derivative $$ \frac{\partial f}{\partial x_1}\ ... \frac{\partial f}{\partial x_n}$$ tells you how fast your function is changing with respect to the standard basis.
It's now possible to make a basetransformation to an orthogonal base with $ n-1 $ base Directions with $0$ ascent and the gradient direction. In such a base the gradient direction must be the steepest since any adding of other base directions adds length but no ascent.

For a 3 dimensional Vector space the base could look like this $$ \left( \left( \begin{matrix} \partial x_2 \\ -\partial x_1 \\ 0 \end{matrix} \right) \left( \begin{matrix} \partial x_1 \\ \partial x_2 \\ -\dfrac{(\partial x_1)²+(\partial x_2)²}{\partial x_3} \end{matrix} \right) \left( \begin{matrix} \partial x_1 \\ \partial x_2 \\ \partial x_3 \end{matrix} \right) \right) $$ By complete induction it can now be shown that such a base is constructable for an n-Dimensional Vector space. $$ \left( \left( \begin{matrix} \partial x_2 \\ -\partial x_1 \\ 0 \\ 0 \end{matrix} \right) \left( \begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ -\dfrac{(\partial x_1)²+(\partial x_2)²}{\partial x_3} \\ 0 \end{matrix} \right) \left( \begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ \color{green}{\partial x_3} \\ -\dfrac{(\partial x_1)²+(\partial x_2)²+(\partial x_3)²}{\partial x_4} \end{matrix} \right) \left(\begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ \color{green}{\partial x_3} \\ \color{orange}{\partial x_4} \end{matrix} \right) \right) $$ One can see here that the first Basevector demands the first 2 Elements of the following Basevectors to be $\partial x_1$ & $\partial x_2$ because of the orthogonal condition,
similarly the 2nd vector demands all the 3rd elements of the following vectors to be $\partial x_3$
as does the 3rd vector for the 4th element them being $\partial x_4$.

If another dimension is added the n+1 Element of the n$th$ Vector needs to be $$-\dfrac{(\partial x_1)²+...+(\partial x_n)²}{\partial x_{n+1}}$$ to meet the $0$ ascension condition which in turn forces the new n+1$th$ Vector to be of the form $$\left(\begin{matrix}\partial x_1 \\ ... \\ \partial x_{n+1}\end{matrix}\right)$$ for it to be orthogonal to the rest.

XGS · Accepted Answer · 2018-08-10 17:59:03Z

Just want to further clarify why the gradient provides the steepest ascent (instead of descent) here. Any differentiable $f$ can be approximated by the linear tangent plane, i.e., $$f(\mathbf{x} + h \mathbf{v}) = f(\mathbf{x}) + h \, \nabla f(\mathbf{x})^T \mathbf{v} $$ as $h \rightarrow 0$ for any unit-length direction $\mathbf{v}$ with $\parallel \mathbf{v} \parallel =1.$ As $h \downarrow 0$, consider the amount of change $$ f(\mathbf{x} + h \mathbf{v}) - f(\mathbf{x}) = h \, \left\{ \, \nabla f(\mathbf{x})^T \mathbf{v} \right\} ~~\in~~ \left[ - h \, \parallel \nabla f(\mathbf{x}) \parallel, ~ h \, \parallel \nabla f(\mathbf{x}) \parallel \right] $$ by Cauchy-Swcharz inequality, which reaches its maximum (increase) $(h \, \parallel \nabla f(\mathbf{x}) \parallel)$ when $\mathbf{v} = \nabla f(\mathbf{x}) / \parallel \nabla f(\mathbf{x}) \parallel$ and its minimum (i.e., maximum decrease) $ (-h \, \parallel \nabla f(\mathbf{x}) \parallel) $ if $ \mathbf{v}= - \nabla f(\mathbf{x})/\parallel \nabla f(\mathbf{x}) \parallel$ (the negative gradient direction).

Tera · Accepted Answer · 2022-12-11 12:19:07Z

Let $\vec v$ be an arbitrary unit vector. Then the change of $f$ by moving in the direction of $v$, starting in point $a$, is given by $\text{grad}( f(a)) \cdot \vec v$. We want to find a $\vec v$ for which this inner product is maximal. For the inner product we have the Cauchy–Schwarz inequality $\vec a \cdot \vec b \leq \vert\vec a\vert\vert\vec b\vert$. Now the equality holds when $\vec v = \lambda \; \text{grad}(f(a))$, for some $\lambda \in \mathbb{R}$.

Isaac Lagaris · Accepted Answer · 2018-03-28 16:26:11Z

Let $v=\frac{s}{|s|}$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^T\nabla f(x) <0$. Then $f(x+\lambda v)$ as a function of $\lambda$, describes how this function changes along the direction $v$.

The rate of descent at $x$ along $v$ is given by: $$ \frac{d}{d \lambda}f(x+\lambda v)|_{\lambda=0} = v^T \nabla f(x) =\frac{s^T}{|s|}\nabla f(x) \equiv \frac{s^T}{|s|}g$$ So we want to find the maximum of this quantity as a function of $s$. Differentiating the above wrt $s$ and setting it equal to zero, we get (noting that $\nabla_s|s| =\frac{s}{|s|}$): $g=(g^T v)v\equiv av$.

Taking the Euclidean norm: $|g|=|a||v|=|a| \Rightarrow a=\pm|g|$.

We choose the minus sign to satisfy that $v$ is descent. Hence the direction of the steepest descent is $$ v= \dfrac{1}{a}g = -\dfrac{g}{|g|}$$

Spectre · Accepted Answer · 2020-09-21 15:09:29Z

Sorry for posting so late, but I found that a few more details added to the first post made it easier for me to understand, so I thought about posting it here, also

Let $\vec{n}$ be a unit vector oriented in an arbitrary direction and $T(x_{0}, y_{0}, z_{0})$ a scalar function which describes the temperature at the point $(x_{0}, y_{0}, z_{0})$ in space. The directional derivative of $T$ along this direction would be $$\frac{\partial T}{\partial \vec{n}} = \nabla T \cdot \vec{n} = \| \nabla T \| cos(\theta)$$, where $\theta$ is the angle between the gradient vector and the unit vector $\vec{n}$.

Now, consider three cases:

$\theta =0$ - steepest increase In this case, $$\nabla T \cdot \vec{n} = \| \nabla T \|$$ Now multiply this equation by $\nabla T$ and you get $$ \| \nabla T \| ^{2} \vec{n} =\| \nabla T \| \nabla T $$, so if you divide by $ \| \nabla T \| ^{2}$, you get that $$ \vec{n}= \frac{\nabla T}{\| \nabla T \|}$$ Let's look at that for a moment: the direction in space ($\vec{n}$) for which you get the steepest increase ($\theta=0$) is in the same direction and has the same orientation as the gradient vector (since the multiplying factor is just a positive constant). That means that the gradient's orientation coincides with the direction of steepest increase (steepest increase because the directional derivative has the maximum value it can have)
$\theta=\pi$ - steepest decrease In this case you get $$ \vec{n}= -\frac{\nabla T}{\| \nabla T \|}$$ So the gradient's orientation is opposite to that of steepest decrease (steepest decrease because the directional derivative has the "most negative" value)
$\theta=\pi /2$ - no change Here you get that the dot product between the direction defined by $\vec{n}$ and the gradient's one is 0, so you have no change in the field (because the directional derivative is 0). Interesting, along the direction which is perpendicular to the gradient vector you have constant values for the scalar function, $T$. Which makes sense, since the gradient field is perpendicular to the contour lines

vonjd · Accepted Answer · 2021-01-12 10:05:52Z

To give some intuition why the gradient (technically the negative gradient) has to point in the direction of steepest descent I created the following animation.

It shows all of the points that can be reached by a vector of a given length and two variables $x$ and $y$ that are multiplied by a constant and summed up to give a very simple linear function (which give very simple directional derivatives).

I then vary the constants relative to each other: when the constant of $x$ goes up (down) the constant of $y$ goes down (up). The red area equals the highest point which means that you have the steepest descent from there.

As can be seen, this point varies smoothly with the proportion of the constants which represent the derivatives in each direction!

Only when one constant equals zero do we have a corner solution, when both constants are the same the red area is exactly in the middle. There is no good reason why the red area (= steepest descent) should jump around between those points.

This means that the gradient will always point in the direction of the steepest descent (nb: which is of course not a proof but a hand-waving indication of its behaviour to give some intuition only!)

For a little bit of background and the code for creating the animation see here: Why Gradient Descent Works (and How To Animate 3D-Functions in R).

Kenneth Glazer · Accepted Answer · 2019-05-07 23:06:18Z

I think the most intuitive explanation is the following: Draw one vector in the X direction. Draw another vector in the Y direction. Make each vector any length you want. Make X longer than Y or Y longer than X or make them the same length. Doesn't matter. Now draw the vector that represents the sum of those two vectors, using the rectangle method. This is key to the intuition. You have to draw a rectangle. The resulting vector (representing the sum of the X-direction vector and the Y-direction vector) goes from one corner of the rectangle to the opposite corner. Remember that the length of the resulting vector represents the magnitude, which in this context represents slope. Now look at the drawing and ask yourself: is there any vector within this rectangle, starting at the origin, that is longer than the diagonal one? No. Any vector you draw that starts at the origin and goes to another side of the rectangle is shorter than the diagonal one. Of course, you can draw a vector at the origin that is longer than the diagonal one, but only if that vector leaves the rectangle. But if it leaves the rectangle, then it's no longer operating within the constraint of the given X vector and Y vector; it's working with some different X vector and Y vector. I hope that helps. It took me a while to understand this intuitively and the image of a diagonal within a rectangle finally switched the lamp on.

It might help to mention what this has to do with the gradient, other than it being a vector. — snulty
– snulty, Commented May 7, 2019 at 23:31
Presumably your X and Y here are meant to represent the partial derivatives $\partial{f}/\partial{x}$ and $\partial{f}/\partial{y}$, and the vector you're drawing is meant to indicate the direction and length of a candidate step? Then it's not the length of that vector $(x, y)$ that you're trying to maximize, but the sum $x \cdot X + y \cdot Y$ (where $(x, y)$ is constrained to the unit circle), and the way to maximize this is given by the other answers here. — A_P
– A_P, Commented Aug 23, 2019 at 19:39

Jiahe Liu · Accepted Answer · 2022-01-25 01:47:57Z

The key is the linear approximation of the function $f$.

First you must realize that near a given point $(x_1, x_2, ..,x_n)$, the change of $f$ is dominated by its first order partial derivatives. It can be approximated by a plane near that point:

$$ \Delta f(x_1+\Delta x_1, .. , x_n+\Delta x_n)=\frac{\partial f}{\partial x_1}\Delta x_1 + ... + \frac{\partial f}{\partial x_n}\Delta x_n $$

which is just the dot product between the gradient vector $\nabla f$ and the "change vector" $(\Delta x_1, .., \Delta x_n)$. So for a given length of the the "change vector", $\Delta f$ is the greatest when the change vector is in the same direction as the gradient.

Joe Shmo · Accepted Answer · 2024-12-24 04:13:48Z

Let $v \in \Bbb R^n$ be unit. By the Cauchy-Schwarz inequality,

$$ \left | \nabla_v f(x) \right | = \left | \nabla f(x) \cdot v \right | \le \| \nabla f(x) \| \cdot \| v \| = \| \nabla f(x) \| \cdot 1 = \| \nabla f(x) \| $$ Recall that equality is achieved $\iff v = c \cdot \nabla f(x) = \dfrac {\nabla f(x)}{\|\nabla f(x)\|}$ (since $v$ is unit). So that $\nabla f(x)$ is the direction of steepest ascent.

Nick Alger · Accepted Answer · 2025-08-12 18:38:12Z

This confused me when I first took calculus. After some years, and learning more advanced concepts, I realized that the confusion boiled down to the way that vectors and linear functionals (linear maps that take a vector as input and return a scalar as output) are conflated in typical calculus classes. The derivative is naturally a linear functional. The gradient, as a vector, only makes sense as the Riesz representative of the derivative with respect to an inner product.

This answer is, perhaps, at a higher level than the question asks. However, this is the logical way to think about it that eventually made sense to me.

Definition 1: The first derivative, $Df(x)$, is the linear functional that best approximates $v \mapsto f(x + v) - f(x)$ for small $v$.

Definition 2: The gradient, $\nabla f(x)$, is the Riesz representative representative of the first derivative. That is, the vector satisfying $$ Df(x)v = (\nabla f(x), v) $$ for all vectors $v \in \mathbb{R}^n$, where $(\cdot, \cdot)$ is inner product.

Theorem 1: The gradient (as defined in Definition 2) points in the steepest direction.

Theorem 2: If the inner product $(\cdot,\cdot)$ is the dot product, then \begin{equation*} \nabla f(x) = \left(\frac{df}{dx_1}(x), \frac{df}{dx_2}(x), \dots, \frac{df}{dx_n}(x)\right). \end{equation*}

If you start with the formula for $\nabla f(x)$ in Theorem 2 as the definition (as posed in the question), you just turn around the logic a bit. From Theorem 2, $\nabla f(x)$ is the Riesz representation of $Df(x)$ under the dot product, and from Theorem 1, the Riesz representative of $Df(x)$ is the direction of steepest ascent.

There is another interesting thing this perspective illustrates: if you change the inner product, the gradient changes. However, the inner product defines the notion of distance in the domain, so changing the inner product affects the meaning of steepness. If you walk one meter horizontally and go up one meter vertically, that is pretty steep. If you walk one kilometer horizontally and go up one meter vertically, that is pretty flat. Somehow these two changes exactly cancel, so the gradient is still the direction of steepest ascent!

Proof of Theorem 1: The direction of steepest ascent is \begin{equation*} \max_{||v||=1} Df(x)v = (\nabla f(x), v), \end{equation*} and $(\nabla f(x), v)$ is maximized when $v=\nabla f(x)$ by the Cauchy-Schwarz inequality. $\square$

Proof of Theorem 2: If $e_i=(0,\dots,0,1,0,\dots,0)$ is the unit vector which has a one in the $i$th position and zeros elsewhere, then the limit definition of the derivative in 1D implies $$ \frac{df}{dx_i}(x) = Df(x)e_i. $$ The ith component of a vector is the dot product of that vector with the unit vector $e_i$. Hence, \begin{equation*} \left(\nabla f(x)\right)_i = (\nabla f(x), e_i) = Df(x) e_i = \frac{df}{dx_i}(x). ~\square \end{equation*}

Stack Exchange Network

Why is gradient the direction of steepest ascent?

15 Answers 15

You must log in to answer this question.

Linked

Hot Network Questions

Why is gradient the direction of steepest ascent?

15 Answers 15

You must log in to answer this question.

Linked

Related

Hot Network Questions