
Consider $f:\mathbb{R}^d\to\mathbb{R}$ and $g:\mathbb{R}\to\mathbb{R}^d$.

It is known that $$ \tag{*} (f\circ g)'(x) = \sum_{i=1}^d \partial_i f(g(x)) \cdot g'(x)^i $$
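
(For a concrete sanity check of $(*)$ with $d=2$ and illustrative choices: take $f(u,v) = uv$ and $g(x) = (x, x^2)$, so $(f\circ g)(x) = x^3$ and $(f\circ g)'(x) = 3x^2$, while the right-hand side gives $\partial_1 f(g(x))\,g'(x)^1 + \partial_2 f(g(x))\,g'(x)^2 = x^2\cdot 1 + x\cdot 2x = 3x^2$.)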

I would like to prove this from the chain rule for total derivatives:

$$ \tag{**} D_x(f\circ g) = D_{g(x)}f \circ D_xg $$


I'm not sure how to proceed rigorously here. Intuitively I know that the two total differentials in the total-derivative chain rule can be represented as matrices, and that their composition will correspond to the multiplication in the desired partial-derivative chain rule. But I'm not sure how to get there. How does the function composition get converted into a sum of products?

Another related issue: the expression $(*)$ is a real number when evaluated at $x$, whereas $(**)$, evaluated at $x$, is a linear map $\mathbb{R}\to\mathbb{R}$. It's not too difficult to understand that the linear maps $\mathbb{R}\to\mathbb{R}$ (the dual space of $\mathbb{R}$) are isomorphic to $\mathbb{R}$ itself. But this is still a bit of a technical stumbling block on the way to my desired result.


As a note, the derivative language may even be overkill for what I'm trying to understand. If instead we had linear maps $S:\mathbb{R}^d \to \mathbb{R}$ and $T:\mathbb{R}\to\mathbb{R}^d$, we could represent the composition $S\circ T$ by

$$ \sum_i S_iT^i $$

where $S_i$ and $T^i$ are somehow the matrix components of these linear transformations. But the same questions remain: how is this correspondence made rigorous? How do we pass from function composition to multiplication and summation, and from a map $\mathbb{R}\to\mathbb{R}$ to a number in $\mathbb{R}$?


Note, I suspect I'm looking for an answer involving inserting projection matrices somewhere, or resolutions of the identity. But I can't figure out exactly what I need...

  • $D_{g(x)}f$ is a matrix and so is $D_xg\,.$ What is then the $\circ$ between them? About the other related issue: what do you get when you add the argument $x$ to (**)? That is: what is $(D_{g(x)}f\circ D_xg)(x)\,?$ Commented Feb 27, 2024 at 7:57
  • Here $D_{g(x)}f$ and $D_x g$ are total derivatives (en.wikipedia.org/wiki/Total_derivative), so they are linear maps. Related to, but slightly different than, matrices. That is what I'm struggling with. The $\circ$ is function composition. Related to, but again, slightly different than, matrix multiplication. Commented Feb 27, 2024 at 7:59
  • Linear maps between $\mathbb R^m$ and $\mathbb R^n$ are matrices and their composition is matrix multiplication. I recommend not to learn that stuff from Wikipedia. The best thing is to take lots of concrete examples and crunch those derivatives. A recent interesting post in this direction. Commented Feb 27, 2024 at 8:22
  • @KurtG. Linear maps are functions between $\mathbb{R}^m$ and $\mathbb{R}^n$, defined without needing a choice of basis. Linear maps can be represented by matrices assuming bases have been selected. The matrix changes if the basis does; the linear map doesn't. I'm trying to understand the subtleties of this representation. Commented Feb 27, 2024 at 8:29
  • You do not have to remind me of the basics from linear algebra. Have fun with those subtleties. Commented Feb 27, 2024 at 8:36

4 Answers


$ \newcommand\R{\mathbb R} $Two things:

  1. You have to recognize that the "total derivative" is best understood as a "differential". Meaning, if $F : \R \to \R$ then $F'(x)$ is not the total derivative at $x$. Instead, the total derivative is the linear map $$ D_xF(h) = F'(x)h,\quad h\in\R. $$ The same remark can be made about $g : \R \to \R^d$.
  2. Your desired chain rule is the inner product of the gradient of $f$ with the derivative of $g$: $$ (f\circ g)'(x) = \nabla f(g(x))\cdot g'(x). $$ The gradient can be defined by $$ \nabla f(x)\cdot h = D_xf(h),\quad h\in\R^d. $$

Putting these points together, we can see that (with $h \in \R$): $$ (f\circ g)'(x) = \nabla f(g(x))\cdot g'(x) $$$$ \iff (f\circ g)'(x)h = \nabla f(g(x))\cdot g'(x)h $$$$ \iff D_x[f\circ g](h) = D_{g(x)}f(g'(x)h) $$$$ \iff D_x[f\circ g](h) = D_{g(x)}f(D_xg(h)). $$
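
As a quick sanity check with illustrative choices (not from the question): take $f(u,v) = u^2 + v$ and $g(x) = (\sin x,\ x^3)$. Then $(f\circ g)(x) = \sin^2 x + x^3$, so $(f\circ g)'(x) = 2\sin x\cos x + 3x^2$, while $$ \nabla f(g(x))\cdot g'(x) = (2\sin x,\ 1)\cdot(\cos x,\ 3x^2) = 2\sin x\cos x + 3x^2, $$ as expected.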

  • Yes, your insight that $D_xF \not= F'(x)$ is good and one of the points that was confusing me. The rest of your answer is in line with what I was looking for as well. I am curious to see the full translation from $\nabla$ and $\cdot$ (dot product) notation into summation/index notation. I would appreciate it if that were included in this answer, but I think I can also work it out on my own without too much trouble when I have time. Commented Feb 27, 2024 at 16:28
  • @Jagerber48 It is exactly your equation ($*$), nothing more than that. Commented Feb 27, 2024 at 17:18
  • Or if you mean a full proof of this chain rule in index notation, then you essentially write out the "matrix multiplication" of the total derivative chain rule: $$\frac d{dx} = \sum_i\frac{dg^i}{dx}\frac\partial{\partial g^i}.$$ Commented Feb 27, 2024 at 17:23

Hint: Matrix multiplication is indeed the reason, once you choose the right basis to represent the total derivatives. Partial derivatives are a special case of directional derivatives, and directional derivatives are related to the total derivative.

Choose the canonical basis $(e_i)_{i = 1}^d$ for $\mathbb{R}^d$ (and the basis $\{1\}$ for $\mathbb{R}$). Let $M_y$ and $N_x$ be the matrices of $D_yf$ and $D_x g$ with respect to these bases.

Note that the partial derivative with respect to the $i$-th variable coincides with the directional derivative along $e_i$. Moreover, one of the possible characterizations of the total derivative of $f$ at $y$ is a linear map $D_y$ such that $f(y + v) = f(y) + D_y(v) + R(v)$, where $R(v)$ is a remainder term that vanishes faster than linearly: $\lim_{v \to 0} \frac{R(v)}{\lVert v\rVert} = 0$. So, putting everything together: $$\begin{aligned} \partial_i f(y_1, \ldots, y_d) &= \lim_{h \to 0} \frac{f(y_1, \ldots, y_i + h, \ldots, y_d) - f(y_1, \ldots, y_d)}{h} \\ &= \lim_{h \to 0} \frac{f(y + he_i) - f(y)}{h} = \lim_{h \to 0} \frac{hD_y(e_i) + R(he_i)}{h} = D_y(e_i). \end{aligned}$$

Now, due to the basis we have chosen, $D_y(e_i)$ is the $i$-th column of the matrix $M_y$. Therefore, the $i$-th partial derivative is just the $i$-th column (which consists of a single number) of $M_y$. Something similar also holds for $N_x$.

We can represent $D_x(f \circ g)$ by the $1\times 1$ matrix $\big[(f\circ g)'(x)\big]$. Now you just have to use the chain rule and the matrix multiplication of $M_{g(x)}$ and $N_x$ to get the result.
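
Sketching that final step explicitly: with respect to these bases, $M_{g(x)}$ is the $1\times d$ row of partial derivatives and $N_x$ is the $d\times 1$ column of components of $g'(x)$, so $$ \big[(f\circ g)'(x)\big] = M_{g(x)}N_x = \begin{pmatrix}\partial_1 f(g(x)) & \cdots & \partial_d f(g(x))\end{pmatrix} \begin{pmatrix}(g^1)'(x)\\ \vdots\\ (g^d)'(x)\end{pmatrix} = \left[\sum_{i=1}^d \partial_i f(g(x))\,(g^i)'(x)\right], $$ which is exactly the desired formula.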


$\newcommand{\ep}{\epsilon}$ $\newcommand{\R}{\mathbb{R}}$

I have an answer that I find the most straightforward, given the appropriate background.


Matrix Components

Suppose we have vector spaces $V, W$ with $\dim(V)=n$, $\dim(W)=m$, and suppose we have a linear transformation $T:V\to W$ ($T\in \hom(V,W)$). Suppose we have bases $\left\{e^{(V)}_i\right\}$ for $V$ and $\left\{\ep_{(W)}^i\right\}$ for $W^*$ (the dual space of $W$, the linear maps $W\to \R$, $\hom(W,\R)$).

We can find the matrix $[T]\in M_{m\times n}(\R)$ of $T$ by $$ [T]^i_j = \ep_{(W)}^i\left(T\left(e^{(V)}_j\right)\right) $$ We can think of the matrix $[T]$ as an object in its own right. It is a function that takes two numbers, $i, j$, and gives back the specific real number matrix component.

It is well known that if $T\in \hom(V,W)$ and $S\in \hom(W, U)$, with $\dim(U)=p$, then for $S\circ T \in \hom(V, U)$ we have $$ \tag{1} [S\circ T] = [S][T], $$ where matrix multiplication is defined by $$ \tag{2} \left(AB\right)^i_j = \sum_{k=1}^m A^i_kB^k_j $$ for $A \in M_{p\times m}(\R)$ and $B\in M_{m\times n}(\R)$.
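
In particular, for the linear maps $S:\R^d\to\R$ and $T:\R\to\R^d$ from the question, $[S]\in M_{1\times d}(\R)$ and $[T]\in M_{d\times 1}(\R)$, and $(1)$ together with $(2)$ gives $$ [S\circ T]^1_1 = \sum_{k=1}^d [S]^1_k\,[T]^k_1, $$ which is precisely the question's $\sum_i S_iT^i$ once we write $S_i = [S]^1_i$ and $T^i = [T]^i_1$.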


The Relationship Between Partial and Total Derivatives

For $f: \R^n \to \R^m$ we have the total derivative $D_xf: \R^n \to \R^m$. It is known that the components $[D_xf]^i_j$ of the matrix that represents $D_xf$ (with respect to the standard bases on $\R^n$ and $\R^m$) are the components of the partial derivatives of $f$: $$ \tag{3} (\partial_j f(x))^i = [D_xf]^i_j. $$ When we specialize to $n=m=1$ we get $$ (\partial_1 f(x))^1 = \partial f(x) = f'(x), $$ where I've used the nonstandard notation that $\partial f(x)$ means the ordinary derivative of $f$ when $f:\R\to\R$. The tricky thing here is that even for $f:\R\to\R$, it is not the case that $D_xf = f'(x)$: the former is a linear transformation $\R\to \R$ and the latter is a single number in $\R$. The relationship is $$ \tag{4} f'(x) = (\partial_1 f(x))^1 = [D_xf]^1_1. $$
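
For a concrete (illustrative) instance of $(3)$: take $f:\R^2\to\R$ with $f(x^1, x^2) = (x^1)^2 x^2$. Then $[D_xf]$ is the $1\times 2$ matrix $\begin{pmatrix} 2x^1x^2 & (x^1)^2 \end{pmatrix}$, and its $j$-th entry is indeed the (single-component) partial derivative $\partial_j f(x)$.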


The Total Derivative Chain Rule

It is well known that for $g: \R^n \to \R^m$ and $f: \R^m \to \R^p$ we have $$ \tag{5} D_x(f\circ g) = D_{g(x)}f \circ D_xg. $$


Putting it all together

We have that $D_x(f\circ g): \R\to \R$. We know from $(4)$ above that

$$ [D_x(f\circ g)]^1_1 = (f\circ g)'(x). $$ From the chain rule $(5)$ we know $$ [D_x(f\circ g)] = [D_{g(x)}f \circ D_x g], $$ and because matrix representation turns map composition into matrix products, as seen in $(1)$, we have $$ [D_{g(x)}f \circ D_x g] = [D_{g(x)}f][D_x g]. $$ Writing this out using the definition of matrix multiplication $(2)$, $$ [D_x(f\circ g)]^1_1 = \sum_{k=1}^m [D_{g(x)}f]^1_k[D_x g]^k_1, $$ and expanding with the partial-derivative relation $(3)$, \begin{align} [D_x(f\circ g)]^1_1 &= \sum_{k=1}^m (\partial_k f(g(x)))^1(\partial_1 g(x))^k\\ &= \sum_{k=1}^m \partial_k f(g(x)) (g'(x))^k. \end{align}

So we have proven $$ (f\circ g)'(x)= \sum_{k=1}^m \partial_k f(g(x)) (g'(x))^k $$


Vector calculus notation

Above we showed $$ [D_x(f\circ g)]^1_1 = \sum_{k=1}^m [D_{g(x)}f]^1_k[D_x g]^k_1. $$ In "standard" vector calculus notation we have \begin{align} [D_x g]^k_1 &= (g'(x))^k\\ [D_{g(x)}f]^1_k &= \partial_k f(g(x)) = (\nabla f(g(x)))_k. \end{align} The sum is then written as (recalling $[D_x(f\circ g)]^1_1 = (f\circ g)'(x)$) $$ (f\circ g)'(x) = \nabla f(g(x)) \cdot g'(x), $$ where $\cdot$ is the dot or inner product. This is another statement of the standard result.


$$ \newcommand{\R}{\mathbb{R}} \newcommand{\bv}[1]{\boldsymbol{#1}} $$

DISCLAIMER!

My other answer is more algebraically motivated and clearer in my opinion. This one churns through the algebra and gets the right answer but turns to components too early in the treatment. See my other answer.

Resolving the main confusion

The insight from Nicholos Todoroff in the other answer addresses one of my confusions. The key is that, while $f'(x)$ is a real number, $D_xf$ is a linear map $\mathbb{R}\to \mathbb{R}$; in some sense it is a dual vector on the real numbers, and it can be represented as a $1\times1$ matrix. However, there is a correspondence between these two objects: $$ \tag{1} D_xf(h) = f'(x)\cdot h, $$ where $h\in \R$ and $\cdot$ is ordinary multiplication in $\R$.
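
For an illustrative example of $(1)$: if $f(x) = x^2$, then $f'(3) = 6$ is a number, while $D_3f$ is the linear map $h\mapsto 6h$, i.e. the $1\times 1$ matrix $\begin{pmatrix} 6 \end{pmatrix}$ acting on $\R$.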

In the OP we seek an expression for $(f\circ g)'(x)$ derived using $D_x(f\circ g)$. Using $(1)$, we will, therefore, proceed by calculating $D_x(f\circ g)(h)$.


The Algebra

We will proceed more generally than the question requires, for purposes of illustration. Consider $$ f: \R^b \to \R^c\\ g: \R^a \to \R^b. $$ We will calculate $D_x(f\circ g)(h)$ for $h\in \R^a$: $$ D_x(f\circ g)(h) = \left(D_{g(x)}f \circ D_xg\right)(h) = D_{g(x)}f(D_x g(h)). $$ It is a famous result about total derivatives that $$ \tag{2} D_xg\left(e_i^{(a)}\right) = \partial_i g(x), $$ which we can apply to obtain $$ D_xg(h) = D_xg\left(\sum_{i=1}^a e^{(a)}_i h^i\right) = \sum_{i=1}^a (\partial_i g(x)) h^i, $$ where the $e_i^{(a)}$ form a basis of $\R^a$. Using this we can expand $$ D_{g(x)}f(D_x g(h)) = D_{g(x)}f\left(\sum_{i=1}^a (\partial_i g(x))h^i\right) = \sum_{i=1}^a D_{g(x)}f(\partial_i g(x)) h^i $$ by linearity. We then expand $\partial_i g(x) = \sum_{j=1}^b e_j^{(b)} (\partial_i g(x))^j$ so that $$ = \sum_{i=1}^a D_{g(x)}f\left(\sum_{j=1}^b e_j^{(b)} (\partial_i g(x))^j\right)h^i\\ = \sum_{i=1}^a \sum_{j=1}^b D_{g(x)}f\left(e_j^{(b)}\right) (\partial_i g(x))^j h^i= \sum_{i=1}^a \sum_{j=1}^b \partial_jf(g(x)) (\partial_i g(x))^j h^i. $$ To summarize, we have shown $$ D_x(f\circ g)(h) = \sum_{i=1}^a \sum_{j=1}^b \partial_jf(g(x)) (\partial_i g(x))^j h^i. $$

We now specialize to the question in the OP where $a=c=1$. In this case the $i$ index only ranges over a single value, so we can eliminate that summation and index: $$ \tag{3} D_x(f\circ g)(h) = \sum_{j=1}^b \partial_j f(g(x)) (\partial g(x))^j h = \left(\sum_{i=1}^b \partial_i f(g(x)) (g'(x))^i\right) h, $$ where I've renamed $\partial g(x) = g'(x)$ since $g$ is a function of a single variable and re-indexed $j\to i$. Now, comparing the forms of $(3)$ and $(1)$, we can identify $$ (f\circ g)'(x) = \sum_{i=1}^b \partial_i f(g(x)) (g'(x))^i $$ This is what we were hoping to show.


The Takeaway

The main breakthrough, again taken from the other answer, was to understand that the ordinary derivative $(f\circ g)'(x)$ equals the single entry of the $1\times 1$ matrix that represents $D_x(f\circ g)$ with respect to the standard basis on $\R$ (i.e. $e_1^{(1)} = 1$).


Some Further Notes

We re-express some of what is above using more standard "vector calculus" notation. Let us return to the general case with $$ D_x(f\circ g)(h) = \sum_{i=1}^a \sum_{j=1}^b \partial_jf(g(x)) (\partial_i g(x))^j h^i. $$ We can take the $k^{th}$ component of this expression to get $$ \left(D_x(f\circ g)(h)\right)^k = \sum_{i=1}^a \sum_{j=1}^b \left(\partial_jf(g(x))\right)^k (\partial_i g(x))^j h^i. $$ But we can identify the Jacobians $$ \left[J_f(g(x))\right]_j^k = \left(\partial_jf(g(x))\right)^k,\qquad \left[J_g(x)\right]_i^j = (\partial_i g(x))^j, $$ so that $$ \left(D_x(f\circ g)(h)\right)^k = \sum_{i=1}^a \sum_{j=1}^b \left[J_f(g(x))\right]_j^k \left[J_g(x)\right]_i^j h^i. $$ We see that the chain rule composition is represented by multiplication of the Jacobian matrices. In the case where $a=c=1$ we have $$ D_x(f\circ g)(h) = \sum_{j=1}^b \left[J_f(g(x))\right]_j \left[J_g(x)\right]^j h. $$ We can rewrite $$ \left[J_f(g(x))\right]_j = (\nabla f(g(x)))_j,\qquad \left[J_g(x)\right]^j = \left(g'(x)\right)^j. $$ That is, the Jacobian matrices degenerate into the gradient and the vector of derivative components. We get $$ D_x(f\circ g)(h) = \left(\nabla f(g(x)) \cdot g'(x)\right) h. $$ Again using $(1)$ we find $$ (f\circ g)'(x) = \nabla f(g(x)) \cdot g'(x). $$
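
As a concrete (illustrative) check of this Jacobian multiplication in the general case, take $g(x,y) = (x^2,\ y)$ and $f(u,v) = (u+v,\ uv)$, so that $(f\circ g)(x,y) = (x^2+y,\ x^2y)$. Then $$ J_f(u,v) = \begin{pmatrix} 1 & 1\\ v & u \end{pmatrix},\qquad J_g(x,y) = \begin{pmatrix} 2x & 0\\ 0 & 1 \end{pmatrix}, $$ and indeed $$ J_f(g(x,y))\,J_g(x,y) = \begin{pmatrix} 1 & 1\\ y & x^2 \end{pmatrix}\begin{pmatrix} 2x & 0\\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2x & 1\\ 2xy & x^2 \end{pmatrix} = J_{f\circ g}(x,y). $$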
