$\begingroup$

Let $g: \mathbb{R}^{N_\ell \times N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $\;g(W) = Wa$,
be the function that takes a matrix as its argument and multiplies it by a fixed vector $a \in \mathbb{R}^{N_{\ell-1}}$.

Let $h: \mathbb{R}^{N_\ell} \to \mathbb{R}$ be a differentiable function.

I want to differentiate the composition $h \circ g$ with respect to the matrix $W$, so I differentiate with respect to each of its components. I want to use the total derivative of h, and my intuition says that

$\frac{\partial}{\partial W_{j,i}}(h \circ g) = Dh \, \dfrac{\partial g}{\partial W_{j,i}}$, where $Dh$ is the total derivative of $h$ (evaluated at $g(W)$), and $\dfrac{\partial g}{\partial W_{j,i}}$ is the partial derivative of $g$ with respect to the entry of $W$ in row $j$, column $i$.

My questions are: is my intuition correct? If so, why is it? (I'm familiar with the chain rule of total derivatives, but I've never seen it mixed with partial derivatives)

$\endgroup$

1 Answer

$\begingroup$

Yes, your intuition is correct. You can do the same calculation in several ways.

In index notation.

Writing $n=N_\ell$ and $m=N_{\ell-1}$, $$\begin{align} \partial_{ij}(h\circ g) &= \sum_k (\partial_k h\circ g)\partial_{ij} g_k \\ &= \begin{bmatrix} \partial_1h\circ g & \cdots & \partial_nh\circ g \end{bmatrix} \begin{bmatrix} \partial_{ij}g_1 \\ \vdots \\ \partial_{ij}g_n \\ \end{bmatrix} \end{align}$$ As you wanted.

Now, for your specific problem since $g_k(W)=\sum_r W_{kr}a_r$, you have $\partial_{ij}g_k=\delta_{ik}a_j$, so your derivative is $$\begin{align} \partial_{ij}(h\circ g) &= \sum_k (\partial_k h\circ g)\partial_{ij} g_k \\ &= \sum_k (\partial_k h\circ g)\delta_{ik}a_j \\ &= a_j(\partial_i h\circ g). \end{align}$$
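If you want to convince yourself numerically, here is a minimal sketch in plain Python that compares $a_j(\partial_i h\circ g)$ with central finite differences of $h(Wa)$. The particular $h$, the matrix `W`, the vector `a`, and all function names are illustrative assumptions, not something taken from the question.

```python
import math

def matvec(W, a):
    """Return W a for a list-of-rows matrix W and a list a."""
    return [sum(w_kr * a_r for w_kr, a_r in zip(row, a)) for row in W]

def h(y):
    """Illustrative h: R^2 -> R (an assumption, any differentiable h works)."""
    return math.sin(y[0]) + math.sin(y[1]) + y[0] * y[1]

def grad_h(y):
    """Partial derivatives of the h above, worked out by hand."""
    return [math.cos(y[0]) + y[1], math.cos(y[1]) + y[0]]

def analytic_partials(W, a):
    """Entry (i, j) is a_j * (d_i h)(W a), the index-notation result."""
    dh = grad_h(matvec(W, a))
    return [[a_j * dh_i for a_j in a] for dh_i in dh]

def numeric_partials(W, a, eps=1e-6):
    """Central finite differences of h(W a) with respect to each W_ij."""
    out = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            Wp = [r[:] for r in W]
            Wm = [r[:] for r in W]
            Wp[i][j] += eps
            Wm[i][j] -= eps
            row.append((h(matvec(Wp, a)) - h(matvec(Wm, a))) / (2 * eps))
        out.append(row)
    return out

W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]  # n = 2 rows, m = 3 columns
a = [1.0, 2.0, -1.0]                        # a has m = 3 entries

A = analytic_partials(W, a)
N = numeric_partials(W, a)
max_err = max(abs(A[i][j] - N[i][j]) for i in range(2) for j in range(3))
print(max_err)  # should be tiny compared with the entries of A
```

The two tables have the same shape as $W$, and the largest discrepancy should be at the level of finite-difference error.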

In differential notation

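This is also nice. Take $\nabla h$ as a row vector and $a$ as a column vector, and read $A:B$ as $\operatorname{tr}(AB)$ (the reading under which every step below is dimensionally consistent). Then $$\begin{align} d(h(Wa)) &= \nabla h(Wa):d(Wa) \\ &= \nabla h(Wa):dWa \\ &= dWa:\nabla h(Wa) \\ &= dW:a\nabla h(Wa) \end{align}$$ so that $$ \frac{dh(Wa)}{dW} = a\nabla h(Wa). $$ This is an $m\times n$ matrix whose $(j,i)$ entry is $a_j(\partial_i h\circ g)$, so it holds the same numbers as the index computation, stored in the transposed position.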
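As a quick worked example (the choice of $h$ here is my own assumption, picked only because its gradient is easy to read off), take $h(y)=\tfrac12\lVert y\rVert^2$, so that $\nabla h(y)=y^T$ as a row vector. Then
$$\begin{align} \frac{dh(Wa)}{dW} &= a\,\nabla h(Wa) \\ &= a\,(Wa)^T \\ &= a\,a^T W^T. \end{align}$$
The matrix $a\,a^TW^T$ is $m\times n$. Its transpose $W a\,a^T$ has the same shape as $W$, and its $(i,j)$ entry is $(Wa)_i\,a_j = a_j\,\partial_i h(Wa)$, which matches the index-notation formula.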

By definition of the derivative

The derivative of a function $f$ at $W$ is a linear operator $Df(W)$ such that, to first order, $$ f(W+H) - f(W) = Df(W)(H) $$ for any infinitesimal matrix $H$. You can also state this with limits or with big-$O$ notation, but this form is lighter on notation. Now take $f(W)=h(Wa)$. Ignoring terms of higher order in $H$, we have $$ \begin{align} f(W+H) - f(W) &= -h(Wa) + h(Wa+Ha) \\ &= -h(Wa) + h(Wa) + Dh(Wa)(Ha) \\ &= Dh(Wa)(Ha) \\ &= \nabla h(Wa)Ha \\ &= \nabla h(Wa):Ha \\ &= Ha:\nabla h(Wa) \\ &= H:a\nabla h(Wa) \\ &= a\nabla h(Wa):H. \end{align} $$ The term $a\nabla h(Wa):H$ is linear in $H$, so it must be $Df(W)(H)$: $$ Df(W)(H) = a\nabla h(Wa):H. $$
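To see the "ignoring higher-order terms" step in action, here is a small plain-Python sketch that checks that the remainder $f(W+H)-f(W)-\nabla h(Wa)Ha$ shrinks quadratically as $H$ is scaled down. The polynomial $h$, the matrices `W` and `H0`, and the helper names are all assumptions made for illustration.

```python
def matvec(W, a):
    """Return W a for a list-of-rows matrix W and a list a."""
    return [sum(w_kr * a_r for w_kr, a_r in zip(row, a)) for row in W]

def h(y):
    """Illustrative h: R^2 -> R (an assumption)."""
    return y[0] ** 2 * y[1] + y[1] ** 3

def grad_h(y):
    """Partial derivatives of the h above, worked out by hand."""
    return [2 * y[0] * y[1], y[0] ** 2 + 3 * y[1] ** 2]

def f(W, a):
    """The composition f(W) = h(W a)."""
    return h(matvec(W, a))

def linear_term(W, a, H):
    """Candidate derivative Df(W)(H) = grad h(W a) . (H a)."""
    dh = grad_h(matvec(W, a))
    Ha = matvec(H, a)
    return sum(d * x for d, x in zip(dh, Ha))

W = [[1.0, 2.0, 0.0], [-1.0, 0.5, 1.0]]
a = [1.0, -1.0, 2.0]
H0 = [[0.3, -0.2, 0.1], [0.4, 0.1, -0.5]]  # fixed direction, scaled by t below

def remainder(t):
    """|f(W + tH0) - f(W) - Df(W)(tH0)|, which should behave like t**2."""
    H = [[t * x for x in row] for row in H0]
    Wt = [[w + x for w, x in zip(rw, rh)] for rw, rh in zip(W, H)]
    return abs(f(Wt, a) - f(W, a) - linear_term(W, a, H))

r1 = remainder(1e-2)
r2 = remainder(1e-3)
ratio = r1 / r2  # shrinking H by 10 should shrink the remainder by about 100
print(r1, r2, ratio)
```

A ratio near 100 is what you expect when the leftover is second order in $H$, which is exactly the sense in which the linear term is the derivative.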

$\endgroup$
