
In machine learning, it is typical to see a so-called weight matrix. As a low-dimensional example, let this matrix be defined as,

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$

Let $x \in \mathbb{R}^2$ (to match the $2 \times 2$ example) and let $\theta$ be some element-wise nonlinear function.

Then $L(W, x) = \dfrac{1}{2}\|\theta(Wx)\|_2^2$ is a simple toy example of the so-called loss function of a neural network.
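To make the setup concrete, here is a minimal numerical sketch of this toy loss, assuming $\theta = \tanh$ as one particular choice of element-wise nonlinearity (the specific numbers are arbitrary):

```python
# A minimal sketch of the toy loss above, with theta = tanh as one
# concrete (assumed) choice of element-wise nonlinearity.
import jax.numpy as jnp

def loss(W, x):
    z = W @ x                                # z = Wx
    return 0.5 * jnp.sum(jnp.tanh(z) ** 2)   # L = (1/2) * ||theta(z)||_2^2

W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
x = jnp.array([0.5, -1.0])
print(loss(W, x))
```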

The derivative of $L$ with respect to $W$ is of utmost importance. However, I am not quite clear on exactly how the chain rule works in this case.

Suppose we define the intermediate variable $z = Wx$.

Then the "chain rule" seems to suggest $$\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial z} \dfrac{\partial z}{\partial W}$$

Here, $z$ can be seen as a two-argument function, $z(W, x): \mathbb{R}^{2 \times 2} \times \mathbb{R}^2 \to \mathbb{R}^2$.

I am not quite clear on how the derivative is defined for this type of function, whether this chain rule is valid, or what $\dfrac{\partial z}{\partial W}$ actually is.

In some literature, I've seen this matrix $\dfrac{\partial z}{\partial W}$ called the "Jacobian" (which is what gives rise to the so-called Jacobian-vector product and vector-Jacobian product). However, from my limited understanding, the Jacobian is defined for a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, and this seems to be some kind of higher-dimensional Jacobian.
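For what it's worth, the vector-Jacobian product view can be made concrete without ever materializing $\dfrac{\partial z}{\partial W}$; the sketch below uses JAX's reverse-mode primitive as an illustration (the choice of library and the test values are just assumptions for the example):

```python
# Sketch of the vector-Jacobian product for z(W) = Wx with x held fixed.
# jax.vjp never forms dz/dW explicitly: it returns a pullback that maps a
# cotangent vector v in R^2 to an object with the same 2x2 shape as W.
import jax
import jax.numpy as jnp

W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
x = jnp.array([0.5, -1.0])

z, pullback = jax.vjp(lambda W: W @ x, W)    # z = Wx and its pullback
v = jnp.array([1.0, 0.0])                    # an arbitrary cotangent vector
(vjp_W,) = pullback(v)                       # shape (2, 2)
print(jnp.allclose(vjp_W, jnp.outer(v, x)))  # True: since z is linear in W,
                                             # the pullback of v is v x^T
```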

Can someone provide some guidance on how to properly define $\dfrac{\partial z}{\partial W}$ (and on whether the chain rule works in this scenario)?

Any reference helps!

  • Hmm. So $\theta$ is $\mathbb{R}^2 \rightarrow \mathbb{R}^2$? Or does it have real values? -- The chain rule is always true :-). Commented Sep 26 at 16:20
  • Let $g: \mathbb{R}^{2 \times 2}\times \mathbb{R}^2\rightarrow \mathbb{R}^2$ be given by $g(W,x)=Wx$. Let $\theta:\mathbb{R}^2\rightarrow \mathbb{R}^2$ be as in your question, and let $f:\mathbb{R}^2\rightarrow \mathbb{R}$ be given by $f(x)=|x|^2/2$. Then you are interested in the partial derivatives of the composite $f\circ \theta \circ g$, which you find via the chain rule. Commented Sep 26 at 16:43
  • Where did you get these formulas? They don't match the actual behavior of a neural network. Commented Sep 26 at 18:51
  • @Digitallis $x$ is the input, $W$ is the weight, $\theta$ is the nonlinearity, and the 2-norm is the loss (targets are zero for convenience). Commented Sep 28 at 18:40
  • See this question. At the end of the answer it says: the standard text for this stuff is Matrix Differential Calculus by Magnus & Neudecker. Commented Sep 29 at 15:23

2 Answers


If you're very used to thinking of the derivative as a linear transformation (which is the default viewpoint for mathematicians), then this is a nice way to do the calculation.

Suppose $f:\mathbb R^{m \times n} \to \mathbb R$ is defined by $f(W) = g(Wx)$, where $x \in \mathbb R^n$ is a given vector and $g:\mathbb R^m \to \mathbb R$ is a smooth function. Then $f = g \circ h$, where $h: \mathbb R^{m \times n} \to \mathbb R^m$ is the linear transformation defined by $h(W) = Wx$. By the chain rule,
$$ Df(W) = Dg(h(W)) \circ Dh(W). $$
If $u \in \mathbb R^{m \times n}$ then, using the fact that $h$ is linear (so that $Dh(W)(u) = h(u)$),
\begin{align*} Df(W)(u) &= Dg(h(W))(Dh(W)(u)) \\ &= Dg(h(W))(h(u)) \\ &= g'(Wx)\, ux \\ &= \text{trace}(\nabla g(Wx)^T u x) \\ &= \text{trace}(x \nabla g(Wx)^T u). \end{align*}
Comparing this with
$$ Df(W)(u) = \text{trace}(\nabla f(W) u) $$
reveals that
$$ \nabla f(W) = x \nabla g(Wx)^T. $$
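As a sanity check (not part of the derivation above), the closed form can be compared against automatic differentiation, assuming the concrete choice $g(y) = \tfrac12\|\tanh(y)\|_2^2$. Note that the trace convention used here produces an $n \times m$ array; libraries such as JAX return the gradient in the same shape as $W$, i.e. the transpose $\nabla g(Wx)\, x^T$.

```python
# Numerical check of nabla f(W) = x nabla g(Wx)^T, assuming
# g(y) = 0.5 * ||tanh(y)||^2 as a concrete smooth g.
import jax
import jax.numpy as jnp

g = lambda y: 0.5 * jnp.sum(jnp.tanh(y) ** 2)
x = jnp.array([0.5, -1.0, 2.0])                     # n = 3
W = jnp.array([[1.0, 2.0, 0.0],
               [0.0, -1.0, 1.0]])                   # m = 2, so W is 2x3

f = lambda W: g(W @ x)

grad_g = jax.grad(g)(W @ x)                         # nabla g(Wx), shape (2,)
closed_form = jnp.outer(x, grad_g)                  # x nabla g(Wx)^T, shape (3, 2)

# jax.grad returns an array shaped like W (2x3); under the trace
# convention above, that array is the transpose of x nabla g(Wx)^T.
print(jnp.allclose(closed_form.T, jax.grad(f)(W)))  # True
```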


You are correct to identify this as the Jacobian. However, I suspect it is a Jacobian of a different function than the one you are thinking of.

The function $z(W, x): \mathbb{R}^{2\times 2}\times\mathbb{R}^2 \to \mathbb{R}^2$ is a function of two arguments. Usually, when taking a Jacobian, you consider $W$ to be fixed, so you get a function of a single variable, $z(x) : \mathbb{R}^2 \to \mathbb{R}^2$. The Jacobian of this function is a $2\times 2$ matrix.

However, in your case you are not treating $W$ as fixed but $x$ as fixed, so you get a different function of a single variable, $z(W) :\mathbb{R}^{2\times 2} \to \mathbb{R}^2$. The simplest way to think about the derivative of this function is to flatten the $2\times 2$ matrix into a vector in $\mathbb{R}^4$, giving a linear function $\mathbb{R}^{4} \to \mathbb{R}^2$. Thus we have returned to the familiar world of vector-valued functions from $\mathbb{R}^n$ to $\mathbb{R}^m$.

If you really want to preserve the $2\times 2$ structure of $W$, you have to think of $\dfrac{\partial z}{\partial W}$ as a 3-dimensional tensor instead.
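Both viewpoints are easy to inspect numerically; the sketch below (an illustration only, using JAX and arbitrary test values) shows the $2\times 2\times 2$ tensor you get when $W$ keeps its matrix shape, and the equivalent $2\times 4$ Jacobian after flattening, which for this linear map equals $I_2 \otimes x^T$ under row-major flattening.

```python
# The derivative of z(W) = Wx (x fixed) in both forms discussed above.
import jax
import jax.numpy as jnp

x = jnp.array([0.5, -1.0])
W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])

# Keeping the 2x2 structure of W: dz/dW is a 3-dimensional tensor.
jac_tensor = jax.jacobian(lambda W: W @ x)(W)
print(jac_tensor.shape)                                  # (2, 2, 2)

# Flattening W (row-major) into a vector in R^4: an ordinary 2x4 Jacobian,
# equal to kron(I_2, x^T) because the map is linear.
jac_flat = jax.jacobian(lambda w: w.reshape(2, 2) @ x)(W.reshape(-1))
print(jnp.allclose(jac_flat, jnp.kron(jnp.eye(2), x)))   # True
```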

