
In machine learning, it is typical to see a so-called weight matrix. As a low-dimensional example, let this matrix be defined as,

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$

Let $x \in \mathbb{R}^2$ (to match the $2 \times 2$ example) and let $\theta$ be some element-wise nonlinear function.

Then $L(W, x) = \dfrac{1}{2}\|\theta(Wx)\|_2^2$ is a simple toy example of the so-called loss function of a neural network.
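To make the setup concrete, here is a minimal numerical sketch of this toy loss, assuming $\theta = \tanh$ as one particular choice of element-wise nonlinearity (the specific numbers are arbitrary):

```python
# A minimal sketch of the toy loss above, with theta = tanh as one
# concrete (assumed) choice of element-wise nonlinearity.
import jax.numpy as jnp

def loss(W, x):
    z = W @ x                                # z = Wx
    return 0.5 * jnp.sum(jnp.tanh(z) ** 2)   # L = (1/2) * ||theta(z)||_2^2

W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
x = jnp.array([0.5, -1.0])
print(loss(W, x))
```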

The derivative of $L$ with respect to $W$ is of utmost importance. However, I am not quite clear on exactly how the chain rule works in this case.

Suppose we define the intermediate variable $z = Wx$.

Then the "chain rule" seems to suggest $$\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial z} \dfrac{\partial z}{\partial W}$$

Here, $z$ can be seen as a two-argument function, $z(W, x): \mathbb{R}^{2 \times 2} \times \mathbb{R}^2 \to \mathbb{R}^2$.

I am not quite clear on how the derivative is defined for this type of function, whether this chain rule is valid, or what $\dfrac{\partial z}{\partial W}$ actually is.

In some literature, I've seen this matrix $\dfrac{\partial z}{\partial W}$ called the "Jacobian" (which is what gives rise to the so-called Jacobian-vector product and vector-Jacobian product). However, from my limited understanding, the Jacobian is defined for a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, and this seems to be some kind of higher-dimensional Jacobian.
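For what it's worth, the vector-Jacobian product view can be made concrete without ever materializing $\dfrac{\partial z}{\partial W}$; the sketch below uses JAX's reverse-mode primitive as an illustration (the choice of library and the test values are just assumptions for the example):

```python
# Sketch of the vector-Jacobian product for z(W) = Wx with x held fixed.
# jax.vjp never forms dz/dW explicitly: it returns a pullback that maps a
# cotangent vector v in R^2 to an object with the same 2x2 shape as W.
import jax
import jax.numpy as jnp

W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
x = jnp.array([0.5, -1.0])

z, pullback = jax.vjp(lambda W: W @ x, W)    # z = Wx and its pullback
v = jnp.array([1.0, 0.0])                    # an arbitrary cotangent vector
(vjp_W,) = pullback(v)                       # shape (2, 2)
print(jnp.allclose(vjp_W, jnp.outer(v, x)))  # True: since z is linear in W,
                                             # the pullback of v is v x^T
```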

Can someone provide some guidance on how to properly define $\dfrac{\partial z}{\partial W}$ (and on whether the chain rule works in this scenario)?

Any reference helps!

  • Hmm. So $\theta$ is $\mathbb{R}^2 \rightarrow \mathbb{R}^2$? Or does it have real values? -- The chain rule is always true :-). Commented Sep 26 at 16:20
  • Let $g: \mathbb{R}^{2 \times 2}\times \mathbb{R}^2\rightarrow \mathbb{R}^2$ be given by $g(W,x)=Wx$. Let $\theta:\mathbb{R}^2\rightarrow \mathbb{R}^2$ be as in your question, and let $f:\mathbb{R}^2\rightarrow \mathbb{R}$ be given by $f(x)=|x|^2/2$. Then you are interested in the partial derivatives of the composite $f\circ \theta \circ g$, which you find via the chain rule. Commented Sep 26 at 16:43
  • Where did you get these formulas? They don't match the actual behavior of a neural network. Commented Sep 26 at 18:51
  • @Digitallis $x$ is the input, $W$ is the weight, $\theta$ is the nonlinearity, and the 2-norm is the loss (targets are zero for convenience). Commented Sep 28 at 18:40
  • See this question. At the end of the answer it says: the standard text for this stuff is Matrix Differential Calculus by Magnus & Neudecker. Commented Sep 29 at 15:23

2 Answers


If you're very used to thinking of the derivative as a linear transformation (which is the default viewpoint for mathematicians), then this is a nice way to do the calculation.

Suppose $f:\mathbb R^{m \times n} \to \mathbb R$ is defined by $f(W) = g(Wx)$, where $x \in \mathbb R^n$ is a given vector and $g:\mathbb R^m \to \mathbb R$ is a smooth function. Then $f = g \circ h$, where $h: \mathbb R^{m \times n} \to \mathbb R^m$ is the linear transformation defined by $h(W) = Wx$. By the chain rule,
$$ Df(W) = Dg(h(W)) \circ Dh(W). $$
If $u \in \mathbb R^{m \times n}$ then, using the fact that $h$ is linear (so that $Dh(W)(u) = h(u)$),
\begin{align*} Df(W)(u) &= Dg(h(W))(Dh(W)(u)) \\ &= Dg(h(W))(h(u)) \\ &= g'(Wx)\, ux \\ &= \text{trace}(\nabla g(Wx)^T u x) \\ &= \text{trace}(x \nabla g(Wx)^T u). \end{align*}
Comparing this with
$$ Df(W)(u) = \text{trace}(\nabla f(W) u) $$
reveals that
$$ \nabla f(W) = x \nabla g(Wx)^T. $$
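As a sanity check (not part of the derivation above), the closed form can be compared against automatic differentiation, assuming the concrete choice $g(y) = \tfrac12\|\tanh(y)\|_2^2$. Note that the trace convention used here produces an $n \times m$ array; libraries such as JAX return the gradient in the same shape as $W$, i.e. the transpose $\nabla g(Wx)\, x^T$.

```python
# Numerical check of nabla f(W) = x nabla g(Wx)^T, assuming
# g(y) = 0.5 * ||tanh(y)||^2 as a concrete smooth g.
import jax
import jax.numpy as jnp

g = lambda y: 0.5 * jnp.sum(jnp.tanh(y) ** 2)
x = jnp.array([0.5, -1.0, 2.0])                     # n = 3
W = jnp.array([[1.0, 2.0, 0.0],
               [0.0, -1.0, 1.0]])                   # m = 2, so W is 2x3

f = lambda W: g(W @ x)

grad_g = jax.grad(g)(W @ x)                         # nabla g(Wx), shape (2,)
closed_form = jnp.outer(x, grad_g)                  # x nabla g(Wx)^T, shape (3, 2)

# jax.grad returns an array shaped like W (2x3); under the trace
# convention above, that array is the transpose of x nabla g(Wx)^T.
print(jnp.allclose(closed_form.T, jax.grad(f)(W)))  # True
```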


You are correct to identify this as the Jacobian. However, I suspect it is a Jacobian of a different function than the one you are thinking of.

The function $z(W, x): \mathbb{R}^{2\times 2}\times\mathbb{R}^2 \to \mathbb{R}^2$ is a function of two arguments. Usually, when taking a Jacobian, you consider $W$ to be fixed, so you get a function of a single variable, $z(x) : \mathbb{R}^2 \to \mathbb{R}^2$. The Jacobian of this function is a $2\times 2$ matrix.

However, in your case you are not treating $W$ as fixed but $x$ as fixed, so you get a different function of a single variable, $z(W) :\mathbb{R}^{2\times 2} \to \mathbb{R}^2$. The simplest way to think about the derivative of this function is to flatten the $2\times 2$ matrix into a vector in $\mathbb{R}^4$, giving a linear function $\mathbb{R}^{4} \to \mathbb{R}^2$. Thus we have returned to the familiar world of vector-valued functions from $\mathbb{R}^n$ to $\mathbb{R}^m$.

If you really want to preserve the $2\times 2$ structure of $W$, you have to think of $\dfrac{\partial z}{\partial W}$ as a 3-dimensional tensor instead.
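Both viewpoints are easy to inspect numerically; the sketch below (an illustration only, using JAX and arbitrary test values) shows the $2\times 2\times 2$ tensor you get when $W$ keeps its matrix shape, and the equivalent $2\times 4$ Jacobian after flattening, which for this linear map equals $I_2 \otimes x^T$ under row-major flattening.

```python
# The derivative of z(W) = Wx (x fixed) in both forms discussed above.
import jax
import jax.numpy as jnp

x = jnp.array([0.5, -1.0])
W = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])

# Keeping the 2x2 structure of W: dz/dW is a 3-dimensional tensor.
jac_tensor = jax.jacobian(lambda W: W @ x)(W)
print(jac_tensor.shape)                                  # (2, 2, 2)

# Flattening W (row-major) into a vector in R^4: an ordinary 2x4 Jacobian,
# equal to kron(I_2, x^T) because the map is linear.
jac_flat = jax.jacobian(lambda w: w.reshape(2, 2) @ x)(W.reshape(-1))
print(jnp.allclose(jac_flat, jnp.kron(jnp.eye(2), x)))   # True
```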

