In machine learning, it is typical to see a so-called weight matrix. As a low-dimensional example, let this matrix be defined as,
$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$
Let $x \in \mathbb{R}^2$ and let $\theta$ be some element-wise nonlinear function.
Then $L(W, x) = \dfrac{1}{2}\|\theta(Wx)\|_2^2$ is a simple toy example of the loss function of a neural network.
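For concreteness, here is a tiny numerical sketch of this loss (the specific values of $W$ and $x$, and the choice $\theta = \tanh$, are just placeholders I picked; any element-wise nonlinearity would do):

```python
import numpy as np

# Concrete instance of the toy loss, with theta = tanh as an assumed placeholder.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # plays the role of the 2x2 weight matrix
x = np.array([0.5, -1.0])    # x in R^2

theta = np.tanh              # assumed element-wise nonlinearity

z = W @ x                            # z = Wx
L = 0.5 * np.sum(theta(z) ** 2)      # L(W, x) = (1/2) * ||theta(Wx)||_2^2
print(L)
```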
The derivative of $L$ with respect to $W$ is of utmost importance. However, I am not quite clear on exactly how the chain rule works in this case.
Suppose we define the variable $z = Wx$.
Then the "chain rule" seems to suggest $$\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial z} \dfrac{\partial z}{\partial W}$$
Here, $z$ can be seen as a two-argument function, $z(W, x): \mathbb{R}^{2 \times 2} \times \mathbb{R}^2 \to \mathbb{R}^2$.
I am not quite clear on how the derivative is defined for this type of function, I'm not sure whether this chain rule works, and I'm also curious what $\dfrac{\partial z}{\partial W}$ is.
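What I can do is compute the partials numerically. Here is a finite-difference sketch I tried (reusing the made-up $W$, $x$, and $\theta = \tanh$ from above): since $L$ is scalar-valued, the partials $\partial L / \partial W_{ij}$ fit naturally into a $2 \times 2$ array with the same shape as $W$.

```python
import numpy as np

def loss(W, x, theta=np.tanh):
    """L(W, x) = (1/2) * ||theta(Wx)||_2^2, with theta = tanh assumed."""
    z = W @ x
    return 0.5 * np.sum(theta(z) ** 2)

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.0])
eps = 1e-6

# dL/dW: one scalar partial derivative per entry W[i, j],
# so the result has the same 2x2 shape as W itself.
dL_dW = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        W_pert = W.copy()
        W_pert[i, j] += eps
        dL_dW[i, j] = (loss(W_pert, x) - loss(W, x)) / eps

print(dL_dW)   # 2x2 array of partials
```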
In some literature, I've seen that this object $\dfrac{\partial z}{\partial W}$ is called the "Jacobian" (thereby defining what is known as the Jacobian-vector product, or vector-Jacobian product). However, from my limited understanding, the Jacobian is defined for a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, and this seems to be some kind of higher-dimensional Jacobian.
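To make that concrete: when I tabulate the partials $\partial z_k / \partial W_{ij}$ numerically (same assumed $W$, $x$, and step size as above), I get a $2 \times 2 \times 2$ array rather than a matrix, which is exactly what makes me suspect this is some higher-order object.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.0])
eps = 1e-6

# dz/dW: each component z[k] has a partial w.r.t. each entry W[i, j],
# so the collection of partials is naturally a 2x2x2 array, indexed [k, i, j].
dz_dW = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        W_pert = W.copy()
        W_pert[i, j] += eps
        dz_dW[:, i, j] = ((W_pert @ x) - (W @ x)) / eps

print(dz_dW.shape)   # (2, 2, 2): not an ordinary m x n Jacobian matrix
```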
Can someone provide some guidance on how to properly define $\dfrac{\partial z}{\partial W}$ (and whether the chain rule works in this scenario)?
Any references would also help!