
I am trying to understand the gradients of backpropagation through time for a simple recurrent neural network. In particular this one: https://arxiv.org/abs/1211.5063 (Section 1.1)

(Also here: https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf and in other blog posts that just refer to the chain rule.)

EDIT: also in https://www.deeplearningbook.org/contents/rnn.html, page 379.

The update for the "hidden state" is: $$\mathbf{x}_t=\mathbf{W}\sigma(\mathbf{x}_{t-1}) + \mathbf{W}_{\mathrm{in}}\mathbf{u}_t + \mathbf{b}$$

Calculating the gradients makes sense to me up to the step $$\frac{\partial \mathbf{x}_i}{\partial\mathbf{x}_{i-1}} = \mathbf{W}^\top diag(\sigma'(\mathbf{x}_{i-1}))$$

Why is it $\mathbf{W}^\top$? I tried to reproduce this for a simple 2×2 example, but I get a different result:

$$\mathbf{y} = \mathbf{W}\sigma(\mathbf{x})=\begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2}\end{bmatrix}\begin{bmatrix} \sigma(x_1) \\ \sigma(x_2)\end{bmatrix} = \begin{bmatrix} w_{1,1}\sigma(x_1) + w_{1,2}\sigma(x_2)\\ w_{2,1}\sigma(x_1) + w_{2,2}\sigma(x_2)\end{bmatrix}=\begin{bmatrix} y_1 \\ y_2\end{bmatrix}$$

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2}\end{bmatrix} = \begin{bmatrix} w_{1,1}\sigma'(x_1) & w_{1,2}\sigma'(x_2)\\ w_{2,1}\sigma'(x_1) & w_{2,2}\sigma'(x_2)\end{bmatrix} = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2}\end{bmatrix}\begin{bmatrix} \sigma'(x_1) & 0 \\ 0 & \sigma'(x_2)\end{bmatrix}=\mathbf{W}\, diag(\sigma'(\mathbf{x}))$$

This is also what I would expect from the chain rule (with $\mathbf{z}=\sigma(\mathbf{x})$): $$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathbf{W}\frac{\partial \mathbf{z}}{\partial \mathbf{x}} =\mathbf{W}\, diag(\sigma'(\mathbf{x}))$$
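For what it's worth, a quick finite-difference check agrees with my hand calculation. This is a minimal numpy sketch, assuming $\sigma=\tanh$; the random $\mathbf{W}$, $\mathbf{W}_{\mathrm{in}}$, $\mathbf{b}$, $\mathbf{u}$ are arbitrary test values I made up, not anything taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
W, W_in = rng.normal(size=(n, n)), rng.normal(size=(n, m))
b, u = rng.normal(size=n), rng.normal(size=m)

sigma = np.tanh

def dsigma(x):
    # derivative of tanh
    return 1.0 - np.tanh(x) ** 2

def step(x_prev):
    """One hidden-state update: x_t = W sigma(x_{t-1}) + W_in u_t + b."""
    return W @ sigma(x_prev) + W_in @ u + b

x_prev = rng.normal(size=n)

# Analytic Jacobian in numerator layout: entry (i, j) = d x_t[i] / d x_{t-1}[j].
J_analytic = W @ np.diag(dsigma(x_prev))

# Numerical Jacobian via central differences, one column per input coordinate.
eps = 1e-6
J_numeric = np.column_stack([
    (step(x_prev + eps * e) - step(x_prev - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True
print(np.allclose(W.T @ np.diag(dsigma(x_prev)), J_numeric, atol=1e-6))  # False here
```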

What am I missing / doing wrong here?

Thank you for your help!


1 Answer


I believe you're right. The paper seems to use the numerator layout, since its chain rule expands to the right, and that is the same convention as in your calculation. So, for example, if the equation were $x=Wz$ and we were interested in $\frac{\partial Wz}{\partial z}$, the answer would be $W$, not $W^T$. You can look at the third entry here. This simple example assumes $\sigma(z)=z$.
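A quick way to convince yourself of the $x=Wz$ case is to build the Jacobian numerically; a small numpy sketch with arbitrary test values (nothing here comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2))   # arbitrary, non-symmetric test matrix
z = rng.normal(size=2)

# Numerator-layout Jacobian of z -> W z, built column by column
# with central differences: column j = d(Wz) / dz_j.
eps = 1e-6
J = np.column_stack([
    (W @ (z + eps * e) - W @ (z - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(J, W, atol=1e-6))    # True: d(Wz)/dz = W
print(np.allclose(J, W.T, atol=1e-6))  # False unless W happens to be symmetric
```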

I'm pretty surprised that the paper's analysis carries this through incorrectly. Nevertheless, that version still shows essentially the same thing about exploding/vanishing gradients, since $\mathbf{W}$ and $\mathbf{W}^\top$ have the same singular values and hence the same norm.

