My understanding is that you have posted the last equation on p. 19 in the dissertation (please correct me if I'm wrong).
The derivation is indeed for a single LSTM layer, since e_k(t) is the error at the output (the assumed network topology is also mentioned just below Eq. (2.8) in the dissertation):
> Finally, assuming a layered network topology with a standard input layer, a hidden layer consisting of memory blocks, and a standard output layer [...]
Let's assume now that there are two stacked LSTM layers. In this case, the derivation in the dissertation applies to the last hidden LSTM layer (the one just below the output layer), where Eqs. (3.10) -- (3.12) give you the partial derivatives for the weights at each gate of a cell in that layer. To derive the deltas for the hidden LSTM layer below it, you compute the partial derivatives with respect to the portions of the net_{c_j^v}(t), net_{in_j}(t) and net_{f_j}(t) terms that correspond to the outputs of that preceding hidden LSTM layer, and then use those in the same way you used e_k(t) for the current LSTM layer (see the sketch below). Underneath all that unappealing notation is just the usual multi-layer backprop rule (well, with the truncated RTRL twist). If you stack more LSTM layers, just keep propagating the errors further down through the respective gates until you reach the input layer.
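To make that chain rule step explicit, here is a sketch in notation that only approximates the dissertation's (the weights w_{·m} and the index m over lower-layer outputs are my own labels, and note that the lower layer's outputs generally feed the output gates too, hence the last term). The error reaching output m of the lower LSTM layer at time t is

$$ \delta_m(t) = \sum_{j,v} \frac{\partial E}{\partial net_{c_j^v}(t)}\, w_{c_j^v m} + \sum_j \left[ \frac{\partial E}{\partial net_{in_j}(t)}\, w_{in_j m} + \frac{\partial E}{\partial net_{f_j}(t)}\, w_{f_j m} + \frac{\partial E}{\partial net_{out_j}(t)}\, w_{out_j m} \right], $$

where the sums run over the blocks j (and the cells v within block j) of the upper LSTM layer. This \delta_m(t) then plays exactly the role that e_k(t) played for the top layer.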
For a slightly more intuitive explanation, if you look at Fig. 2.1 in the dissertation, you can assume that in a multi-layered LSTM the IN in fact includes the OUT of the preceding LSTM layer.
Edit
There is a nice diagram of the flow of partial derivatives here (also see subsequent slides).
In this example, all x_t terms represent the external input to that layer at time step t. In a multi-layer LSTM, this includes the output of the LSTM layer below the current one. To propagate the error to the layer below, differentiate each gate's net input with respect to x_t (i.e., go through the gate weights that multiply x_t) and apply the chain rule.
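As a minimal sketch of that step, assuming the usual parameterization from such slides (input gate i, forget gate f, cell input g, output gate o, with input-to-gate weight matrices W_i, W_f, W_g, W_o acting on x_t; the function name and shapes below are mine, not from the dissertation):

```python
import numpy as np


def backprop_to_lower_layer(d_i, d_f, d_g, d_o, W_i, W_f, W_g, W_o):
    """Map one LSTM layer's gate pre-activation deltas back to its input x_t,
    which in a stacked LSTM is the output of the layer below.

    d_i, d_f, d_g, d_o : deltas w.r.t. the net inputs of the input gate,
        forget gate, cell input and output gate, each of shape (hidden_size,)
    W_i, W_f, W_g, W_o : input-to-gate weight matrices, shape (hidden_size, input_size)
    """
    # net_gate = W_gate @ x_t + (recurrent and bias terms), so by the chain rule
    # dE/dx_t accumulates W_gate.T @ d_gate for every gate that x_t feeds.
    return W_i.T @ d_i + W_f.T @ d_f + W_g.T @ d_g + W_o.T @ d_o


# Tiny usage example with random numbers (shapes only, no training loop).
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
deltas = [rng.normal(size=hidden_size) for _ in range(4)]
weights = [rng.normal(size=(hidden_size, input_size)) for _ in range(4)]
d_x = backprop_to_lower_layer(*deltas, *weights)  # shape (input_size,)
print(d_x)  # this vector plays the role of e_k(t) for the layer below
```

The W^T multiplications are just the chain rule through net = W x_t + ...; summing over the four gates gives the total error that the layer below sees at time step t.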
The reason why the derivation for the output gate weights is different is that the output gate is applied after the cell state is updated, whereas the cell input, the input gate and the forget gate act before (i.e., inside) the state update. This matters when computing the gradients.
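Concretely, in notation close to the dissertation's (with f_j standing in for the forget gate), the block output and cell state are

$$ y^{c_j^v}(t) = y^{out_j}(t)\, h\!\left(s_{c_j^v}(t)\right), \qquad s_{c_j^v}(t) = y^{f_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right). $$

The output gate multiplies the already-updated state, so its gradient only needs h(s_{c_j^v}(t)) at the current step, while the input gate, forget gate and cell input act inside s_{c_j^v}(t), so their gradients go through the recursive \partial s_{c_j^v}(t)/\partial w terms, which is the truncated-RTRL part of Eqs. (3.10) -- (3.12).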