I'm learning about backpropagation in neural networks for the first time; we're using stochastic gradient descent.
The lecture provides incomplete detail on computing the derivatives for the final layer.
We have the following chain-rule expression for the partial derivative of the error with respect to a given weight:
$$ \frac{\partial \mathrm{e}(\mathbf{w})}{\partial w_{i j}^{(l)}}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{i j}^{(l)}} $$
$s$ stands for the signal (the sum of the weights $w$ times the inputs $x$ from the previous layer). The second partial derivative, that of the signal with respect to the weight, is simple: $\frac{\partial s_j^{(l)}}{\partial w_{i j}^{(l)}}=x_i^{(l-1)}$.
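To be explicit about the notation (this is just how I understand it, in case my indexing is off):

$$ s_j^{(l)}=\sum_{k} w_{k j}^{(l)}\, x_k^{(l-1)}, $$

so only the $k=i$ term depends on $w_{i j}^{(l)}$, which is where $\frac{\partial s_j^{(l)}}{\partial w_{i j}^{(l)}}=x_i^{(l-1)}$ comes from.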
So we only need $\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_j^{(l)}}=\delta_j^{(l)}$.
For the final layer, $l=L$ and $j=1$, so I'm trying to solve for $\delta_1^{(L)}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_1^{(L)}}$.
The error function is $\mathrm{e}(\mathbf{w})=\left(x_1^{(L)}-y_n\right)^2$, where $x_1^{(L)}=\theta\left(s_1^{(L)}\right)$.
Our $\theta$ is tanh, so $\theta^{\prime}(s)=1-\theta^2(s)$.
How do I calculate $\delta_1^{(L)}$? I'm guessing it's just another application of the chain rule, but I want to make sure I get it right before proceeding with backpropagation on the earlier layers.
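Here is my attempt, spelled out, in case someone can confirm or correct it:

$$ \delta_1^{(L)}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_1^{(L)}}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial x_1^{(L)}} \times \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}}=2\left(x_1^{(L)}-y_n\right)\,\theta^{\prime}\left(s_1^{(L)}\right)=2\left(x_1^{(L)}-y_n\right)\left(1-\left(x_1^{(L)}\right)^2\right). $$

To sanity-check that guess numerically, I put together a small finite-difference test (just a throwaway sketch with arbitrary values for the signal and target, assuming a single tanh output unit):

```python
import numpy as np

s = 0.7   # arbitrary signal value at the single output node
y = 0.3   # arbitrary target y_n

def error(s_val):
    x = np.tanh(s_val)       # x_1^(L) = theta(s_1^(L)), theta = tanh
    return (x - y) ** 2      # e(w) = (x_1^(L) - y_n)^2

# My guessed delta: 2 * (x - y) * (1 - x^2)
x = np.tanh(s)
delta_guess = 2.0 * (x - y) * (1.0 - x ** 2)

# Central finite-difference approximation of de/ds at the same point
eps = 1e-6
delta_numeric = (error(s + eps) - error(s - eps)) / (2.0 * eps)

print(delta_guess, delta_numeric)  # these should agree closely if my formula is right
```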