
I'm looking for a VERY DETAILED demonstration of the backpropagation algorithm in neural network machine learning, specifically the step below.

I've got Michael Nielsen's excellent demonstration, but I struggle to understand the step between formula (40):

$$\delta_j^L = \dfrac { \partial C} {\partial {z_j^L}} $$

and formula (41):

$$\delta_j^L = \sum_k \dfrac { \partial C} {\partial {a_k^L}} \dfrac {\partial {a_k^L}} {\partial {z_j^L}} $$

Which then gives (I understand this last step):

$$ \delta_j^L = \dfrac { \partial C} {\partial {a_j^L}} \dfrac {\partial {a_j^L}} {\partial {z_j^L}} $$

I suppose it's linked to the chain rule. I've seen the partial derivative of a sum of two vectors, but not the kind of sum in my example above.

Any help?

  • Maybe what I need is a demonstration of this property of the chain rule: calculus.subwiki.org/wiki/… Commented May 9, 2015 at 15:21

1 Answer


The step between Equations 40 and 41 is, as you have guessed, an application of the chain rule for multivariable functions. If $C$ depends on $z$ only through $a_1, \dots, a_K$, then we have

$$\frac{dC}{dz} = \sum_k \frac{\partial C}{\partial a_k} \frac{d a_k}{d z}. $$
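Applied to the step in your question (a sketch, assuming, as in the book's setup, that $C$ is written as a function of the output activations $a_1^L, \dots, a_K^L$), this reads

$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L},$$

which is Equation 41. If the output activation is applied elementwise, e.g. $a_k^L = \sigma(z_k^L)$, then $\partial a_k^L / \partial z_j^L = 0$ whenever $k \neq j$, so only the $k = j$ term survives and the sum collapses to the last expression in your question.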

Here is a simple example:

$$C = x^2 + xy, \quad x = 2z, \quad y = z^2.$$

The chain rule allows us to compute the derivative of $C$ with respect to $z$ as

\begin{align} \frac{dC}{dz} &= \frac{\partial C}{\partial x} \frac{d x}{d z} + \frac{\partial C}{\partial y} \frac{d y}{d z} \\ &= (2x + y) 2 + x (2z) \\ &= 4x + 2y + 2xz \\ &= 8z + 6z^2, \end{align}

which is the same result as the one we get by replacing $x$ and $y$ first and computing the derivative of $(2z)^2 + 2z^3$ directly.
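As a quick numerical sanity check of this example, the analytic derivative $8z + 6z^2$ can be compared against a finite-difference approximation; the sketch below is just an illustration (the function names and the step size $h$ are arbitrary choices):

```python
# Chain-rule example: C depends on z through x = 2z and y = z^2,
# with C = x^2 + x*y, so dC/dz should equal 8z + 6z^2.

def C(z):
    x = 2 * z       # x = 2z
    y = z ** 2      # y = z^2
    return x ** 2 + x * y

def dC_dz_analytic(z):
    return 8 * z + 6 * z ** 2

def dC_dz_numeric(z, h=1e-6):
    # Central finite difference: (C(z + h) - C(z - h)) / (2h)
    return (C(z + h) - C(z - h)) / (2 * h)

for z in [0.0, 0.5, 1.0, 2.0]:
    print(z, dC_dz_analytic(z), dC_dz_numeric(z))
```

The two derivatives agree up to small floating-point error.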

  • Thanks. So that's what I thought. Do you have a link to a demonstration of this chain rule for multivariable functions? Commented May 10, 2015 at 16:31
  • @Lucas Why is there a $k$ in the formulation? I think eliminating the summation and turning $k$ into $j$ would be more appropriate, since the error of a unit $j$ is only related to its own activation, not to the other activations in the same layer. Where does $k$ come from? What does it represent? Commented May 10, 2015 at 21:47
  • @tmangin: What is wrong with the link you gave in the comment on your question? That seems like a good demonstration to me. Commented May 11, 2015 at 7:43
  • @yasin.yazici: It doesn't matter whether I call the index $k$, $i$, $j$ or something else. It will always refer to $a_1$, $a_2$, $a_3$, and so on. Commented May 11, 2015 at 7:46
  • @Lucas What I mean is the $j$ and $k$ in the question, not in your post. $\frac{\partial a_k^L}{\partial z_j^L}$ is nonzero only if $j=k$. Hence the summation over the other indices is redundant, isn't it? Commented May 11, 2015 at 8:34
