
I'm looking for a VERY DETAILED demonstration of the backpropagation algorithm in neural network machine learning, specifically the step below.

I've got Michael Nielsen's excellent demonstration, but I struggle to understand the step between formula (40):

$$\delta_j^L = \dfrac { \partial C} {\partial {z_j^L}} $$

and formula (41):

$$\delta_j^L = \sum_k \dfrac { \partial C} {\partial {a_k^L}} \dfrac {\partial {a_k^L}} {\partial {z_j^L}} $$

Which then gives (I understand this last step):

$$ \delta_j^L = \dfrac { \partial C} {\partial {a_j^L}} \dfrac {\partial {a_j^L}} {\partial {z_j^L}} $$

I suppose it's linked to the chain rule. I've seen the partial derivative of a sum of two vectors, but not the kind of sum in my example above.

Any help?

  • Maybe what I need is a demonstration of this property of the chain rule: calculus.subwiki.org/wiki/… Commented May 9, 2015 at 15:21

1 Answer


The step between Equations 40 and 41 is, as you have guessed, an application of the chain rule for multivariable functions. If $C$ depends on $z$ only through $a_1, \dots, a_K$, then we have

$$\frac{dC}{dz} = \sum_k \frac{\partial C}{\partial a_k} \frac{d a_k}{d z}. $$
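Applied to the step in your question (a sketch, assuming, as in the book's setup, that $C$ is written as a function of the output activations $a_1^L, \dots, a_K^L$), this reads

$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L},$$

which is Equation 41. If the output activation is applied elementwise, e.g. $a_k^L = \sigma(z_k^L)$, then $\partial a_k^L / \partial z_j^L = 0$ whenever $k \neq j$, so only the $k = j$ term survives and the sum collapses to the last expression in your question.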

Here is a simple example:

$$C = x^2 + xy, \quad x = 2z, \quad y = z^2.$$

The chain rule allows us to compute the derivative of $C$ with respect to $z$ as

\begin{align} \frac{dC}{dz} &= \frac{\partial C}{\partial x} \frac{d x}{d z} + \frac{\partial C}{\partial y} \frac{d y}{d z} \\ &= (2x + y) 2 + x (2z) \\ &= 4x + 2y + 2xz \\ &= 8z + 6z^2, \end{align}

which is the same result as the one we get by replacing $x$ and $y$ first and computing the derivative of $(2z)^2 + 2z^3$ directly.
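As a quick numerical sanity check of this example, the analytic derivative $8z + 6z^2$ can be compared against a finite-difference approximation; the sketch below is just an illustration (the function names and the step size $h$ are arbitrary choices):

```python
# Chain-rule example: C depends on z through x = 2z and y = z^2,
# with C = x^2 + x*y, so dC/dz should equal 8z + 6z^2.

def C(z):
    x = 2 * z       # x = 2z
    y = z ** 2      # y = z^2
    return x ** 2 + x * y

def dC_dz_analytic(z):
    return 8 * z + 6 * z ** 2

def dC_dz_numeric(z, h=1e-6):
    # Central finite difference: (C(z + h) - C(z - h)) / (2h)
    return (C(z + h) - C(z - h)) / (2 * h)

for z in [0.0, 0.5, 1.0, 2.0]:
    print(z, dC_dz_analytic(z), dC_dz_numeric(z))
```

The two derivatives agree up to small floating-point error.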

  • Thanks. So that's what I thought. Do you have a link to a demonstration of this chain rule for multivariable functions? Commented May 10, 2015 at 16:31
  • @Lucas Why is there a $k$ in the formulation? I think eliminating the summation and turning $k$ into $j$ would be more appropriate, since the error of a unit $j$ is only related to its own activation, not to the other activations in the same layer. Where does $k$ come from? What does it represent? Commented May 10, 2015 at 21:47
  • @tmangin: What is wrong with the link you gave in the comment on your question? That seems like a good demonstration to me. Commented May 11, 2015 at 7:43
  • @yasin.yazici: It doesn't matter whether I call the index $k$, $i$, $j$ or something else. It will always refer to $a_1$, $a_2$, $a_3$, and so on. Commented May 11, 2015 at 7:46
  • @Lucas What I mean is the $j$ and $k$ in the question, not in your post. $\frac{\partial a_k^L}{\partial z_j^L}$ is nonzero only if $j=k$. Hence the summation over the other indices is redundant, isn't it? Commented May 11, 2015 at 8:34
