I am reading the paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification and I can't understand why (see p.4):

$E[x^2_l] = \frac{1}{2}Var[y_{l-1}]$

Let $w_{l-1}$ (weights of layer $l-1$) have a symmetric distribution around zero. Let $b_{l-1} = 0$ (the biases of layer $l-1$).

Then, $$y_{l-1} = w_{l-1}\cdot x_{l-1} + b_{l-1}$$

has zero mean and a symmetric distribution around zero.

What I don't get is why, after passing $y_{l-1}$ through the ReLU, the output $x_l = \max(0, y_{l-1})$ satisfies the quoted equation.

I also checked the following lecture, where at 48:07 Andrej Karpathy says that ReLU halves the variance. Any ideas?
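For what it's worth, the identity is easy to check numerically. Here is a quick Monte Carlo sketch (my own, assuming a standard normal $y_{l-1}$ as the zero-mean symmetric distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# y_{l-1}: zero mean, symmetric around zero (standard normal is one such choice)
y = rng.standard_normal(1_000_000)

# x_l = ReLU(y_{l-1})
x = np.maximum(0.0, y)

lhs = np.mean(x ** 2)     # E[x_l^2]
rhs = 0.5 * np.var(y)     # (1/2) Var[y_{l-1}]
print(lhs, rhs)           # the two agree up to Monte Carlo error
```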

1 Answer

For context, I'm quoting the relevant part from the left column of page 4 (above Equation (8)) of the paper you mentioned:

If we let $w_{l−1}$ have a symmetric distribution around zero and $b_{l−1} = 0$, then $y_{l−1}$ has zero mean and has a symmetric distribution around zero. This leads to $E[x_l^2] = \frac{1}{2}\text{Var}[y_{l-1}]$ when $f$ is ReLU.

We are also given that $x_l = \max(0,y_{l-1})$. By the law of total expectation, \begin{align} E[x_l^2] &= p(y_{l-1} \geq 0)\,E[x_l^2 \mid y_{l-1} \geq 0] + p(y_{l-1} < 0)\, E[x_l^2 \mid y_{l-1} < 0]. \end{align}

When $y_{l-1} < 0$, $x_l = \max(0,y_{l-1}) = 0$, so $E[x_l^2 \mid y_{l-1} < 0] = 0$ and the second term vanishes: \begin{align} E[x_l^2] &= p(y_{l-1} \geq 0)\,E[x_l^2 \mid y_{l-1} \geq 0]. \end{align}

Because $y_{l-1}$ has a symmetric distribution around zero, $p(y_{l-1} \geq 0) = \frac{1}{2}$. So, \begin{align} E[x_l^2] &= \frac{1}{2}E[x_l^2 \mid y_{l-1} \geq 0]. \end{align}

Moreover, on the event $y_{l-1} \geq 0$ we have $x_l = \max(0,y_{l-1}) = y_{l-1}$, so \begin{align} E[x_l^2 \mid y_{l-1} \geq 0] &= E[y_{l-1}^2 \mid y_{l-1} \geq 0]. \end{align}

By the symmetry of the distribution of $y_{l-1}$ around zero, $y_{l-1}^2$ has the same conditional distribution given $y_{l-1} \geq 0$ as given $y_{l-1} < 0$, so $E[y_{l-1}^2 \mid y_{l-1} \geq 0] = E[y_{l-1}^2 \mid y_{l-1} < 0]$. Combining this with the law of total expectation, $$E[y_{l-1}^2] = p(y_{l-1} \geq 0)\,E[y_{l-1}^2 \mid y_{l-1} \geq 0] + p(y_{l-1} < 0)\,E[y_{l-1}^2 \mid y_{l-1} < 0],$$ and using $p(y_{l-1} \geq 0) = p(y_{l-1} < 0) = \frac{1}{2}$, we get \begin{align} E[x_l^2 \mid y_{l-1} \geq 0] &= E[y_{l-1}^2 \mid y_{l-1} \geq 0] \\ &= E[y_{l-1}^2]. \end{align}

Finally, because $y_{l-1}$ has zero mean, $E[y_{l-1}^2] = \text{Var}[y_{l-1}]$, and so $E[x_l^2] = \frac{1}{2}\text{Var}[y_{l-1}]$.
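The key step $E[y_{l-1}^2 \mid y_{l-1} \geq 0] = E[y_{l-1}^2]$ can also be checked numerically. A minimal sketch (my own, assuming a uniform distribution on $[-1,1]$, which is symmetric around zero but deliberately non-normal, to show that only the symmetry matters):

```python
import numpy as np

rng = np.random.default_rng(1)

# A symmetric but non-normal choice: uniform on [-1, 1]
y = rng.uniform(-1.0, 1.0, size=1_000_000)

p_pos = np.mean(y >= 0)          # p(y_{l-1} >= 0): symmetry gives ~ 1/2
cond = np.mean(y[y >= 0] ** 2)   # E[y_{l-1}^2 | y_{l-1} >= 0]
uncond = np.mean(y ** 2)         # E[y_{l-1}^2] = Var[y_{l-1}] (zero mean)
print(p_pos, cond, uncond)       # conditional and unconditional moments agree
```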

  • Thanks for this clear derivation. The only thing I am not sure I get is the justification of $E[y^2_{l-1} \mid y_{l-1} \geq 0] = E[y^2_{l-1}]$. Of course we don't get any new information about $y^2_{l-1}$ by learning that $y_{l-1} \geq 0$, but this doesn't mean that the conditional expectation equals the expectation. However, if we follow the law of total expectation we get: $E[y^2_{l-1}] = p(y_{l-1} < 0) E[y^2_{l-1} \mid y_{l-1} <0] + p(y_{l-1} \geq 0) E[y^2_{l-1} \mid y_{l-1} \geq 0]$. The equality of the conditional probabilities and expectations (symmetric distribution around 0) leads to the desired result. Commented Jun 9, 2023 at 18:13
  • @adosar not learning anything implies independence, but your method works too. I’ve updated my answer. Commented Jun 9, 2023 at 19:23
