
Consider the following network:

[Figure: network diagram — one input $x$, a single hidden ReLU unit, and a single linear output unit.]

There are two weights, say $w_1$ and $w_2$, and two biases, $b_1$ and $b_2$. The hidden layer has a ReLU activation function $g^{(1)}$ and the output layer has a linear activation function $g^{(2)}$.

Say we have a dataset of $M$ data points, each of the form $(x^{[m]}, y^{[m]})$ (1D inputs and outputs).

The network's predictions are \begin{align*} \hat{y}^{[m]} = g^{(2)}\left(w_2 \left[g^{(1)} \left(w_1 x^{[m]} + b_1 \right)\right] + b_2\right) = g^{(2)}\left(w_2 {a_1^{(1)}}^{[m]}+ b_2\right) \,. \end{align*} Assuming a linear activation function in the output layer, this simplifies to \begin{align*} \hat{y}^{[m]} =w_2 \left[g^{(1)} \left(w_1 x^{[m]} + b_1 \right)\right] + b_2 = w_2 {a_1^{(1)}}^{[m]} + b_2 \,. \end{align*} The MSE loss is $$L_\text{MSE}=\frac{1}{M}\sum_{m=1}^M L_m = \frac{1}{M} \sum_{m=1}^M \left(\hat{y}^{[m]} - y^{[m]} \right)^2 \,.$$

The derivatives of the MSE loss with respect to the network parameters are: $$ \frac{\partial L_\text{MSE}}{\partial b_2} = \frac{1}{M} \sum_{m=1}^M \frac{\partial}{\partial b_2} \left[\left(\hat{y}^{[m]} - y^{[m]} \right)^2 \right] = \frac{1}{M} \sum_{m=1}^M 2\left(\hat{y}^{[m]} - y^{[m]} \right) \cdot 1 \, $$ $$\frac{\partial L_\text{MSE}}{\partial w_2} = \frac{1}{M} \sum_{m=1}^M \frac{\partial}{\partial w_2} \left[\left(\hat{y}^{[m]} - y^{[m]} \right)^2 \right] = \frac{1}{M} \sum_{m=1}^M 2\left(\hat{y}^{[m]} - y^{[m]} \right) {a_1^{(1)}}^{[m]} \,$$ $$\frac{\partial L_\text{MSE}}{\partial b_1} = \frac{1}{M} \sum_{m=1}^M \frac{\partial}{\partial b_1} \left[\left(\hat{y}^{[m]} - y^{[m]} \right)^2 \right] = \frac{1}{M} \sum_{m=1}^M 2\left(\hat{y}^{[m]} - y^{[m]} \right) w_2 \hspace{0.2em}g^{(1)^{'}}\left(w_1 x^{[m]} + b_1 \right) \,$$ $$ \frac{\partial L_\text{MSE}}{\partial w_1} = \frac{1}{M} \sum_{m=1}^M \frac{\partial}{\partial w_1} \left[\left(\hat{y}^{[m]} - y^{[m]} \right)^2 \right]= \frac{1}{M} \sum_{m=1}^M 2\left(\hat{y}^{[m]} - y^{[m]} \right) w_2 \hspace{0.2em}g^{(1)^{'}}\left(w_1 x^{[m]} + b_1 \right) x^{[m]} \,.$$
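For concreteness, here is a minimal NumPy sketch of this forward pass and the four gradient formulas (the names `relu`, `forward`, and `mse_gradients` are just illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    # derivative of ReLU (taking the value 0 at z = 0)
    return (z > 0).astype(float)

def forward(x, w1, b1, w2, b2):
    z1 = w1 * x + b1        # hidden pre-activation
    a1 = relu(z1)           # a_1^(1): ReLU activation
    y_hat = w2 * a1 + b2    # linear output layer
    return z1, a1, y_hat

def mse_gradients(x, y, w1, b1, w2, b2):
    z1, a1, y_hat = forward(x, w1, b1, w2, b2)
    err = 2.0 * (y_hat - y) / len(x)   # common factor 2(y_hat - y)/M
    dL_db2 = np.sum(err)
    dL_dw2 = np.sum(err * a1)
    dL_db1 = np.sum(err * w2 * relu_prime(z1))
    dL_dw1 = np.sum(err * w2 * relu_prime(z1) * x)
    return dL_dw1, dL_db1, dL_dw2, dL_db2
```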

I'm trying to find a combination of parameters (weights and biases) and a dataset (of any size $M > 1$) such that $\displaystyle \frac{\partial L_\text{MSE}}{\partial b_2}$ and $\displaystyle\frac{\partial L_\text{MSE}}{\partial w_2}$ are 0 but $\displaystyle \frac{\partial L_\text{MSE}}{\partial b_1}$ and $\displaystyle \frac{\partial L_\text{MSE}}{\partial w_1}$ are not equal to 0.

However, I've been struggling to solve this problem and find a combination that works. Any help would be much appreciated.

  • Does this answer your question? Gradients of lower layers of NN when gradient of an upper layer is 0? Commented Aug 17, 2023 at 15:25
  • @Broele Hi, it doesn't quite, no. In that question, you say "If we consider the (accumulated) gradients of a batch, then your assumption is not true." I am trying to find an example of this, one which shows the upper layer having 0 gradients but not the lower layers, hence my question. Commented Aug 17, 2023 at 16:12
  • My mistake. I hope my answer helps. Commented Aug 17, 2023 at 21:58

1 Answer


There are simple examples. I will give one here and then explain the intuition for how to find it.

Example

Set $w_1=w_2=1$ and $b_1=b_2=-1$.

With $M=3$, take $x^{[m]}=(0,2,3)$ and $y^{[m]}=(-2,2,0)$.

This leads to: $$\begin{align} a^{(1)^{[m]}} &= (0, 1, 2)\\ \hat{y}^{[m]} &= (-1, 0, 1)\\ (\hat{y}^{[m]} - y^{[m]}) &= (1,-2, 1)\\ \frac{\partial L_{\mathrm{MSE}}}{\partial b_2} &= \frac{2}{M}(1-2+1)=0\\ \frac{\partial L_{\mathrm{MSE}}}{\partial w_2} &= \frac{2}{M}(1\cdot 0-2\cdot 1+1\cdot 2)=0\\ \frac{\partial L_{\mathrm{MSE}}}{\partial b_1} &= \frac{2}{M}(1\cdot 0-2\cdot 1+1\cdot 1)=-\frac{2}{3}\\ \frac{\partial L_{\mathrm{MSE}}}{\partial w_1} &= \frac{2}{M}(1\cdot 0\cdot 0-2\cdot 1\cdot 2+1\cdot 1\cdot 3)=-\frac{2}{3}\\ \end{align}$$
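As a quick sanity check (assuming PyTorch is available, though any autograd framework or the closed-form gradients above would do), the following reproduces these numbers:

```python
import torch

# Parameters from the example; requires_grad so autograd tracks them.
w1 = torch.tensor(1.0, requires_grad=True)
b1 = torch.tensor(-1.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)
b2 = torch.tensor(-1.0, requires_grad=True)

x = torch.tensor([0.0, 2.0, 3.0])
y = torch.tensor([-2.0, 2.0, 0.0])

# Forward pass: ReLU hidden layer, linear output, MSE loss.
y_hat = w2 * torch.relu(w1 * x + b1) + b2
loss = torch.mean((y_hat - y) ** 2)
loss.backward()

print(b2.grad, w2.grad)  # tensor(0.) tensor(0.)
print(b1.grad, w1.grad)  # tensor(-0.6667) tensor(-0.6667)
```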

How to find it
  1. The first observation is that, for each sample, every gradient is multiplied by the factor $2(\hat{y}-y)$. So we can start by setting $y=\hat{y}-0.5$ to make this factor $1$ (we will change it later).

  2. Now we need to find a setting ($w_1, w_2, b_1, b_2$ and samples $x^{[m]}$) such that the per-sample gradient vectors admit a weighted combination $$\sum_m \lambda^{[m]}\cdot\left[\frac{\partial L_m}{\partial w_1}, \frac{\partial L_m}{\partial b_1}, \frac{\partial L_m}{\partial w_2}, \frac{\partial L_m}{\partial b_2}\right]=(r_1,r_2,0,0)$$ with $r_1\neq 0\neq r_2$; the weights $\lambda^{[m]}$ will later be realized through the factors $2(\hat{y}^{[m]}-y^{[m]})$.

  3. If $a^{(1)}=0$ for a sample, then every gradient except the one for $b_2$ is zero for that sample. This means we can drop $\frac{\partial L_{\mathrm{MSE}}}{\partial b_2}$ from the linear equation above for the moment, and later add a sample that moves that gradient to 0 ($x=0$ in our example).

  4. $w_1$ and $w_2$ are just linear factors and can mostly be ignored for building an example, so we set them to $1$.

  5. The value of $b_2$ doesn't matter for the gradients (as long as we can choose $y$ freely).

  6. This is trial and error: try to find $x_1$, $x_2$, and $b_1$ so that a linear combination of the per-sample gradients gives $(r_1,r_2,0)$.

  7. Add the sample for $b_2$ (see step 3) with a suitable weight.

  8. Change $y$ so that it reproduces the weights of the linear combination we found. (A code sketch of this recipe is given below.)
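Here is a short sketch of the recipe in code (assuming NumPy; the particular choices of $x$, $b_1$, and the combination weights are just one possibility and happen to reproduce the example above):

```python
import numpy as np

# One way to turn the recipe into code. The concrete numbers below are
# just one possible choice and reproduce the example above.

w1, w2, b1, b2 = 1.0, 1.0, -1.0, -1.0   # steps 4 and 5

# Two samples with an active ReLU, plus one "dead" sample: x = 0 gives
# z1 = -1 < 0, so it only contributes to the b2 gradient (step 3).
x = np.array([0.0, 2.0, 3.0])
z1 = w1 * x + b1
a1 = np.maximum(z1, 0.0)                # (0, 1, 2)
y_hat = w2 * a1 + b2                    # (-1, 0, 1)

# Steps 2 and 6: choose combination weights lam = y_hat - y for the active
# samples so that sum(lam * a1) = 0 (kills dL/dw2) while sum(lam) and
# sum(lam * x) over the active samples stay nonzero, e.g. (-2, 1).
# Step 7: the dead sample then gets lam = -(-2 + 1) = 1 to kill dL/db2.
lam = np.array([1.0, -2.0, 1.0])

# Step 8: back out the targets from the chosen weights.
y = y_hat - lam
print(y)  # [-2.  2.  0.]  -- the dataset used in the example

# Check all four gradients.
err = 2.0 * lam / len(x)
print(np.sum(err), np.sum(err * a1))                                  # 0.0 0.0
print(np.sum(err * w2 * (z1 > 0)), np.sum(err * w2 * (z1 > 0) * x))   # -2/3 -2/3
```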

  • This is perfect, exactly what I was looking for! Very cool approach too, I definitely wouldn't have thought of that myself. Thanks a lot :) Commented Aug 18, 2023 at 10:24
  • Note that with a different architecture (e.g. classification with a sigmoid output and cross-entropy loss, or different activation functions) the approach might not work. I am not even sure that an example exists in every case. Commented Aug 18, 2023 at 17:26
