Questions tagged [backpropagation]
Backpropagation, short for "backward propagation of errors", is a common method for computing the gradients used to train artificial neural networks, typically in conjunction with an optimization method such as gradient descent.
503 questions
1 vote
1 answer
56 views
Bayes-by-backprop - meaning of partial derivative
The Google DeepMind paper "Weight Uncertainty in Neural Networks" features the following algorithm: Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
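A worked chain-rule expansion may clarify where that term comes from; this is a sketch assuming the paper's reparameterisation $w = \mu + \log(1+e^{\rho})\circ\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$ and $\theta=(\mu,\rho)$:
$$\frac{\mathrm{d}f}{\mathrm{d}\mu} = \frac{\partial f(w,\theta)}{\partial w}\frac{\partial w}{\partial \mu} + \frac{\partial f(w,\theta)}{\partial \mu} = \frac{\partial f(w,\theta)}{\partial w} + \frac{\partial f(w,\theta)}{\partial \mu}, \qquad \frac{\mathrm{d}f}{\mathrm{d}\rho} = \frac{\partial f(w,\theta)}{\partial w}\,\frac{\epsilon}{1+e^{-\rho}} + \frac{\partial f(w,\theta)}{\partial \rho},$$
since $\partial w/\partial\mu = 1$ and $\partial\log(1+e^{\rho})/\partial\rho = 1/(1+e^{-\rho})$. The $\partial f/\partial w$ part is the ordinary backpropagated gradient through the sampled weight; the remaining terms come from $\theta$ appearing directly in $f$.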
0 votes
0 answers
44 views
Confusion about the same-sign gradient problem of the sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
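A minimal NumPy sketch (an illustrative single neuron, not taken from the question) of why this happens: if a neuron's inputs are sigmoid outputs, they are all strictly positive, so every weight gradient $\partial L/\partial w_i = \delta\, x_i$ shares the sign of the upstream gradient $\delta$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = sigmoid(rng.normal(size=5))   # previous-layer activations: all in (0, 1)
delta = -0.7                      # upstream gradient dL/d(pre-activation), any sign
grad_w = delta * x                # dL/dw_i = delta * x_i
print(np.sign(grad_w))            # every entry has the same sign as delta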
4 votes
1 answer
81 views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as: $$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$ where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix, $U_o \in \mathbb{...
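Assuming $h_{t-1}\in\mathbb{R}^{m}$ and writing $a_t = W_o x_t + U_o h_{t-1} + b_o$ and $\delta_{o_t} = \partial L/\partial o_t \in \mathbb{R}^{m}$ (a sketch of the standard single-step result; in full BPTT these terms are summed over $t$), the weight gradients are outer products, which fixes their dimensions:
$$\frac{\partial L}{\partial W_o} = \bigl(\delta_{o_t}\odot\sigma'(a_t)\bigr)\,x_t^{\top}\in\mathbb{R}^{m\times d},\qquad \frac{\partial L}{\partial U_o} = \bigl(\delta_{o_t}\odot\sigma'(a_t)\bigr)\,h_{t-1}^{\top}\in\mathbb{R}^{m\times m},\qquad \frac{\partial L}{\partial b_o} = \delta_{o_t}\odot\sigma'(a_t)\in\mathbb{R}^{m}.$$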
3 votes
2 answers
125 views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063 It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
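A minimal PyTorch sketch of that choice (toy data and sizes, not from the linked answer): the network is unrolled for $N$ steps inside each chunk, and detaching the hidden state between chunks means gradients reach back at most $N$ steps.

import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=3, hidden_size=3)   # hidden_size == input_size keeps the sketch short
seq = torch.randn(100, 1, 3)                      # toy sequence: 100 time steps, batch size 1
optimizer = torch.optim.SGD(rnn.parameters(), lr=1e-2)

N = 20                                            # lookback window: unroll this many steps
h = torch.zeros(1, 1, 3)
for start in range(0, seq.size(0) - N, N):
    chunk = seq[start:start + N]
    target = seq[start + 1:start + N + 1]         # predict the next step
    h = h.detach()                                # cut the graph: gradients flow back at most N steps
    out, h = rnn(chunk, h)
    loss = torch.nn.functional.mse_loss(out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()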
4 votes
1 answer
66 views
How to prove that Q of the attention mechanism represents the 'search intent'?
It is said that $Q$ represents the "search intent" and $K$ represents the "available information" in the attention mechanism. $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
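A minimal NumPy sketch of the formula (illustrative shapes, not from the question) makes that reading concrete: each row of $Q$ scores every row of $K$, and the softmax weights decide how much of each row of $V$ is mixed into the output.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_q, n_k, d_k, d_v = 4, 6, 8, 5
Q = rng.normal(size=(n_q, d_k))    # queries: what each position is "looking for"
K = rng.normal(size=(n_k, d_k))    # keys: what each position offers for matching
V = rng.normal(size=(n_k, d_v))    # values: the content that actually gets mixed

scores = Q @ K.T / np.sqrt(d_k)    # similarity of every query with every key
attn = softmax(scores, axis=-1)    # each row sums to 1
out = attn @ V                     # weighted average of values, one row per query
print(out.shape)                   # (4, 5)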
0 votes
0 answers
54 views
Understanding Backpropagation in Convolutional layer
I need help understanding backpropagation in the convolutional layer. From what I know so far, the forward phase is as follows: where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
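The question's figure is not shown here, but a minimal single-channel sketch (5×5 input, 3×3 filter, so a 3×3 feature map like $A_{3\times3\times1}$) illustrates the key backward-pass fact: the gradient with respect to the filter is itself a cross-correlation of the input with the upstream gradient.

import numpy as np

def corr2d(x, k):
    h, w = k.shape
    out = np.zeros((x.shape[0] - h + 1, x.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 5))      # input patch
W = rng.normal(size=(3, 3))      # filter
A = corr2d(X, W)                 # forward: 3x3 feature map

dA = rng.normal(size=A.shape)    # placeholder upstream gradient dL/dA
dW = corr2d(X, dA)               # backward: dL/dW[p, q] = sum_{i,j} dA[i, j] * X[i+p, j+q]
print(dW.shape)                  # (3, 3), same shape as the filter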
0 votes
0 answers
36 views
Score Matching Algorithm
I've been reading about score matching and I have a very basic question about how one would (naively) implement the algorithm via gradient descent. Say I have some sort of neural network that ...
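For reference, the Hyvärinen score-matching objective that such a gradient-descent implementation would minimise (the standard form, which may differ in detail from the question's setup) is
$$J(\theta) = \mathbb{E}_{p_{\text{data}}(x)}\!\left[\operatorname{tr}\!\bigl(\nabla_x s_\theta(x)\bigr) + \tfrac{1}{2}\lVert s_\theta(x)\rVert_2^2\right], \qquad s_\theta(x) = \nabla_x \log p_\theta(x),$$
so each descent step needs both the network's score output and the trace of its Jacobian with respect to the input.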
3 votes
1 answer
124 views
Check through calculations whether the gradients will explode or vanish
I'm reviewing old exam questions and came across this one: Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers and ReLU activations. The input to the ...
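The exam's exact numbers are cut off above, but a quick numerical check (illustrative widths and weight scales only) shows the calculation the question is after: push a gradient back through 10 ReLU layers and see how its norm behaves for different weight scales.

import numpy as np

def grad_norm_through_mlp(weight_std, depth=10, width=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)
    Ws, masks = [], []
    for _ in range(depth):                      # forward pass, remembering the ReLU masks
        W = rng.normal(scale=weight_std, size=(width, width))
        x = np.maximum(W @ x, 0.0)
        Ws.append(W)
        masks.append(x > 0)
    g = np.ones(width)                          # pretend dL/d(output) is all ones
    for W, mask in zip(reversed(Ws), reversed(masks)):
        g = W.T @ (g * mask)                    # backprop through ReLU, then the linear map
    return np.linalg.norm(g)

print(grad_norm_through_mlp(weight_std=0.01))            # tiny weights -> gradient vanishes
print(grad_norm_through_mlp(weight_std=np.sqrt(2/100)))  # He-style scale -> roughly stable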
0 votes
0 answers
89 views
Analytically solving backpropagation through time for a simple gated RNN
Consider the following simple gated RNN: \begin{aligned} c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr) \\[6pt] z_{t} &= c_{t} \,\odot\, z_{t-1} \;\;+\;\; (1 - c_{t}) \,\odot\,\...
1 vote
0 answers
62 views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
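Backpropagation itself does not shrink gradients just because a layer is earlier; whether early layers update less depends on the activation functions and weight scales along the way. An illustrative NumPy sketch with sigmoid hidden layers (where each derivative is at most 0.25) shows the shrinking effect the question has in mind.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width = 8, 50
Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(depth)]

x = rng.normal(size=width)
pre, post = [], [x]
for W in Ws:                                    # forward pass
    z = W @ post[-1]
    pre.append(z)
    post.append(sigmoid(z))

g = np.ones(width)                              # pretend dL/d(output) is all ones
for layer in range(depth - 1, -1, -1):          # backward pass
    g = g * sigmoid(pre[layer]) * (1 - sigmoid(pre[layer]))   # through the sigmoid
    dW = np.outer(g, post[layer])               # this layer's weight gradient
    print(f"layer {layer}: ||dW|| = {np.linalg.norm(dW):.2e}")
    g = Ws[layer].T @ g                         # pass the gradient to the previous layer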
1 vote
0 answers
45 views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
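The identity usually behind such claims (and, as far as I can tell, the one this paper builds on; ignoring the small $\epsilon$ in the batch-norm denominator) is that batch normalization is invariant to scaling the preceding weights, so the gradient scales inversely: for $\alpha > 0$,
$$\mathrm{BN}(\alpha W x) = \mathrm{BN}(W x) \quad\Longrightarrow\quad \frac{\partial L}{\partial (\alpha W)} = \frac{1}{\alpha}\,\frac{\partial L}{\partial W},$$
which is why, in the paper's argument, weight decay on $W$ mainly changes the effective learning rate rather than the function the layer computes.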
0 votes
0 answers
55 views
"Inflating" learning rates in diminishing gradient areas for NN training
In neural net training, nowadays tanh and sigmoid activation functions in hidden layers are avoided as they tend to "saturate" easily. Meaning that if the $x$ value plugged into tanh/sigmoid is ...
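To quantify that saturation (illustrative values only): $\tanh'(x) = 1-\tanh^2(x)$ is $1$ at $x=0$, about $0.07$ at $x=2$, and about $1.8\times10^{-4}$ at $x=5$, so a saturated unit multiplies the backpropagated gradient by a near-zero factor.

import numpy as np
for x in [0.0, 2.0, 5.0]:
    print(x, 1 - np.tanh(x) ** 2)   # tanh'(x): 1.0, ~0.07, ~1.8e-4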
4 votes
1 answer
410 views
Questions on backpropagation in a neural net
I understand how to apply backpropagation symbolically and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have two questions: Suppose certain ...
4 votes
2 answers
205 views
Avoiding tensors when differentiating with respect to weight matrices in backpropagation
Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation: $\textbf{Inputs}: x \in \mathbb{R}^n$ $\textbf{Weights}: W \in \mathbb{R}...
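For the single affine layer described, the usual way to avoid four-index tensors is to differentiate the scalar loss directly instead of forming the Jacobian $\partial y/\partial W$. Assuming $W\in\mathbb{R}^{m\times n}$, $y = Wx + b\in\mathbb{R}^{m}$ and writing $\delta = \partial L/\partial y$ (a sketch of the standard result; the question's remaining notation is cut off above):
$$\frac{\partial L}{\partial W} = \delta\, x^{\top}\in\mathbb{R}^{m\times n},\qquad \frac{\partial L}{\partial b} = \delta,\qquad \frac{\partial L}{\partial x} = W^{\top}\delta,$$
so everything stays a matrix or a vector.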
1 vote
0 answers
48 views
Calculating the gradient with the chain rule through additions [closed]
I am taking Karpathy's course; specifically, I am on the first video. There is a step in the development of micrograd that I don't fully understand, in the section where he talks about ...
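The referenced section is not shown here, but the point usually made at that stage of micrograd is that an addition node simply passes the upstream gradient through unchanged ($\partial(a+b)/\partial a = 1$), and that gradients must be accumulated with += because the same value can feed into several additions. A stripped-down sketch in the spirit of micrograd (not Karpathy's exact code):

class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(out)/d(self) = 1: just pass the gradient through
            other.grad += out.grad   # += accumulates when a value is reused in several sums
        out._backward = _backward
        return out

    def backward(self):
        order, seen = [], set()      # topological order: children before parents
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a = Value(2.0)
b = a + a                            # a is used twice
c = b + Value(3.0)
c.backward()
print(a.grad)                        # 2.0: both paths through the additions accumulate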