Questions tagged [backpropagation]
Use for questions about Backpropagation, which is commonly used in training Neural Networks in conjunction with an optimization method such as gradient descent.
301 questions
8 votes
2 answers
2k views
How does backpropagation in a transformer work?
Specifically to solve the problem of text generation, not translation. There is literally not a single discussion, blog post, or tutorial that explains the math behind this. My best guess so far is: ...
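For context, a minimal sketch (PyTorch assumed; `model` is a hypothetical decoder-only transformer returning logits of shape `[batch, seq, vocab]`) of how the text-generation loss is formed and backpropagated; autograd applies the chain rule through the attention and feed-forward layers, so no transformer-specific derivation is needed in practice:

```python
# Minimal sketch (PyTorch assumed): next-token prediction for a decoder-only
# transformer. `model` is a hypothetical module returning logits [batch, seq, vocab].
import torch
import torch.nn.functional as F

def training_step(model, tokens, optimizer):
    # Predict token t+1 from the tokens up to and including t.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                                  # [batch, seq-1, vocab]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # chain rule through every attention / feed-forward layer
    optimizer.step()
    return loss.item()
```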
1 vote
1 answer
40 views
Does scaling down the loss change the size of the updates during backpropagation?
If I did loss = loss/10 before calculating the gradient, would that change the amount of change applied to the model parameters during backpropagation? Or is ...
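As a quick illustration of what the question asks (a sketch, PyTorch assumed): with plain gradient descent, scaling the loss by 1/10 scales every gradient, and hence every update, by 1/10; adaptive optimizers such as Adam largely cancel a constant scale.

```python
# Sketch (PyTorch assumed): scaling the loss scales every gradient by the same factor.
import torch

w = torch.tensor([2.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([1.0])

((w * x - y) ** 2).sum().backward()
print(w.grad)                     # tensor([30.])

w.grad = None
(((w * x - y) ** 2).sum() / 10).backward()
print(w.grad)                     # tensor([3.]) -> a plain SGD step is 10x smaller
```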
2 votes
1 answer
105 views
Deriving the gradient of the hidden-to-hidden weights for backpropagation through time in a recurrent neural network
I'm currently working on deriving the gradients of a simple recurrent neural network's weights with respect to the loss, in order to update the weights through backpropagation. It's a super simple network, ...
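For reference, the standard BPTT result for a vanilla RNN with hidden state $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$ (notation assumed here, not taken from the question) sums one contribution per time step, with products of Jacobians carrying the error backwards:

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W_{hh}}$$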
1 vote
0 answers
50 views
Backpropagation for a single parameter on a rather simple network
Given the following network: I'm asked to write the backpropagation process for the $b_3$ parameter, where the loss function is $L(y,z_3)=(z_3-y)^2$. I'm not supposed to calculate any of the weights ...
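In general terms (a sketch; the network's exact structure is not shown here), the chain rule gives

$$\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_3}\,\frac{\partial z_3}{\partial b_3} = 2(z_3 - y)\,\frac{\partial z_3}{\partial b_3},$$

and if $b_3$ enters $z_3$ additively as a bias, then $\partial z_3/\partial b_3 = 1$.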
0 votes
1 answer
37 views
Why no scale parameter for skip connection addition?
For a simple skip connection $y = x@w + x$, the gradient $\partial y/\partial x$ will be $w+1$: $$\frac{\partial y}{\partial x} = w + 1$$ Is the $+1$ a bit too large, and can it overpower $...
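A quick autograd check of the scalar case stated in the question (PyTorch assumed); the skip path contributes exactly $1$ regardless of $w$:

```python
# Sanity check (PyTorch assumed): for scalar x and w, d(x*w + x)/dx = w + 1.
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.5)
(x * w + x).backward()
print(x.grad)   # tensor(1.5000), i.e. w + 1; the identity branch always adds 1
```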
0 votes
1 answer
157 views
Why not backpropagate through time in LSTM, similar to RNN?
I'm trying to implement RNN and LSTM, many-to-many architecture. I reasoned through why BPTT is necessary in RNNs, and it makes sense. But what doesn't make sense to me is that most of the resources I went ...
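For what it's worth, a minimal sketch (PyTorch assumed) showing that BPTT applies to an LSTM exactly as it does to a vanilla RNN: the loss over the unrolled sequence is backpropagated through every time step by autograd.

```python
# Sketch (PyTorch assumed): many-to-many LSTM; backward() unrolls through all steps (BPTT).
import torch
import torch.nn as nn
import torch.nn.functional as F

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

x = torch.randn(2, 5, 4)            # batch of 2 sequences, 5 time steps each
targets = torch.randn(2, 5, 1)

outputs, _ = lstm(x)                # one output per time step (many-to-many)
loss = F.mse_loss(head(outputs), targets)
loss.backward()                     # gradients flow back through all 5 time steps
```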
0 votes
0 answers
105 views
Doubts about a custom loss function for regression problems
From what I read, I know we don't use log loss or cross entropy for regression problems. However, the entire logic behind binary cross entropy (say) is to first squeeze y_hat between 0 and 1 (...
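As a small illustration of the distinction the question touches on (a sketch, PyTorch assumed): binary cross entropy treats predictions and targets as probabilities in $[0,1]$, while regression targets are typically unbounded, which is why MSE (or similar) is used instead.

```python
# Sketch (PyTorch assumed): BCE needs predictions/targets in [0, 1]; MSE does not.
import torch
import torch.nn.functional as F

y_hat = torch.sigmoid(torch.tensor([0.3, -1.2]))   # squeezed into (0, 1)
y_cls = torch.tensor([1.0, 0.0])                   # valid classification targets
print(F.binary_cross_entropy(y_hat, y_cls))

y_pred = torch.tensor([3.5, -11.0])                # unbounded regression prediction
y_reg = torch.tensor([3.7, -12.0])                 # unbounded regression target
print(F.mse_loss(y_pred, y_reg))
```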
1 vote
0 answers
72 views
ReLU derivative value
I have a stupid question about the derivative of the ReLU activation function. After finding the difference between the true output $t_k$ and the predicted output $a_k$, why is the value of $da_3/dz_3$ ...
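For reference, with $a_3 = \mathrm{ReLU}(z_3) = \max(0, z_3)$ (notation assumed from the question), the derivative is

$$\frac{da_3}{dz_3} = \begin{cases}1, & z_3 > 0\\ 0, & z_3 < 0,\end{cases}$$

with an arbitrary conventional value (usually $0$) chosen at $z_3 = 0$.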