  • $\begingroup$ Very insightful answer! Regarding the task of sequence classification, do you mean that only the final-step loss is computed, because at each step $t$ only the hidden state $h_t$ is produced? $\endgroup$ Commented Mar 29, 2022 at 23:37
  • $\begingroup$ Sorry, it seems I didn't pay attention to the sum next to the product term. Even if $\partial h_i/\partial h_{i-1} = 0$ at some time step $i$, this factor is recomputed across all the time steps, so the gradient may still not vanish entirely: $z_i = r_i = 0$ does not necessarily hold at every time step. Remember that the $z$ and $r$ values depend on the inputs, the weights, the sigmoid activation, the hidden-state values, and the biases. Please see this source (dbs.ifi.lmu.de/Lehre/DLAI/WS18-19/script/05_rnns.pdf), slide 70. It covers BPTT for a simple RNN, but it is still useful to look at. $\endgroup$ Commented Apr 1, 2022 at 0:06
  • $\begingroup$ To answer your 2nd question: with a GRU, the hope is that you can learn the long-term dependencies in a given task, but when you update the weights it is still possible to encounter a vanishing gradient at some time step $t$. Check the very first link I sent; they explain how this vanishing gradient can arise in a GRU. In my opinion, the real question is: after encountering this vanishing gradient at a certain time step $t$, can gradient descent still change the weights so that the true dynamics of the task are learned? I can't answer that question myself yet. $\endgroup$ Commented Apr 4, 2022 at 2:57
  • $\begingroup$ To answer your 3rd question: another question you should ask yourself is, should a GRU learn to forget? In certain tasks it can be useful to forget or ignore certain inputs; in other words, you may not want the inputs at certain time steps to affect your final answer. Think of the movement signal of a creature and a GRU trying to guess its age. Say the network really should pay attention to the maximum speed to guess the age accurately, and one of the creatures decides to stop moving for a long time: the GRU should learn to attend to the max-speed part of the sequence and to ignore the rest. $\endgroup$ Commented Apr 4, 2022 at 3:17
  • $\begingroup$ Great. Their arguments made sense to me when I read their paper; however, if for some reason they stop making sense to you and you figure out why, I would really appreciate it if you let me know about your findings. Also, I'm not sure about your end goal, but if you are dealing with language tasks it may be worth looking at BERT or similar models, as they seem to be the current state of the art for language. Additionally, remember that there is also the exploding gradient issue (a larger gradient is not necessarily a good thing). $\endgroup$ Commented Apr 4, 2022 at 18:37
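The sum-next-to-the-product structure of the BPTT gradient discussed in the comments can be sketched concretely. This is a minimal illustration for a scalar *simple* RNN (as on the linked slides), not a GRU; all numbers and names here are made up for the example. It shows that the gradient is a sum over time steps of products of per-step Jacobian factors, verified against a finite-difference check:

```python
# Toy scalar RNN: h_t = tanh(w*h_{t-1} + u*x_t), loss L = h_T.
# BPTT gives dL/du = sum_t [ prod_{i=t+1..T} dh_i/dh_{i-1} ] * (dh_t/du),
# so one near-zero Jacobian factor at step k only suppresses the terms
# with t < k; contributions from later time steps survive in the sum.
import math

def bptt_grad_u(w, u, xs, h0=0.0):
    # Forward pass, caching pre-activations a_t = w*h_{t-1} + u*x_t.
    hs, pre = [h0], []
    for x in xs:
        a = w * hs[-1] + u * x
        pre.append(a)
        hs.append(math.tanh(a))
    # Backward pass: dL/dh_T = 1 because L = h_T.
    grad_u, dL_dh = 0.0, 1.0
    for t in range(len(xs) - 1, -1, -1):
        da = dL_dh * (1.0 - math.tanh(pre[t]) ** 2)  # through tanh
        grad_u += da * xs[t]                         # local dh_t/du term
        dL_dh = da * w                               # dh_t/dh_{t-1} factor
    return grad_u

# Finite-difference sanity check on an arbitrary input sequence.
xs = [0.5, -1.0, 0.8, 0.3]
w, u, eps = 0.9, 0.4, 1e-6

def loss(u_):
    h = 0.0
    for x in xs:
        h = math.tanh(w * h + u_ * x)
    return h

numeric = (loss(u + eps) - loss(u - eps)) / (2 * eps)
analytic = bptt_grad_u(w, u, xs)
print(abs(analytic - numeric) < 1e-6)  # the two gradients agree
```

In a GRU the per-step factor $\partial h_i/\partial h_{i-1}$ is not a single `tanh` derivative but depends on the gates $z_i$ and $r_i$, which is why a zero factor at one step does not require the whole sum to vanish.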