
I have a question about the Transformer decoder's forward pass during training.

Let's take an example: the input is "i love the sun" and the Italian translation I want to predict is "io amo il sole".

Now I feed the encoder with the input "i love the sun" and get the hidden states. Then I have to do multiple forward passes on the decoder with the input "BOS io amo il", where BOS is a token that stands for beginning of sentence. So I assume the forward passes are:

  • [BOS, IO, AMO, IL] -> decoder -> IO
  • [BOS, IO, AMO, IL] -> decoder -> AMO
  • [BOS, IO, AMO, IL] -> decoder -> IL
  • [BOS, IO, AMO, IL] -> decoder -> SOLE

I think this is the correct way, and what differentiates the passes during training, I think, is the masked attention mechanism. Is it right to assume that the masks would be:

  • first forward pass: [1 0 0 0, 0 0 0 0, 0 0 0 0, 0 0 0 0]
  • second forward pass: [1 0 0 0, 1 1 0 0, 0 0 0 0, 0 0 0 0]
  • third forward pass: [1 0 0 0, 1 1 0 0, 1 1 1 0, 0 0 0 0]
  • fourth forward pass: [1 0 0 0, 1 1 0 0, 1 1 1 0, 1 1 1 1]

Is this the correct way, or what should be different? If you can also provide a Python implementation, that would be useful. Thanks in advance.


1 Answer


There are some problems with your description:

  • During training, the decoder receives all of the target tokens shifted one position to the right, with the BOS token prepended; you left out sole. The actual decoder input would be: [<bos>, io, amo, il, sole]. Note that the expected output at the position of sole is the end-of-sequence token <eos>.

  • During training, there is a single forward pass (not one per token), and all the output tokens are predicted at once. Therefore, only the last of your attention masks is used (see the sketch after this list).

  • During inference, we don't have the target tokens (because that is what we are trying to predict). In this case, we have one pass per generated token, starting with <bos>. This way, the decoder input in the first step would just be the sequence [<bos>], and we would predict the first token: io. Then, we would prepare the input for the next timestep as [<bos>, io], and then we would obtain the prediction for the second token. And so on. Note that, at each timestep, we are repeating the computations for the past positions; in real implementations, these states are cached instead of re-computed each timestep.
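
To make the first two points concrete, below is a minimal PyTorch sketch of one training step with teacher forcing. It uses a toy vocabulary and made-up model sizes, and omits positional encodings and padding, so it only illustrates the shifted decoder input and the single masked forward pass, not a production implementation.

    import torch
    import torch.nn as nn

    # Toy vocabulary and sizes, made up purely for illustration.
    vocab = {"<bos>": 0, "<eos>": 1, "io": 2, "amo": 3, "il": 4, "sole": 5,
             "i": 6, "love": 7, "the": 8, "sun": 9}

    src_tokens = ["i", "love", "the", "sun"]
    tgt_tokens = ["io", "amo", "il", "sole"]

    # Teacher forcing: decoder input = target shifted right (<bos> prepended),
    # labels = target followed by <eos>.
    dec_in = ["<bos>"] + tgt_tokens            # [<bos>, io, amo, il, sole]
    labels = tgt_tokens + ["<eos>"]            # [io, amo, il, sole, <eos>]

    src  = torch.tensor([[vocab[t] for t in src_tokens]])   # (1, src_len)
    tgt  = torch.tensor([[vocab[t] for t in dec_in]])       # (1, tgt_len)
    gold = torch.tensor([[vocab[t] for t in labels]])       # (1, tgt_len)

    d_model = 32
    embed = nn.Embedding(len(vocab), d_model)
    model = nn.Transformer(d_model=d_model, nhead=4,
                           num_encoder_layers=2, num_decoder_layers=2,
                           dim_feedforward=64, batch_first=True)

    # Causal mask: position i may only attend to positions <= i (your fourth mask).
    # PyTorch expects an additive float mask: 0 where attending is allowed, -inf elsewhere.
    tgt_len = tgt.size(1)
    tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

    # A single forward pass produces one hidden state per decoder position
    # (positional encodings are omitted here for brevity).
    out = model(embed(src), embed(tgt), tgt_mask=tgt_mask)   # (1, tgt_len, d_model)
    logits = out @ embed.weight.T                            # (1, tgt_len, vocab_size)

    # One loss term per position, all obtained from the same forward pass.
    loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), gold.reshape(-1))
    loss.backward()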

As for Python code illustrating how the Transformer works, I suggest The Annotated Transformer, which is a nice walkthrough of a real implementation. You may be most interested in the function run_epoch for training and in the function greedy_decode for inference.
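
For completeness, here is a matching greedy-decoding sketch for inference. It reuses the toy model, embed, and vocab from the training sketch above, re-computes the past positions at every step (no caching), and of course produces arbitrary tokens while the model is untrained.

    def greedy_decode(model, embed, vocab, src_ids, max_len=10):
        """Generate one token per step, starting from <bos>, until <eos> or max_len."""
        inv_vocab = {i: t for t, i in vocab.items()}
        memory_in = embed(src_ids)                       # (1, src_len, d_model)
        dec_ids = torch.tensor([[vocab["<bos>"]]])       # first step: just [<bos>]
        for _ in range(max_len):
            tgt_len = dec_ids.size(1)
            tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")),
                                  diagonal=1)
            out = model(memory_in, embed(dec_ids), tgt_mask=tgt_mask)
            logits = out[:, -1] @ embed.weight.T         # only the newest position matters
            next_id = logits.argmax(dim=-1, keepdim=True)
            dec_ids = torch.cat([dec_ids, next_id], dim=1)   # grow the decoder input
            if next_id.item() == vocab["<eos>"]:
                break
        return [inv_vocab[i] for i in dec_ids[0].tolist()]

    model.eval()
    with torch.no_grad():
        print(greedy_decode(model, embed, vocab, src))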

  • During inference, should the input for the tokens not yet considered (the future ones) be all 0s? Commented Mar 5, 2021 at 11:12
  • During inference, the future input tokens simply do not exist yet. We don't know how long the sentence will be until we reach the <eos> prediction. The sequence length at the first timestep is 1, at the second timestep is 2, and so on, until <eos> is predicted. Commented Mar 5, 2021 at 11:38
  • Did something in the answer cause its unacceptance? Commented Mar 9, 2021 at 16:53
  • Sorry, I think I misclicked on it. But I also have another question: if we have to predict everything at once during training, does that mean there is a feed-forward network for each output token at the end of the decoder? Commented Mar 10, 2021 at 19:38
  • There is a single feed-forward network shared by all of the positions, and it is applied to each of them at the same time. Commented Mar 10, 2021 at 19:45
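
To illustrate that last comment with a minimal sketch (sizes are arbitrary): the position-wise feed-forward block is one set of weights applied to every position of the sequence in a single call.

    import torch
    import torch.nn as nn

    d_model, d_ff, tgt_len = 32, 64, 5      # arbitrary illustrative sizes
    ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    x = torch.randn(1, tgt_len, d_model)    # one hidden state per target position
    y = ffn(x)                              # the same weights act on every position
    assert y.shape == (1, tgt_len, d_model)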
