Transformer for neural machine translation: is it possible to predict each word in the target sentence in a single forward pass?

I want to replicate the Transformer from the paper Attention Is All You Need in PyTorch. My question is about the decoder branch of the Transformer. If I understand correctly, given a sentence in the source language and a partial/incomplete translation in the target language, the Transformer is tasked with predicting the next token of the translation. For example:

English (source): I love eating chocolate

Spanish (target): Yo amo comer ...

The next token in the translation should be chocolate (thus, the full translation would be "Yo amo comer chocolate"). So, in this example, the encoder would process the sentence "I love eating chocolate", the decoder would process the partial translation "Yo amo comer", and the final output would be a softmax over the whole vocabulary (hopefully with chocolate being the token with the highest score).
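For concreteness, this is roughly how I picture that single prediction step in PyTorch. The module names, shapes and sizes below are just my own sketch using torch.nn.Transformer, not something taken from the paper:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 10000                 # assumed sizes
    embed = nn.Embedding(vocab_size, d_model)        # shared token embedding (sketch)
    transformer = nn.Transformer(d_model=d_model, nhead=8)
    to_vocab = nn.Linear(d_model, vocab_size)        # projection to vocabulary scores

    # token ids, shape (sequence_length, batch) since batch_first defaults to False
    src = torch.randint(vocab_size, (4, 1))          # "I love eating chocolate"
    tgt = torch.randint(vocab_size, (3, 1))          # "Yo amo comer"

    dec_out = transformer(embed(src), embed(tgt))    # (3, 1, d_model)
    logits = to_vocab(dec_out[-1])                   # scores for the *next* token
    probs = logits.softmax(dim=-1)                   # hopefully peaks at "chocolate"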

The issue is, during training we want the Transformer to learn the full translation. This means that if the target sentence has length N, we want the Transformer to predict the first word, the second word, the third word, and so on, all the way up to the N-th word of the target sentence. One way to do that is by generating N training instances (one for each word of a target sentence of length N). However, this approach is computationally quite expensive, because instead of having a single training instance per source-target pair, we now have as many training instances as there are words in all the target sentences. So I was wondering if it's possible to make the predictions for all words in the target sentence in a single forward pass. Is that possible?
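If it is possible, I imagine the forward pass might look roughly like the sketch below: feed the whole target shifted right, apply a causal ("subsequent") mask so each position only attends to earlier positions, and get one prediction per target position. The shift/mask handling here is my guess, not something I have verified:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 10000                 # assumed sizes
    embed = nn.Embedding(vocab_size, d_model)
    transformer = nn.Transformer(d_model=d_model, nhead=8)
    to_vocab = nn.Linear(d_model, vocab_size)

    src = torch.randint(vocab_size, (6, 2))          # source ids, (S, batch)
    tgt = torch.randint(vocab_size, (5, 2))          # full target ids, (T, batch)

    tgt_in  = tgt[:-1]                               # decoder input: all but last token
    tgt_out = tgt[1:]                                # labels: all but first token
    causal = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(0))

    out = transformer(embed(src), embed(tgt_in), tgt_mask=causal)  # (T-1, batch, d_model)
    logits = to_vocab(out)                                         # one prediction per position
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tgt_out.reshape(-1))

Is this the intended way to train the model, i.e. does the causal mask make a single forward pass equivalent to the N separate training instances I described above?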