Transformer for neural machine translation: is it possible to predict each word in the target sentence in a single forward pass?

I want to replicate the Transformer from the paper Attention Is All You Need in PyTorch. My question is about the decoder branch of the Transformer. If I understand correctly, given a sentence in the source language and a partial/incomplete translation in the target language, the Transformer is tasked with predicting the next token of the translation. For example:

English (source): I love eating chocolate

Spanish (target): Yo amo comer ...

The next token in the translation should be chocolate (thus, the full translation would be "Yo amo comer chocolate"). So, in this example, the encoder would process the sentence "I love eating chocolate", the decoder would process the partial translation "Yo amo comer", and the final output would be a softmax over the whole vocabulary (hopefully with chocolate being the token with the highest score).
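For concreteness, this is roughly how I picture that single prediction step in PyTorch. The module names, shapes and sizes below are just my own sketch using torch.nn.Transformer, not something taken from the paper:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 10000                 # assumed sizes
    embed = nn.Embedding(vocab_size, d_model)        # shared token embedding (sketch)
    transformer = nn.Transformer(d_model=d_model, nhead=8)
    to_vocab = nn.Linear(d_model, vocab_size)        # projection to vocabulary scores

    # token ids, shape (sequence_length, batch) since batch_first defaults to False
    src = torch.randint(vocab_size, (4, 1))          # "I love eating chocolate"
    tgt = torch.randint(vocab_size, (3, 1))          # "Yo amo comer"

    dec_out = transformer(embed(src), embed(tgt))    # (3, 1, d_model)
    logits = to_vocab(dec_out[-1])                   # scores for the *next* token
    probs = logits.softmax(dim=-1)                   # hopefully peaks at "chocolate"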

The issue is, during training we want the Transformer to learn the full translation. This means that if the target sentence has length N, we want the Transformer to predict the first word, the second word, the third word, and so on, all the way up to the N-th word of the target sentence. One way to do that is by generating N training instances (one for each word of a target sentence of length N). However, this approach is computationally quite expensive, because instead of having a single training instance per source-target pair, we now have as many training instances as there are words in all the target sentences. So I was wondering if it's possible to make the predictions for all words in the target sentence in a single forward pass. Is that possible?
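If it is possible, I imagine the forward pass might look roughly like the sketch below: feed the whole target shifted right, apply a causal ("subsequent") mask so each position only attends to earlier positions, and get one prediction per target position. The shift/mask handling here is my guess, not something I have verified:

    import torch
    import torch.nn as nn

    d_model, vocab_size = 512, 10000                 # assumed sizes
    embed = nn.Embedding(vocab_size, d_model)
    transformer = nn.Transformer(d_model=d_model, nhead=8)
    to_vocab = nn.Linear(d_model, vocab_size)

    src = torch.randint(vocab_size, (6, 2))          # source ids, (S, batch)
    tgt = torch.randint(vocab_size, (5, 2))          # full target ids, (T, batch)

    tgt_in  = tgt[:-1]                               # decoder input: all but last token
    tgt_out = tgt[1:]                                # labels: all but first token
    causal = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(0))

    out = transformer(embed(src), embed(tgt_in), tgt_mask=causal)  # (T-1, batch, d_model)
    logits = to_vocab(out)                                         # one prediction per position
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tgt_out.reshape(-1))

Is this the intended way to train the model, i.e. does the causal mask make a single forward pass equivalent to the N separate training instances I described above?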