My understanding is that all tokens are passed to a transformer at once, and positional encodings help it understand their order in the sequence. And the sinusoidal (sine/cosine) type of positional encoding helps capture both short-term and long-term dependencies between tokens, since different encoding dimensions use different frequencies.
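(For concreteness, here is a minimal NumPy sketch of the sinusoidal encoding I mean, following the sin/cos formula from "Attention Is All You Need"; the function name is my own.)

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims,
    with frequencies decreasing geometrically across dimensions."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # early dims oscillate fast  -> fine-grained position info
    pe[:, 1::2] = np.cos(angles)  # later dims oscillate slowly -> coarse, long-range info
    return pe

# Example: encodings for a 4-token sequence with model dimension 8
print(sinusoidal_positional_encoding(4, 8).round(3))
```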
First doubt: why doesn't word ordering implicitly emerge from the multi-head attention mechanism? My thought was that, after training a transformer on tons of data, the mechanism would eventually learn to look at an unordered set of words and infer their order. Another thing: isn't order already implicit in how the rows of the embedding and Q, K, V matrices are ordered, i.e. the 1st row corresponds to the 1st token, the i-th row to the i-th token, and so on? (A toy sketch of the kind of check I have in mind is below.)
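Here is that check as a small NumPy sketch (single attention head, random weights, no positional encodings; all names are mine): permute the rows of the embedding matrix and see whether the attention output changes, or just gets permuted along with it.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)           # numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))                                # row i = embedding of token i
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 0, 1]                                           # shuffle the token order
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Prints True if attention is blind to which row a token sits in,
# i.e. the output rows are simply permuted the same way as the input rows.
print(np.allclose(out[perm], out_perm))
```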
Or is it that, by using positional encodings, we take some of that "learning load" off the attention mechanism, leaving it free to focus on inter-token dependencies? That explanation seems very hand-wavy to me, hence I thought I should check with you all.
Second doubt: even if positional encoding is needed, why can't we just use the normalized position of each token in the sequence? e.g. "My name is Bob" becomes [0, 1, 2, 3], and this vector can be normalized to give the positional encoding (small sketch below). Multi-head self-attention is surely more than capable of capturing short-, medium- and long-range token dependencies on its own. So why go to the effort of using sinusoidal encodings?
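A tiny sketch of the naive scheme I have in mind (the function name is mine):

```python
import numpy as np

def normalized_position_encoding(seq_len):
    """The naive scheme I'm proposing: position index scaled into [0, 1].

    Note this yields a single scalar per token, whereas the sinusoidal
    scheme above yields a d_model-dimensional vector per token.
    """
    return np.arange(seq_len) / max(seq_len - 1, 1)

# "My name is Bob" -> positions [0, 1, 2, 3] -> normalized [0, 1/3, 2/3, 1]
print(normalized_position_encoding(4))
```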