My understanding is that all tokens are passed to a transformer at once, and positional encodings help it understand their order in the sequence. And the sinusoidal (sine/cosine) type of positional encoding helps capture both short-term and long-term dependencies between tokens, since different encoding dimensions use different frequencies.
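(For concreteness, here is a minimal NumPy sketch of the sinusoidal encoding I mean, following the sin/cos formula from "Attention Is All You Need"; the function name is my own.)

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims,
    with frequencies decreasing geometrically across dimensions."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # early dims oscillate fast  -> fine-grained position info
    pe[:, 1::2] = np.cos(angles)  # later dims oscillate slowly -> coarse, long-range info
    return pe

# Example: encodings for a 4-token sequence with model dimension 8
print(sinusoidal_positional_encoding(4, 8).round(3))
```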
First doubt: why doesn't word ordering implicitly emerge from the multi-head attention mechanism? My thought was that, after training a transformer on tons of data, the mechanism would eventually learn to look at an unordered set of words and infer their order. Another thing: isn't order already implicit in how the rows of the embedding and Q, K, V matrices are ordered, i.e. the 1st row corresponds to the 1st token, the i-th row to the i-th token, and so on? (A toy sketch of the kind of check I have in mind is below.)
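Here is that check as a small NumPy sketch (single attention head, random weights, no positional encodings; all names are mine): permute the rows of the embedding matrix and see whether the attention output changes, or just gets permuted along with it.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)           # numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))                                # row i = embedding of token i
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 0, 1]                                           # shuffle the token order
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Prints True if attention is blind to which row a token sits in,
# i.e. the output rows are simply permuted the same way as the input rows.
print(np.allclose(out[perm], out_perm))
```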
Or is it that, by using positional encodings, we take some of that "learning load" off the attention mechanism, leaving it free to focus on inter-token dependencies? That explanation seems very hand-wavy to me, hence I thought I should check with you all.
Second doubt: even if positional encoding is needed, why can't we just use the normalized position of each token in the sequence? e.g. "My name is Bob" becomes [0, 1, 2, 3], and this vector can be normalized to give the positional encoding (small sketch below). Multi-head self-attention is surely more than capable of capturing short-, medium- and long-range token dependencies on its own. So why go to the effort of using sinusoidal encodings?
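A tiny sketch of the naive scheme I have in mind (the function name is mine):

```python
import numpy as np

def normalized_position_encoding(seq_len):
    """The naive scheme I'm proposing: position index scaled into [0, 1].

    Note this yields a single scalar per token, whereas the sinusoidal
    scheme above yields a d_model-dimensional vector per token.
    """
    return np.arange(seq_len) / max(seq_len - 1, 1)

# "My name is Bob" -> positions [0, 1, 2, 3] -> normalized [0, 1/3, 2/3, 1]
print(normalized_position_encoding(4))
```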