Positional Encoding: one of the most clever engineering decisions in the Attention Is All You Need paper

June 3, 2026 · positional-encoding, transformers

Positional Encoding:

Attention mechanism is permutation invariant. It is same for a given set of words. Eg: "dog bites man" and "man bites dog" would have same attention scores. ? Why? Worth tracing the formula on this example.
To overcome this, we can add the position to the embedding. i.e $e_i$ now becomes $e_i + p_i$ , where $p_i$ is the positional encoding for position $i$ .

Approaches:

Naive: We concatenate increasing order of numbers to embeddings.
- eg: $e_i + 1$ , $e_{(i+1)} + 2$ , $e_{(i+2)} + 3$ , ...
- Problem: Numbers like these can become large easily and can skew the attention scores. eg $e_i + 10000$ .
- Additionally, models doesn't have intuition of these discrete numbers like humans do.
- So we can't use discrete, unbounded values.
- We want continuous, bounded values that provide ordering or nuance of distance too.
Sine Function:
- They provide boudned, continuous values. Eg, $\sin(1)$ , $\sin(2)$ , $\sin(3)$ , ...
- But, since they are periodic, there can be numbers such that $\sin(x) = \sin(y)$ , which can cause confusion. So uniqueness of position can't be maintained.
Two vector (Sine and Cosine):
- We can use two vectors for positional encoding.
- eg: $p_i = [\sin(i), \cos(i)]$
- This still doesn't ensure uniqueness. The $[\sin(i), \cos(i)]$ pair repeats periodically.
Four vectors [ $\sin$ , $\cos$ , $\sin(i/2)$ , $\cos(i/2)$ ] i.e with different frequencies:
- This further reduces the likelihood of repetition. $i/2$ is better than $2i$ for this case too.
- If 4 vectors can't ensure uniquess?
- What if we use a long enough sequence such that, a unique positional encoding is generated for sufficiently long sequence?
- [ $\sin(i)$ , $\cos(i)$ , $\sin(i/2)$ , $\cos(i/2)$ , $\sin(i/3)$ , $\cos(i/3)$ , ...] with different frequencies.
- 512 dimensional positional encoding

Mathematical Derivation of Approach 4:

We have sin in even index and cos in odd index.

\begin{aligned} PE(pos, 2i) &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\ PE(pos, 2i+1) &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{aligned}

Where,

i: index of the dimension
pos: position of the word in the sentence
d_model: dimension of the model (embedding size): 512 in original transformer paper.
The authors hypothesize that, if we know $PE_{(pos)}$ , we can predic the $PE_{(pos+k)}$ for small $k$ .

i.e there exists a $T_k$ matrix, such that $PE_{(pos+k)} = T_k \cdot PE_{(pos)}$

Q: But why addition and not concatenation? With concatenation, we can have separate parameters for position and word. A: $e_i$ is already 512 dimension. Concatenating $p_i$ of 512 makes it 1024 dimension.

We also need to multipiply the embedding with W which is also 512 x 512. Concatenating would make the compute requirements much higher.
One solution is to use addition. But wouldn't positional encoding distort the word embedding?
The sine and cosine functions have a definitive structure that don't interfere with the word embedding.

Q: What should $p_i$ look like? A: Unique,bounded magnitude, repeatable or deterministic and generalize to unseen sequence lengths.

Q: What kind of position should the model know? Absolute or relative? or both? What kind of patterns in language are position-dependent. A: Relavtive. I.e about distance and direction.