Positional Encoding: one of the most clever engineering decisions in the Attention Is All You Need paper
Positional Encoding:
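- The attention mechanism is permutation invariant with respect to word order (strictly, permutation equivariant): the score between two words is the same wherever they sit in the sentence. E.g. "dog bites man" and "man bites dog" would have the same attention score for every pair of words. Why? The score is $q_i \cdot k_j / \sqrt{d_k}$, and neither the query $q_i$ nor the key $k_j$ depends on the position of the word. Worth tracing the formula on this example.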
- To overcome this, we can add the position to the embedding, i.e. the word embedding $x_i$ now becomes $x_i + p_i$, where $p_i$ is the positional encoding for position $i$.
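To trace the claim on the example above, here is a minimal sketch in plain Python. The 2-dimensional embeddings, the position vectors, and the names `EMB`, `POS` and `scores` are made up for illustration, and the query/key projections are assumed to be the identity to keep it short.

```python
import math

# Toy 2-d embeddings (hypothetical values, chosen only for illustration).
EMB = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

# Toy position vectors p_0, p_1, p_2 (also hypothetical).
POS = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scores(sentence, use_position=False):
    """Scaled dot-product scores q_i . k_j / sqrt(d), keyed by the word pair.

    Assumes identity query/key projections, so q_i = k_i = input vector.
    """
    words = sentence.split()
    vecs = []
    for i, w in enumerate(words):
        v = EMB[w]
        if use_position:
            v = [a + b for a, b in zip(v, POS[i])]  # x_i + p_i
        vecs.append(v)
    d = len(vecs[0])
    return {
        (wi, wj): dot(vi, vj) / math.sqrt(d)
        for wi, vi in zip(words, vecs)
        for wj, vj in zip(words, vecs)
    }


a, b = "dog bites man", "man bites dog"
print(scores(a) == scores(b))              # without positions
print(scores(a, True) == scores(b, True))  # with positions added
```

Under these assumptions the first comparison prints `True` (every word pair gets the same score in both sentences) and the second prints `False` (once $p_i$ is added, the scores depend on where each word sits).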
Approaches:
- Naive: We concatenate an increasing position index to each embedding.
- e.g. $[x_1, 1]$, $[x_2, 2]$, $[x_3, 3]$, ..., where $x_i$ is the embedding of the $i$-th word.
- Problem: Numbers like these become large quickly in long sequences and can skew the attention scores, because a large position value dominates the dot product.
- Additionally, models don't have an intuition for these discrete numbers the way humans do.
- So we can't use discrete, unbounded values.
- We want continuous, bounded values that also convey ordering and a sense of distance.
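A rough sketch of the skew described above. The 4-dimensional embeddings and the helper `with_index` are hypothetical; the point is only to compare the size of the content term with the size of the position term in a dot product.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 4-d word embeddings with small entries.
x_a = [0.2, -0.1, 0.4, 0.3]
x_b = [-0.3, 0.5, 0.1, -0.2]


def with_index(x, pos):
    """Naive scheme: append the raw position index as one extra feature."""
    return x + [float(pos)]


content_score = dot(x_a, x_b)
early = dot(with_index(x_a, 1), with_index(x_b, 2))
late = dot(with_index(x_a, 400), with_index(x_b, 401))

print(content_score)  # contribution of the word content alone
print(early, late)    # the position term adds pos_a * pos_b on top
```

The position term is the product of the two indices, so for words deep into a long sequence it swamps whatever the word content contributes.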
- Sine Function:
- They provide bounded, continuous values, e.g. $\sin(1)$, $\sin(2)$, $\sin(3)$, ...
- But since they are periodic, there can be distinct positions $a$ and $b$ such that $\sin(a) = \sin(b)$ (or values too close to tell apart), which can cause confusion. So uniqueness of position can't be maintained.
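To see how quickly a single sine value runs into this, the sketch below encodes positions 0 to 99 as `sin(pos)` and looks for the two positions whose encodings are closest. The range of 100 positions is an arbitrary choice for illustration.

```python
import math

# Encode positions 0..99 with a single value sin(pos).
values = [(math.sin(pos), pos) for pos in range(100)]
values.sort()

# Find the two distinct positions whose encodings are closest together.
gap, pos_a, pos_b = min(
    (hi[0] - lo[0], lo[1], hi[1]) for lo, hi in zip(values, values[1:])
)

print(pos_a, pos_b, gap)
```

With 100 values squeezed into the interval [-1, 1], some pair must lie within about 0.02 of each other, so two different positions end up with nearly indistinguishable encodings.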
- Two vector (Sine and Cosine):
- We can use two vectors for positional encoding.
- e.g. $[\sin(pos), \cos(pos)]$
- This still doesn't ensure uniqueness. The pair repeats periodically.
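A small check of the repetition, assuming the pair is $[\sin(pos), \cos(pos)]$ with integer positions. Integer positions never repeat exactly, but they can land almost on top of each other; the choice of position 710 below relies on 710 being very close to a multiple of $2\pi$.

```python
import math


def pair(pos):
    """Two-value encoding [sin(pos), cos(pos)]."""
    return [math.sin(pos), math.cos(pos)]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# 710 is very close to 226 * pi, i.e. 113 full periods of 2 * pi,
# so position 710 lands almost exactly on top of position 0.
print(pair(0), pair(710))
print(distance(pair(0), pair(710)))

# For comparison: neighbouring positions 0 and 1 are clearly separated.
print(distance(pair(0), pair(1)))
```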
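- Four vectors [$\sin(\omega_1 pos)$, $\cos(\omega_1 pos)$, $\sin(\omega_2 pos)$, $\cos(\omega_2 pos)$], i.e. sine/cosine pairs with two different frequencies $\omega_1$ and $\omega_2$: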
- This further reduces the likelihood of repetition. is better than for this case too.
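- What if 4 vectors can't ensure uniqueness?
- What if we make the encoding long enough that every position in a sufficiently long sequence gets a unique positional encoding?
- [$\sin(\omega_1 pos)$, $\cos(\omega_1 pos)$, $\sin(\omega_2 pos)$, $\cos(\omega_2 pos)$, $\sin(\omega_3 pos)$, $\cos(\omega_3 pos)$, ...] with different frequencies $\omega_1, \omega_2, \omega_3, \ldots$
- This gives a 512-dimensional positional encoding.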
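The effect of adding frequencies can be sketched as follows. The frequencies `1.0` and `0.1` and the helper `encode` are assumptions for illustration, not the values used in the paper. Positions 0 and 710 nearly collide when only frequency 1 is used (710 is close to a multiple of $2\pi$), and a second, slower frequency pulls them apart.

```python
import math


def encode(pos, freqs):
    """Concatenate [sin(w * pos), cos(w * pos)] for each frequency w."""
    out = []
    for w in freqs:
        out += [math.sin(w * pos), math.cos(w * pos)]
    return out


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Frequencies are assumptions for illustration, not the paper's values.
one_freq = [1.0]
two_freqs = [1.0, 0.1]

# Distance between the encodings of positions 0 and 710.
d1 = distance(encode(0, one_freq), encode(710, one_freq))
d2 = distance(encode(0, two_freqs), encode(710, two_freqs))
print(d1, d2)
```

Each extra frequency makes it less likely that two positions line up in every pair at once, which is the motivation for using many of them.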
Mathematical Derivation of Approach 4:
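- We have sin at the even indices and cos at the odd indices:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$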
Where,
- i: index of the dimension pair (dimensions 2i and 2i+1 share one frequency)
- pos: position of the word in the sentence
- d_model: dimension of the model (embedding size): 512 in the original Transformer paper.
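- The authors hypothesize that if we know $PE_{pos}$, we can predict $PE_{pos+k}$ for a fixed offset $k$, i.e. there exists a matrix $M_k$, depending only on $k$ and not on $pos$, such that
$$PE_{pos+k} = M_k \, PE_{pos}$$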
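The encoding and the linear-shift property can be checked numerically. The sketch below is one way to write it in plain Python; the function names `positional_encoding` and `shift` are my own. `shift` applies a block-diagonal matrix made of 2 x 2 rotations, which follows from the angle-addition identities for sine and cosine.

```python
import math


def positional_encoding(pos, d_model=512):
    """Sinusoidal encoding: sin at even indices, cos at odd indices."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe


def shift(pe, k, d_model=512):
    """Apply the block-diagonal matrix M_k to PE(pos), giving PE(pos + k).

    Each (sin, cos) pair is rotated by the angle w_i * k, which depends
    on the offset k and the frequency w_i but not on pos.
    """
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out += [
            s * math.cos(w * k) + c * math.sin(w * k),  # sin(w * (pos + k))
            c * math.cos(w * k) - s * math.sin(w * k),  # cos(w * (pos + k))
        ]
    return out


pos, k = 10, 3
predicted = shift(positional_encoding(pos), k)
actual = positional_encoding(pos + k)
max_error = max(abs(a - b) for a, b in zip(predicted, actual))
print(len(actual), max_error)
```

The two vectors agree up to floating-point error, and nothing inside `shift` uses `pos`, so the same matrix maps any position to the one `k` steps later.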
Q: But why addition and not concatenation? With concatenation, we can have separate parameters for position and word. A: The word embedding $x_i$ is already 512-dimensional. Concatenating a 512-dimensional positional encoding $p_i$ makes it 1024-dimensional.
- We also need to multiply the embedding by W, which is 512 x 512. With concatenation W would have to be 1024 x 512, which makes the compute requirements much higher.
- One solution is to use addition. But wouldn't the positional encoding distort the word embedding?
- The intuition is that the sine and cosine values have a fixed, regular structure, so the model can learn to separate them from the word embedding instead of being confused by them.
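The cost argument is simple arithmetic, sketched below for a single projection matrix W. The sizes come from the text above; the variable names are my own, and a real model has several such matrices per layer.

```python
d_word = 512   # word embedding size (from the original Transformer)
d_pos = 512    # positional encoding size
d_out = 512    # output size of the projection W

# Addition keeps the input at 512 dimensions, so W stays 512 x 512.
params_add = d_word * d_out

# Concatenation doubles the input to 1024 dimensions, so W becomes 1024 x 512.
params_concat = (d_word + d_pos) * d_out

print(params_add, params_concat, params_concat / params_add)
```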
Q: What should the positional encoding $p_i$ look like? A: Unique, bounded in magnitude, deterministic (repeatable), and able to generalize to unseen sequence lengths.
Q: What kind of position should the model know? Absolute, relative, or both? What kind of patterns in language are position-dependent? A: Relative, i.e. about distance and direction.
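One way to see that the sinusoidal encoding carries relative information: the dot product of two encodings depends only on the offset between the positions, because $\sin(a)\sin(b) + \cos(a)\cos(b) = \cos(a - b)$. The sketch below checks this numerically; `d_model=64` and the chosen positions are arbitrary values for the demo.

```python
import math


def positional_encoding(pos, d_model=64):
    """Sinusoidal encoding; d_model=64 is an arbitrary small size for the demo."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Same offset (5) at two different absolute positions.
near_start = dot(positional_encoding(3), positional_encoding(8))
further_on = dot(positional_encoding(20), positional_encoding(25))

# A larger offset (15), for comparison.
far_apart = dot(positional_encoding(3), positional_encoding(18))

print(near_start, further_on, far_apart)
```

The first two values match up to floating-point error even though the absolute positions differ, while the third changes with the offset. Note that this dot product is symmetric in the two positions, so on its own it reflects distance rather than direction.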