Positional Encoding: one of the most clever engineering decisions in the Attention Is All You Need paper

Positional Encoding:

  • Attention mechanism is permutation invariant. It is same for a given set of words. Eg: "dog bites man" and "man bites dog" would have same attention scores. ? Why? Worth tracing the formula on this example.
  • To overcome this, we can add the position to the embedding. i.e eie_i now becomes ei+pie_i + p_i, where pip_i is the positional encoding for position ii.

Approaches:

  1. Naive: We concatenate increasing order of numbers to embeddings.

    • eg: ei+1e_i + 1, e(i+1)+2e_{(i+1)} + 2, e(i+2)+3e_{(i+2)} + 3, ...
    • Problem: Numbers like these can become large easily and can skew the attention scores. eg ei+10000e_i + 10000.
    • Additionally, models doesn't have intuition of these discrete numbers like humans do.
    • So we can't use discrete, unbounded values.
    • We want continuous, bounded values that provide ordering or nuance of distance too.
  2. Sine Function:

    • They provide boudned, continuous values. Eg, sin(1)\sin(1), sin(2)\sin(2), sin(3)\sin(3), ...
    • But, since they are periodic, there can be numbers such that sin(x)=sin(y)\sin(x) = \sin(y), which can cause confusion. So uniqueness of position can't be maintained.
  3. Two vector (Sine and Cosine):

    • We can use two vectors for positional encoding.
    • eg: pi=[sin(i),cos(i)]p_i = [\sin(i), \cos(i)]
    • This still doesn't ensure uniqueness. The [sin(i),cos(i)][\sin(i), \cos(i)] pair repeats periodically.
  4. Four vectors [sin\sin, cos\cos, sin(i/2)\sin(i/2), cos(i/2)\cos(i/2)] i.e with different frequencies:

    • This further reduces the likelihood of repetition. i/2i/2 is better than 2i2i for this case too.
    • If 4 vectors can't ensure uniquess?
    • What if we use a long enough sequence such that, a unique positional encoding is generated for sufficiently long sequence?
    • [sin(i)\sin(i), cos(i)\cos(i), sin(i/2)\sin(i/2), cos(i/2)\cos(i/2), sin(i/3)\sin(i/3), cos(i/3)\cos(i/3), ...] with different frequencies.
    • 512 dimensional positional encoding

Mathematical Derivation of Approach 4:

  • We have sin in even index and cos in odd index.
PE(pos,2i)=sin(pos100002i/dmodel)PE(pos,2i+1)=cos(pos100002i/dmodel)\begin{aligned} PE(pos, 2i) &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\ PE(pos, 2i+1) &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{aligned}

Where,

  • i: index of the dimension

  • pos: position of the word in the sentence

  • d_model: dimension of the model (embedding size): 512 in original transformer paper.

  • The authors hypothesize that, if we know PE(pos)PE_{(pos)}, we can predic the PE(pos+k)PE_{(pos+k)} for small kk.

i.e there exists a TkT_k matrix, such that PE(pos+k)=TkPE(pos)PE_{(pos+k)} = T_k \cdot PE_{(pos)}

Q: But why addition and not concatenation? With concatenation, we can have separate parameters for position and word. A: eie_i is already 512 dimension. Concatenating pip_i of 512 makes it 1024 dimension.

  • We also need to multipiply the embedding with W which is also 512 x 512. Concatenating would make the compute requirements much higher.
  • One solution is to use addition. But wouldn't positional encoding distort the word embedding?
  • The sine and cosine functions have a definitive structure that don't interfere with the word embedding.

Q: What should pip_i look like? A: Unique,bounded magnitude, repeatable or deterministic and generalize to unseen sequence lengths.

Q: What kind of position should the model know? Absolute or relative? or both? What kind of patterns in language are position-dependent. A: Relavtive. I.e about distance and direction.