Positional Encoding

Sinusoids, RoPE, and why order matters.

Hard · ~15 min read · lesson 4 of 4

Picture the attention layer as a theater. Every token in your sentence is an audience member, and they're all shouting at once. The model's job is to listen — to figure out who's responding to whom, which voice belongs with which. So far so good. Now picture the same theater with the seats unmarked. No row letters. No numbers on the chairs. Just a crowd.

That's what attention sees. The self-attention operation is a weighted sum over the value vectors of every token in the sequence, and the weights come from dot products between queries and keys. None of that arithmetic cares about where a token sits. Shuffle every audience member in the room and the output comes out identical — just permuted. Attention, by construction, is a set operation. An unseated crowd.

That's fine for bag-of-words sentiment, where order is discardable. It is catastrophic for language. “Dog bites man” and “man bites dog” use the same three words and mean opposite things. If a transformer is to tell them apart, every audience member needs a sticker stapled to their shirt saying “you are in row 3.” Positional encoding is that sticker. How you design the sticker — the recipe that turns “row 3” into a vector the model can read — is the entire subject of this lesson.

Here's the first thing everyone tries and nobody ships: just staple a number to each seat. Row 0 gets the scalar 0, row 1 gets 1, row 511 gets 511. Clean, simple, done. It breaks almost immediately. A scalar of 511 has a vastly different magnitude from a scalar of 3, and you're adding this thing directly into a word embedding whose values sit in a tight range. The position number drowns the meaning. Normalize it to [0, 1]? Now the gap between position 5 and position 6 depends on how long the sequence is — the sticker changes meaning if someone walks in late.
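A toy numpy check makes the magnitude problem concrete (the 0.02 init scale is my assumption, typical of transformer embedding initialisations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
emb = rng.normal(0.0, 0.02, size=d)   # a word embedding at a typical init scale
pe_scalar = 511.0                     # the naive "seat number" for row 511

corrupted = emb + pe_scalar
print(f"mean |embedding|:        {np.abs(emb).mean():.4f}")
print(f"mean |embedding + 511|:  {np.abs(corrupted).mean():.1f}")  # the position drowns the meaning
```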

The fix Vaswani et al. (2017) chose is worth staring at, because it solves a specific problem: seat stickers need to be comparable at a distance. If the model is going to learn “pay attention to the word three rows back,” it has to be able to compute “three rows back” from whatever the sticker actually is. Not a scalar. A pattern. Something geometric.

For token position pos and embedding dimension index i:

sinusoidal positional encoding — Vaswani et al. 2017
PE(pos, 2i)     =  sin( pos / 10000^(2i/d) )

PE(pos, 2i+1)   =  cos( pos / 10000^(2i/d) )

Read that with the theater in mind. Each embedding dimension is one color on the sticker, each with its own frequency. Dimension 0 oscillates fast — its stripes repeat every ~6 seats. Dimension d−1 oscillates slowly — its wavelength is about 2π · 10000 seats. Between them you get a geometric sweep of wavelengths, every dimension tuned to a different resolution of position. The full vector at one seat is a unique combination of stripe colors and patterns — a barcode nobody else in the audience is wearing.
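The sweep is easy to print directly — the wavelength of each frequency pair, assuming d = 128 as in the formula above:

```python
import math

d = 128
for i in (0, 16, 32, 63):   # a few of the d/2 frequency pairs
    # wavelength of pair i: the sinusoid repeats every 2*pi * 10000^(2i/d) seats
    wavelength = 2 * math.pi * 10000 ** (2 * i / d)
    print(f"pair {i:2d}: repeats every ~{wavelength:,.0f} seats")
# pair 0 repeats every ~6 seats; pair 63 every ~54,000
```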

Two properties make this sticker clever rather than arbitrary. First: every seat is unique. The high-frequency dimensions are the seconds hand, the low-frequency ones are the hour hand, and together they stamp a one-of-a-kind barcode on every row for any reasonable sequence length. Second, and this is the one that earns its keep: because of the trig identities sin(a+b) = sin a cos b + cos a sin b and cos(a+b) = cos a cos b − sin a sin b, the sticker at seat pos + k is a linear function of the sticker at seat pos. The model can, in principle, learn a single linear operator that means “attend to the audience member k rows back.” Relative seating is baked into the geometry of the sticker, not just into the number.

[interactive heatmap: sinusoidal positional encoding — PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); red = +1, blue = −1. Click the heatmap to pick a row; e.g. dim 2 has period ≈ 6.3.]

That heatmap is the seating chart. Each column is one row of the theater (0 through 63); each row of the heatmap is one embedding dimension (0 through 127). The stripes run fast on top to slow on the bottom — exactly the geometric sweep the formula prescribes. Drag across columns and watch: neighbouring seats wear nearly-identical stickers, seats far apart wear wildly different ones. That gradient of similarity is the signal the model reads to work out who's sitting where.

The sticker is added to the token embedding, not concatenated — we're gluing “what the word means” and “which row it's in” into the same vector. First reaction: won't the seat pattern contaminate the meaning? In practice it doesn't, for roughly the reason two different radio stations on different frequencies can share one wire — different frequencies interfere minimally. The network learns, through training, to disentangle the two channels.

Position (personified)
The crowd forgot me. Without my sticker, every permutation of your sentence looks the same to attention — a bag of meanings, no grammar. I arrive as a vector of oscillations, a barcode stapled to each audience member's shirt, and once I'm added the network can finally tell row A from row Z. I am the order you forgot to include.

The Vaswani recipe is elegant but fixed. The next obvious move is to stop hand-designing the sticker and let the model print its own: allocate nn.Embedding(max_len, d), initialise randomly, backprop. This is a learned positional embedding, and it's what BERT and GPT-2 use. Every seat in the theater gets its own blank sticker at the start of training, and gradient descent colors them in.

The upside is simplicity — it's just another embedding table — and the model can, in theory, invent a barcode better suited to the data than anything a human would pick. The downside is harsh: a table of size max_len × d has no sticker for any seat past row max_len − 1. Train BERT with 512 rows and you cannot suddenly seat 1024 audience members — there's literally no sticker printed for row 512. You can extend the table and keep training, but the new rows start from random and the model has never seen them. Sinusoidal encodings are just a formula; you can evaluate them at seat 1,000,000 and get a well-defined vector, even if no sequence that long has ever been shown to the model.
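A minimal sketch of the learned table (class and variable names are mine, not from any particular codebase) — and the failure mode at the first row past the table:

```python
import torch
import torch.nn as nn

class LearnedPE(nn.Module):
    # a learned positional embedding table, in the spirit of BERT / GPT-2
    def __init__(self, max_len, d):
        super().__init__()
        self.table = nn.Embedding(max_len, d)    # one trainable sticker per seat

    def forward(self, x):                        # x: (B, T, d)
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.table(pos)               # (T, d) broadcasts over the batch

pe = LearnedPE(max_len=512, d=64)
print(pe(torch.randn(2, 512, 64)).shape)         # fine: every seat has a sticker

try:
    pe(torch.randn(2, 513, 64))                  # seat 512 was never printed
except IndexError as e:
    print("row 512:", type(e).__name__)
```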

Rotary Position Embedding (Su et al., 2021) is the positional encoding LLaMA uses. And Gemma. And Mistral. And most modern decoder-only transformers that don't use ALiBi. By installed base, it's the dominant seat-sticker of 2024, and it's cleverer than it looks.

The idea: instead of stapling a sticker to the audience member, spin the audience member on a turntable by an angle that depends on their row. Take the query and key vectors, split each d-dimensional vector into d/2 two-dimensional pairs, and rotate each pair by an angle proportional to its position — with a different frequency per pair, the same geometric sweep as before. In 2D, rotating (x₁, x₂) by angle θ gives:

RoPE — rotate the query/key pair by position-dependent angle
R(θ) · (x₁, x₂)  =  ( x₁ cos θ  −  x₂ sin θ ,
                       x₁ sin θ  +  x₂ cos θ )

θₘ    =   m · ωᵢ     ← position m times per-pair frequency ωᵢ
ωᵢ    =   1 / 10000^(2i/d)

The magic is what this does to the inner product. Rotate query q at seat m and key k at seat n, then take their dot product:

RoPE's load-bearing identity
⟨ R(mω) · q ,  R(nω) · k ⟩    =    ⟨ q ,  R((n − m)ω) · k ⟩

   →   the similarity score depends only on the
       relative offset (n − m), not on m or n alone

Stop and absorb that. The attention score between two audience members depends only on how far apart they're seated, not on whether they're in rows 3 and 5 or rows 103 and 105. That's exactly the property the theater wanted: seats whose relationships are comparable at a distance, no matter where in the room they are. It falls out of rotating Q and K together as a literal algebraic consequence — no seat-sticker addition, no learned bias, no embedding table. Just a rotation applied at each attention layer.
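The identity is easy to verify numerically in the 2D case — a quick numpy check, with an arbitrary frequency and two arbitrary seats:

```python
import numpy as np

def R(theta):
    # 2D rotation matrix
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(1)
q, k = rng.normal(size=2), rng.normal(size=2)    # a query pair and a key pair
omega = 0.3                                      # one per-pair frequency
m, n = 3, 11                                     # two seats

lhs = (R(m * omega) @ q) @ (R(n * omega) @ k)    # rotate both, then dot
rhs = q @ (R((n - m) * omega) @ k)               # only the offset survives
print(np.isclose(lhs, rhs))                      # True
```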

[interactive: RoPE — position becomes a rotation angle. θᵢ = pos · 10000^(−2i/d), q_rot = R(θ)·q, with the live values of R(θ) shown. Panels show dimension pairs rotating at their own rates — i = 0 (period ≈ 6), i = 8 (period ≈ 63), i = 24 (period ≈ 6283) — all at pos = 12. High-i pairs rotate slowly, low-i pairs rotate fast; the inner product depends on pos_q − pos_k only — pure relative position.]

The 2D pair on the left is a query; the one on the right is a key. Both start in row 0. Advance them both by the same amount — say, push them ten rows back together — and the angle between them stays identical, so the dot product is unchanged. Advance only one and the angle shifts. That shift is the relative-position signal the model reads. The inner product isn't blind to position — it's blind to absolute position, tuned to relative position. That distinction is the whole game.

Rotation (personified)
The other encodings staple a sticker to your shirt. I spin your query and your key on the same turntable, and when you take the inner product the shared rotation cancels — only the difference survives. I don't whisper “you are in row seven” to the model; I whisper “you are three rows ahead of that other audience member,” which, it turns out, is all the model actually wanted to know.

Which brings us to the fourth option, the contrarian one. What if you skip the seat sticker entirely and just… penalize audience members for shouting at people in faraway rows? That's ALiBi (Press et al., 2021). After computing the raw attention score QKᵀ, subtract a penalty proportional to how many rows apart the two seats are:

ALiBi — no embedding, just a distance penalty
score(m, n)   =   qₘ · kₙ   −   λ · |m − n|

 (λ is a per-head scalar, chosen on a fixed geometric schedule)

That's the whole encoding. Audience members attend less to seats far away simply because the score gets more negative as the row-distance grows. Each attention head gets its own λ — some heads use a steep penalty (short-range heads, focused on the nearest few rows), others a shallow one (long-range heads, happy to listen across the theater). No added stickers, no learned table, no rotations. The paper's headline finding: a model trained with ALiBi at 1024 rows can be deployed at 2048 with almost no degradation. Length extrapolation, nearly for free.
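A sketch of the bias computation (helper names are mine; the slope schedule follows the paper's geometric recipe for a power-of-two head count):

```python
import numpy as np

def alibi_slopes(n_heads):
    # geometric schedule from the ALiBi paper: for 8 heads, 1/2, 1/4, ..., 1/256
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(T, n_heads):
    # bias[h, m, n] = -lambda_h * |m - n|, added to QK^T before the softmax
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])   # (T, T)
    return -alibi_slopes(n_heads)[:, None, None] * dist            # (H, T, T)

bias = alibi_bias(T=6, n_heads=8)
print(bias.shape)        # (8, 6, 6)
print(bias[0, 0])        # head 0, seat 0: penalties 0, -0.5, -1, ... grow with distance
```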

Three implementations of the same Vaswani recipe. Pure Python first — one scalar at a time, the math directly transcribed. Then NumPy, where the broadcasting idea clicks and one seat's worth of arithmetic becomes the whole theater at once. Then PyTorch, where we drop in a minimal RoPE block so you can see both stickers side by side.

layer 1 — pure python · sinusoidal_pe_scratch.py
python
import math

def positional_encoding_scalar(pos, i, d):
    # one scalar of the PE matrix — position pos, embedding dim i
    # pairs: (2i, 2i+1) share a frequency, sin on the even, cos on the odd
    angle = pos / (10000 ** (2 * (i // 2) / d))
    if i % 2 == 0:
        return math.sin(angle)
    else:
        return math.cos(angle)

d = 128
for pos in (0, 1):
    for dim in (0, 1):
        v = positional_encoding_scalar(pos, dim, d)
        print(f"pos={pos} dim={dim} PE={v:.4f}")
stdout
pos=0 dim=0 PE=0.0000
pos=0 dim=1 PE=1.0000
pos=1 dim=0 PE=0.8415
pos=1 dim=1 PE=0.5403

Now vectorise. The cleanest way to print the full (seq_len, d) seating chart is with two 1D arrays and a broadcast — no loops. This is the shape of every production PE implementation you'll ever see.

layer 2 — numpy · sinusoidal_pe_numpy.py
python
import numpy as np

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i   = np.arange(d)[None, :]                  # (1, d)

    # frequency per dimension: pair 2i and 2i+1 share a freq
    div = np.power(10000.0, (2 * (i // 2)) / d)  # (1, d)
    angle = pos / div                            # (seq_len, d)  via broadcasting

    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])         # even dims → sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])         # odd dims → cos
    return pe

PE = positional_encoding(64, 128)
print("PE shape:", PE.shape)
print("PE[0, :4]: ", np.round(PE[0, :4], 4))
print("PE[1, :4]: ", np.round(PE[1, :4], 4))
stdout
PE shape: (64, 128)
PE[0, :4]:  [0.     1.     0.     1.    ]
PE[1, :4]:  [0.8415 0.5403 0.8218 0.5697]
pure python → numpy

for pos in range(seq_len): for i in range(d): → pos[:, None] and i[None, :] (broadcast)
two nested loops collapse into one outer-product shape

if i % 2 == 0: sin(...) else: cos(...) → pe[:, 0::2] = sin; pe[:, 1::2] = cos
strided slicing addresses even and odd dims independently

10000 ** (2 * (i // 2) / d) → np.power(10000.0, (2*(i//2))/d)
same formula, now a vector of per-dim divisors

Finally, the PyTorch nn.Module you would actually ship, plus a tight RoPE helper. The sinusoidal chart is registered as a non-trainable buffer (no gradient flows through it — nobody is learning the seat layout, it was already correct), and the RoPE rotation is applied to Q and K right before the attention softmax.

layer 3 — pytorch · positional_encoding_torch.py
python
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d, max_len=5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        i   = torch.arange(d).unsqueeze(0).float()
        div = torch.pow(10000.0, (2 * (i // 2)) / d)
        angle = pos / div
        pe = torch.zeros(max_len, d)
        pe[:, 0::2] = torch.sin(angle[:, 0::2])
        pe[:, 1::2] = torch.cos(angle[:, 1::2])
        self.register_buffer("pe", pe.unsqueeze(0))     # (1, max_len, d), not trainable

    def forward(self, x):                                # x: (B, T, d)
        return x + self.pe[:, : x.size(1)]

def rope(x, pos):                                        # x: (..., T, d), pos: (T,)
    # split last dim into pairs, rotate each pair by pos * per-pair frequency
    d = x.size(-1)
    i = torch.arange(d // 2, device=x.device).float()
    freq = 1.0 / torch.pow(10000.0, 2 * i / d)           # (d/2,)
    theta = pos[:, None].float() * freq[None, :]         # (T, d/2)
    cos, sin = theta.cos(), theta.sin()                  # (T, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # split into pairs
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rot1, rot2
    return out

# sanity check the shapes + a quick relative-position check for RoPE
torch.manual_seed(0)
x = torch.randn(8, 64, 128)
pe = SinusoidalPE(d=128)
print("sinusoidal PE shape:", pe.pe.shape)
print("after adding PE:     ", pe(x).shape)

q = torch.randn(8, 64, 128)
k = torch.randn(8, 64, 128)
positions = torch.arange(64)
q_rot = rope(q, positions)
k_rot = rope(k, positions)
print("rope q shape:        ", q_rot.shape)

# shift both q and k by the same offset — inner product should match the unshifted pair
offset = 5
q_shift = rope(q, positions + offset)
k_shift = rope(k, positions + offset)
diff = ((q_rot * k_rot).sum(-1) - (q_shift * k_shift).sum(-1)).abs().max()
print(f"max |q·k - q'·k'|:    {diff:.1e}  (relative-only check)")
stdout
sinusoidal PE shape: torch.Size([1, 64, 128])
after adding PE:      torch.Size([8, 64, 128])
rope q shape:         torch.Size([8, 64, 128])
max |q·k - q'·k'|:    2.1e-07  (relative-only check)
numpy → pytorch

pe = positional_encoding(seq, d) # numpy array → self.register_buffer("pe", pe)
non-trainable tensor, moves with .to(device), no gradient

out = x + pe[:seq_len] → return x + self.pe[:, :x.size(1)]
same add, now broadcasting over the batch dim

(no numpy analogue — RoPE is Q/K side) → q_rot, k_rot = rope(q, pos), rope(k, pos)
applied inside attention, not added to the token embedding

Gotchas

PE added before LayerNorm changes the statistics it sees: token embeddings are typically initialised with a small standard deviation, and adding a sinusoid with values in [−1, 1] noticeably shifts the variance going into the first norm. Most implementations either scale embeddings by √d before the add (the Vaswani move) or place the PE inside a residual after the first norm. Know which convention the codebase you're reading uses.

Length extrapolation is not free: train a learned-PE model at max_len=512 and infer at 2048 — you crash on an index-out-of-bounds or (worse) silently look up garbage if someone padded the table. Sinusoidal tolerates it mathematically, but most models still degrade sharply past their training length. RoPE and ALiBi degrade more gracefully, but even RoPE needs NTK-aware / YaRN-style frequency scaling to push 4× beyond training context without breaking.

RoPE swaps components during rotation: a common bug when implementing RoPE by hand is forgetting that the rotation mixes the two halves of the pair — you can't just scale x₁ by cos θ and be done. The pair (x₁, x₂) becomes (x₁ cos θ − x₂ sin θ, x₁ sin θ + x₂ cos θ). Miss the cross-term and the inner-product-depends-only-on-relative-position property quietly breaks.
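The first gotcha — the PE shifting the input statistics — shows up in a toy numpy check (the 0.02 init std is my assumption, typical of transformer embedding initialisations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 512, 64
tok = rng.normal(0.0, 0.02, size=(T, d))       # token embeddings at a small init std
pe = np.sin(np.arange(T)[:, None] / 7.0)       # stand-in sinusoid in [-1, 1]; broadcasts over d

print(f"tok std:                {tok.std():.3f}")                      # ~0.02
print(f"(tok + pe) std:         {(tok + pe).std():.3f}")               # the PE dominates
print(f"(tok*sqrt(d) + pe) std: {(tok * np.sqrt(d) + pe).std():.3f}")  # Vaswani scale-up rebalances
```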

Verify the Vaswani claim empirically

The core theoretical property of sinusoidal PE is that PE(pos + k) can be written as a linear function of PE(pos) for any fixed shift k. Implement the PE matrix, pick a shift k = 7, and find the d × d matrix M_k such that PE(pos + 7) ≈ M_k · PE(pos) for all pos.

Hint: stack the first 100 PE vectors into a matrix P of shape (100, d), do the same for the shifted ones into P_shifted, and solve P · M_kᵀ = P_shifted via np.linalg.lstsq. The residual should be at machine-precision zero. That's the relative-position property, made concrete.

Bonus: check that M_k is, structurally, a block-diagonal rotation matrix — one 2×2 rotation per frequency pair. You have just rediscovered RoPE.

What to carry forward. Attention without positional information is an order-blind set operation — a crowd in an unmarked theater. Every real transformer hands out seat stickers from the outside. Sinusoidal encodings are a fixed, additive barcode built so a linear operator can recover relative offsets. Learned embeddings are a simple lookup table that wins on perplexity and loses on extrapolation — you only printed stickers for the rows you trained on. RoPE spins Q and K on the same turntable so inner products depend only on relative seating by construction, which is why it dominates modern LLMs. ALiBi skips the sticker entirely and biases attention scores by row-distance, which is why it extrapolates so cleanly.

End of the NLP section. We walked from one-hot vectors to word2vec to subword tokenization to positional encoding. You now have every piece a transformer needs as input: tokens that mean something, and seats that know where they are. What remains is the mechanism that actually reads the room.

Next: Self-Attention — the main event. Now every audience member knows where they're sitting. The question the next section answers is how to let them talk to each other without whispering in a straight line, row by row, the way an RNN would. In self-attention every seat queries every other seat at once, compares barcodes, and decides whose voice to listen to. Everything this course has built — gradients, backprop, embeddings, and the seat stickers you just finished putting on — exists to make that single operation work. Time to watch it.

References