Tokenizer (Byte Pair Encoding)

BPE from scratch — merges, vocabs, edge cases.

Hard
~15 min read
lesson 1 of 10

Before any transformer can read a sentence, somebody has to chop the sentence into numbered chunks. That chopping is tokenization, and it is quietly the most important problem in NLP. Two obvious ways to do it both fail. Give every character its own token and a sentence turns into a parade — your sequence length balloons and the model burns attention on the letters in “the.” Give every word its own token and the first time someone types pseudohypoparathyroidism the model hits <UNK> and shrugs.

BPE is what sits in the middle, and the shape of it is almost cartoonishly simple. Think of it as the greedy compressor that grows its own alphabet. Start with just letters — the alphabet you were born with. Then scan the corpus, spot the most common pair of neighbors (t-h is everywhere), and glue that pair into a single new symbol. Now your alphabet is one symbol longer. Do that again. And again. Do it 50,000 times and you have a dictionary where the most common English chunks are each one token, rarer stuff decomposes into a handful of pieces, and nothing is ever truly unknown.

That is Byte Pair Encoding — the tokenizer GPT-2, GPT-3, GPT-4, Llama, and essentially every modern LLM runs on. It started life as Philip Gage's 1994 data-compression trick. Sennrich dusted it off for machine translation in 2015. Now it sits in front of every prompt you have ever typed to an LLM. This lesson builds one from scratch: train it on a tiny corpus, watch the merges happen, encode a sentence with the resulting merge list, then see why the GPT-2 authors swapped characters for bytes.

Tokenizer (personified)
I am the gatekeeper. The model will never see your string — only the integer IDs I assign. Choose me poorly and no amount of parameters will save you. Choose me well and the model sees an efficient, regular stream.

Back to the compressor image. We start with an alphabet — the smallest unit we are willing to see. For now, call it letters; in a minute we will swap letters for bytes. Every word in the corpus gets split down to that alphabet, so lower is just five atoms sitting in a row: l, o, w, e, r. Now count every adjacent pair across the whole corpus. Pick the pair that shows up most often. Glue those two atoms into a single new symbol — that new symbol joins the alphabet. Rewrite the corpus with the glue applied. Then do the whole thing again. And again.

That is the entire training algorithm. Count, pick, glue, repeat.

BPE training — the whole algorithm
vocab    ←  set of all unique base units in the corpus
merges   ←  []

repeat until |vocab| = target_size:
    pairs    ←  count(adjacent_pair(a, b) for all (a, b) in corpus)
    (a*, b*) ←  argmax_(a,b)  pairs[(a, b)]
    corpus   ←  replace every (a*, b*) with new token (a* b*)
    vocab    ←  vocab ∪ { (a* b*) }
    merges.append( (a*, b*) → (a* b*) )

No gradients. No neural net. No loss function. Just counting pairs and gluing the winner. The output is a merge list — an ordered sequence of rewrite rules, one per step. The merge list is the tokenizer. To encode new text you replay the merges in the same order they were learned. To decode you run them in reverse. Everything else the libraries wrap around this is packaging.
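Running the merges in reverse just expands each glued symbol back into its parts — and since every token literally carries the characters it was glued from, that collapses to plain string concatenation. A minimal sketch:

```python
# decoding a BPE token sequence: each token is the string its merges
# glued together, so undoing all the merges is just concatenation
def decode(tokens: list[str]) -> str:
    return "".join(tokens)

print(decode(["low", "er"]))             # lower
print(decode(["new", "e", "s", "t"]))    # newest
```

This is why byte-pair decoding is lossless: no token ever forgets its spelling.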

Step through the training loop on a five-sentence corpus. Each click advances one merge. Watch the pair-frequency table, see the winning pair get highlighted, and notice the alphabet grow by exactly one new token per step.

[interactive demo: byte-pair encoding on "the quick brown fox jumps" — step through up to 10 merges, watching the adjacent-pair counts, the highlighted winning pair, and the vocabulary growing by exactly one token per step]

The first few merges always feel boringly predictable, and that is the point. Step 1 almost always glues (t, h). Step 2 glues (h, e). Step 3 the compressor looks at the th it just minted, spots (th, e) everywhere, and glues those — and congratulations, the is now one token. Keep cranking. Frequent digrams like (e, r), (i, n), and (i, ng) light up next. After a few hundred merges the alphabet contains every common English word as a single symbol, and the remaining budget gets spent on suffixes (-tion, -ing), roots (pseudo), and the long tail of subwords you need to assemble anything stranger.

This is what lets a word like pseudohypoparathyroidism survive without any special handling. It never earned a single token — it was not frequent enough to win a merge round. But pseudo, hypo, para, thyroid, and ism each might have. The word decomposes into four or five familiar pieces. The model reads it as a sequence of known fragments, not a single alien blob. No <UNK>, no panic.

Merge (personified)
Give me a pile of text and I will tell you what it loves to put next to what. (t, h) always together? You just bought yourself a th. Show me that th nestled up to e in half your corpus? Fine, have your the. I am a greedy compression oracle. I take the cheapest possible win, every time, until you tell me to stop.

It is reasonable, at this point, to ask the obvious question: why not just use whole words as tokens and skip the merge dance entirely? Because of the math below. A word-level vocabulary has to memorize every inflection, every proper noun, every typo, every new word the internet coins next Tuesday. You either bloat the dictionary into the millions or you fall back to <UNK> the moment a user types something unexpected. BPE splits the difference — common words do become single tokens once the compressor merges their letters enough times, but the dictionary stays capped because rare words borrow pieces instead of demanding their own slot. Measure it.

what BPE buys you
compression_ratio   =   C / T          (chars per token)

tokens_per_word     =   T / W          (W = word count)

tokens_per_byte     =   T / B          (bytes, for byte-level BPE)


GPT-4 (cl100k_base, V = 100 277):

    English      ≈  3.8 chars/token     ≈  0.75 tokens/word
    Python code  ≈  2.9 chars/token     ≈  symbols compress worse than prose
    Thai         ≈  0.9 chars/token     ≈  ~5× worse than English
    Burmese      ≈  0.5 chars/token     ≈  ~10× worse than English

Read that table carefully. Every modern LLM is charged per token. Every context window is counted in tokens. When a Thai user and an English user type the same paragraph, the Thai user pays five times the tokens to say the same thing, gets five times less context, and runs up five times the latency. That asymmetry is not a bug in the math — it is a direct consequence of training the compressor on a corpus dominated by English web text. The frequent pairs that won merge rounds were English pairs. Everyone else pays the tax.
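The ratios in the table are cheap to compute for any tokenizer output. A sketch (the function name is my own) that takes the raw text and its token list:

```python
def tokenizer_stats(text: str, tokens: list[str]) -> dict[str, float]:
    """The three ratios from the table: chars/token, tokens/word, tokens/byte."""
    n_tokens = len(tokens)
    return {
        "chars_per_token": len(text) / n_tokens,
        "tokens_per_word": n_tokens / len(text.split()),
        "tokens_per_byte": n_tokens / len(text.encode("utf-8")),
    }

# baseline: a whitespace "tokenizer" — 25 chars, 5 tokens
text = "the quick brown fox jumps"
print(tokenizer_stats(text, text.split()))
# {'chars_per_token': 5.0, 'tokens_per_word': 1.0, 'tokens_per_byte': 0.2}
```

Run the same function over the same paragraph in two languages and the asymmetry in the table falls straight out of the numbers.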

Now try it live. Type a sentence and watch it decompose into byte-pair tokens, with the merges applied in the exact order they were learned. Notice how leading spaces travel with the word that follows them (a GPT-2 convention — more on that below) and how unusual words fragment into smaller pieces while common words survive as single tokens. The merge list you trained is the dictionary; this widget is that dictionary replayed on your input.

[interactive demo: a hand-coded BPE with ~30 merges (|V| = 75) encoding "the quick brown fox is running" into 20 tokens (30 chars, 1.50 chars/token) — "the", "ick", "ing" all become single tokens; click any chip for its merge lineage]

Two things are worth staring at. First, "hello" and " hello" (with a leading space) are different tokens. Not a bug — the GPT-2 tokenizer deliberately binds the leading space to the token because it makes sentence-level encoding cheaper and decoding lossless. It means the word at position 0 in a sentence and the same word mid-sentence get different IDs. Get used to it.

Second, capitalization matters. "The", "the", and " the" are three separate tokens in the dictionary. They share zero parameters in the embeddings table. The model has to learn that they mean roughly the same thing — and it does, but only because it sees them in similar contexts a million times.
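You can watch the space-binding happen at the pre-tokenization stage. The snippet below uses a simplified stand-in for the GPT-2 pre-tokenization regex (the real pattern also handles contractions and uses Unicode character classes, not ASCII ranges):

```python
import re

# simplified stand-in for GPT-2's pre-tokenizer: an optional leading space
# travels with the letter/digit/punctuation run that follows it
pat = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^A-Za-z\d\s]+|\s+")

print(pat.findall("The cat sat, then the cat left."))
# ['The', ' cat', ' sat', ',', ' then', ' the', ' cat', ' left', '.']
```

Note that "The" at position 0 comes out with no space while every later word carries one — exactly why "The", "the", and " the" end up as three separate vocabulary entries.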

Byte (personified)
I am the atom. 256 values, zero ambiguity, no Unicode drama. Whatever you throw at me — emoji, CJK ideographs, a corrupted file — decomposes cleanly into some sequence of me. Build your vocabulary out of me and you will never meet an <UNK> token again. That is my one job and I am very good at it.

GPT-2 introduced one small, brilliant change to the Sennrich recipe: don't start the alphabet from characters, start it from bytes. A character-level compressor has to decide what a character even is. Does é count as one symbol or two (e plus a combining acute accent)? Which Unicode normalization form? If the user pastes a malformed UTF-8 sequence, do you crash? What about 🦀, or the weird zero-width joiner some emoji smuggle in?

A byte-level BPE sidesteps every one of those questions. The starting alphabet is just the 256 possible byte values. Every possible input — every Unicode character, every emoji, every binary blob — encodes to some sequence of bytes, which decomposes trivially into those 256 atoms. The merges then glue common byte sequences into larger tokens, and now the compressor can chew on anything — source code, Chinese, Python, a PDF — with the same 50K-entry dictionary, and decoding is a perfect inverse. No loss. No <UNK>. Ever.
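You can see the byte decomposition directly — every string, however exotic, flattens to values in 0..255, and decoding is a perfect inverse:

```python
for s in ["the", "é", "🦀"]:
    raw = s.encode("utf-8")            # always a valid byte sequence
    print(f"{s!r} -> {list(raw)}")
# 'the' -> [116, 104, 101]        one byte per ASCII char
# 'é'   -> [195, 169]             two bytes
# '🦀'  -> [240, 159, 166, 128]   four bytes — still atoms from the same 256-value alphabet

# round-trip: the bytes recover the exact original string
assert bytes([240, 159, 166, 128]).decode("utf-8") == "🦀"
```

Four atoms for an emoji is the worst case; the merges quickly promote frequent multi-byte sequences back to single tokens.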

Three implementations, each shorter than the last. First, pure Python on a tiny corpus so you can watch every pair, every merge, every rewrite. Second, a numpy-free encoder that applies a trained merge list to a new word. Third, the Hugging Face tokenizers library — what you would actually ship.

layer 1 — pure python BPE training · bpe_scratch.py
python
from collections import Counter

# tiny corpus — 5 pseudo-sentences, already split into words
corpus = ["low", "low", "low", "lower", "lower", "newest", "newest", "newer"]

# step 0: each word becomes a tuple of characters (the starting "base units")
words = [tuple(w) for w in corpus]

merges = []
for step in range(5):
    # count every adjacent pair across the whole corpus
    pair_counts = Counter()
    for w in words:
        for i in range(len(w) - 1):
            pair_counts[(w[i], w[i + 1])] += 1

    # greedy — pick the most frequent pair
    best_pair, count = pair_counts.most_common(1)[0]
    merges.append(best_pair)
    merged = best_pair[0] + best_pair[1]

    # rewrite the corpus: replace every (a, b) with the merged token
    new_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == best_pair:
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        new_words.append(tuple(out))
    words = new_words
    print(f"merge {step+1}: {best_pair} -> {merged!r}  (count={count})")

vocab = set(t for w in words for t in w)
print(f"\nvocab after 5 merges:", vocab)
stdout
merge 1: ('l', 'o') -> 'lo'        (count=5)
merge 2: ('lo', 'w') -> 'low'      (count=5)
merge 3: ('e', 'r') -> 'er'        (count=3)
merge 4: ('n', 'e') -> 'ne'        (count=3)
merge 5: ('ne', 'w') -> 'new'      (count=3)

vocab after 5 merges: {'low', 'er', 'new', 'e', 's', 't'}

A couple dozen lines, the whole compressor. You can read the greedy logic off the page: the inner loop counts every adjacent pair, the most_common(1) call picks the winner, the glue step rewrites the corpus so the newly-minted token participates in the next round of counting. The first winning pair was (l, o). Five merges later, low and new are single tokens and er is a reusable suffix; a few more rounds and est falls out too. A few thousand merges would turn most of common English into single-token words. A 50K dictionary covers essentially all of it plus the interesting corners.

layer 2 — encoder from a trained merge list · bpe_encode.py
python
def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply merges in the exact order they were learned."""
    tokens = list(word)                               # start from characters

    for a, b in merges:                               # iterate merges in order
        i = 0
        out = []
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)                     # apply this merge
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# the merges learned in layer 1
merges = [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('n', 'e'), ('ne', 'w')]

for w in ["lower", "newest", "lowerness"]:
    print("tokens:", encode(w, merges))
stdout
tokens: ['low', 'er']
tokens: ['new', 'e', 's', 't']
tokens: ['low', 'er', 'ne', 's', 's']        # 'lowerness' — unseen word, fragmented
training → encoding
count adjacent pairs, pick argmax  ←→  iterate saved merges in order
    training is greedy search; encoding is deterministic replay

merges list grows over time  ←→  merges list is frozen, read-only
    the merge list IS the tokenizer — save it, ship it

rewrite corpus after every merge  ←→  rewrite one word at a time, never retrain
    encoding is O(word length × |merges|) per word — fast enough in practice

In production nobody writes the encoder by hand. You use a library that handles byte-level encoding, pre-tokenization (splitting on whitespace and punctuation), regex for the GPT-style space-handling, and a fast Rust implementation of the merge loop. That's Hugging Face tokenizers, or OpenAI's tiktoken. Same greedy compressor underneath — just with the inner loop rewritten in a language that does not stop to breathe. Here's what training looks like from the outside.

layer 3 — HuggingFace BpeTrainer · bpe_hf.py
python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# byte-level BPE, just like GPT-2
tok = Tokenizer(BPE(unk_token=None))
tok.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=2000,                          # target vocabulary size
    special_tokens=["<|endoftext|>"],
    initial_alphabet=ByteLevel.alphabet(),    # 256 bytes as starting atoms
)

tok.train(files=["shakespeare.txt"], trainer=trainer)  # pretend this file exists

print("vocab size:", tok.get_vocab_size())
enc = tok.encode("The quick brown fox")
print("encoding 'The quick brown fox':", enc.tokens)

enc = tok.encode("pseudohypoparathyroidism")
print("encoding 'pseudohypoparathyroidism':", enc.tokens)
stdout (illustrative — exact tokens depend on the training corpus)
vocab size: 2000
encoding 'The quick brown fox': ['The', 'Ġquick', 'Ġbrown', 'Ġfox']        # ByteLevel renders a leading space as 'Ġ'
encoding 'pseudohypoparathyroidism': ['pseudo', 'hyp', 'op', 'ara', 'thyroid', 'ism']
numpy-free encoder → HuggingFace
merges = [(a, b), ...]  ←→  tok = Tokenizer(BPE(...))
    the merge list lives inside the Tokenizer object

encode(word, merges) -> list[str]  ←→  tok.encode(text).ids -> list[int]
    the library also assigns integer IDs for direct model input

iterate python merge list  ←→  compiled Rust applies all merges in one pass
    production tokenizers are 100-1000× faster than pure python

Gotchas

Leading spaces are tokens: "hello" and " hello" are different tokens in every GPT-style tokenizer. If you programmatically concatenate strings, check whether you are about to mint a different token sequence than the one the model was trained on. This bug is the source of most “why is my LLM performance bad on my own dataset” posts.

Unicode normalization: the string "café" can be encoded in UTF-8 as either 63 61 66 C3 A9 (precomposed é) or 63 61 66 65 CC 81 (e + combining accent). These tokenize to different sequences. Normalize (NFC is usually right) before tokenizing — and apply the same normalization at inference time that you applied during training.
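A quick check of that gotcha with Python's unicodedata — the byte values match the hex above:

```python
import unicodedata

nfc = "caf\u00e9"    # precomposed é
nfd = "cafe\u0301"   # e + combining acute accent

print(nfc == nfd)                        # False — different codepoint sequences
print(nfc.encode("utf-8").hex(" "))      # 63 61 66 c3 a9
print(nfd.encode("utf-8").hex(" "))      # 63 61 66 65 cc 81

# normalize both to NFC before tokenizing and they agree
assert unicodedata.normalize("NFC", nfd) == nfc
```

Apply the same normalize call in your training pipeline and your inference pipeline, or the two will silently disagree on identical-looking input.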

Numbers fragment badly: BPE doesn't know arithmetic. The number 12345 might tokenize as "123", "45" or "12", "345" or any other split depending on what sequences were frequent in training. This is partly why LLMs struggle with long arithmetic. Modern tokenizers constrain digits during pre-tokenization — Llama 2 split numbers into single-digit tokens; GPT-4's and Llama 3's tokenizers cap digit runs at three per token.
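The fix is a pre-tokenization rule: break digits apart before BPE ever sees them, so arbitrary multi-digit merges can never form. A regex sketch of the single-digit variant (not the exact rule any particular model uses):

```python
import re

def presplit_digits(text: str) -> list[str]:
    # each digit becomes its own piece; non-digit runs stay intact
    return re.findall(r"\d|\D+", text)

print(presplit_digits("order 12345 shipped"))
# ['order ', '1', '2', '3', '4', '5', ' shipped']
```

BPE then runs inside each piece, so "12345" is guaranteed to be five one-digit tokens regardless of what was frequent in the corpus.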

Decoding a partial stream: if you interrupt generation mid-token — e.g., streaming token-by-token to a UI — the last token might be an incomplete UTF-8 sequence. Buffer bytes until you have a complete character. Otherwise you get mojibake on the last chunk.
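Python's incremental UTF-8 decoder does that buffering for you — it holds incomplete multi-byte sequences until the rest arrives:

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# the crab emoji's four bytes split across two streamed chunks
chunks = [b"hi \xf0\x9f", b"\xa6\x80 there"]

out = ""
for chunk in chunks:
    out += dec.decode(chunk)   # partial sequences are buffered, not mangled

print(out)   # hi 🦀 there
```

A naive `chunk.decode("utf-8")` on the first chunk would raise (or, with errors="replace", emit mojibake); the incremental decoder simply waits.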

Train BPE on Shakespeare

Grab the complete works of Shakespeare as a plain text file (about 5 MB, shakespeare.txt is a standard benchmark). Train a byte-level BPE tokenizer on it with vocab_size=256 using Hugging Face BpeTrainer. Since you start from 256 bytes, you will learn exactly zero merges — this is your baseline.

Now re-train with vocab_size=1000, then 5000, then 10000. For each, encode the full corpus and measure tokens-per-byte. You should see it drop from 1.00 (no merges) to around 0.35-0.40 (10K vocab). That ratio is the compression ratio you bought with the extra 9 744 vocab slots.

Bonus: print the 20 longest tokens in the 10K vocab. You will find whole Shakespeare words like "Hamlet" and "Rosencrantz" as single tokens — the compressor has memorized the corpus's proper nouns. That's what “trained on your data” means.

What to carry forward. BPE is a greedy compressor that grows its own alphabet. It starts from bare atoms — bytes, in practice — spots the most common pair, glues the pair into a new symbol, and repeats until the dictionary hits the size you asked for. That solves out-of-vocabulary by letting rare words borrow common pieces, caps vocabulary bloat by construction, and — in the byte variant — handles every Unicode corner the same way it handles ASCII. The merge list is the tokenizer: train once, apply forever. And the compression ratio is not uniform across languages, which is a genuine source of real-world model bias that most benchmarks quietly ignore.

Next up — Build Vocabulary. You have seen the compressor run on five words. Now point it at a real corpus, ship it, and watch what a 50K-entry dictionary actually contains. That is build-vocabulary — training a BPE vocab on real text, spelunking the result, and making sure the merge list you save matches the one you load.
