Build Vocabulary
Train a BPE vocab on real text.
Picture a librarian with a clipboard. Not judging books, not shelving them, not recommending summer reads — just walking every shelf in the stacks and tallying how many times each token shows up across the entire corpus. the: four hundred million. probability: twelve thousand. snorgleplop: one, in a fanfic someone scraped by accident. The walk takes a while. The clipboard gets heavy. When it's done, that clipboard is your vocabulary.
That's the whole lesson, in cartoon form. Training a vocabulary is the least glamorous step in a modern LLM pipeline and the first one you actually do. Before you initialize a single weight, before you write a line of model code, before you even check the GPU quota you don't have — you send the librarian through the shelves with a clipboard. Count everything, rank by frequency, keep the top N, save a JSON file. The file is under a megabyte. It ships with the model forever, and the model lives with whatever the librarian wrote down for the rest of its life.
The recipe looks deceptively small: pick a corpus, pick a vocab size, run BPE until the clipboard has that many rows, save the file. Under a megabyte of JSON. But almost every decision the librarian makes along the way — which shelves to walk, how long to tally, what to do with the one-off fanfic token — shows up three months later as a latency number, a scaling curve, or a multilingual regression no one can explain. This lesson is those decisions.
I'm the one number you pick before you pick any other number. Too small and every sentence becomes a ribbon of subwords — long sequences, slow inference, bloated attention. Too big and most of me is dead weight — embedding rows for tokens that appear once in a million documents. Somewhere between 32k and 100k, for English, is where I earn my keep.
The tradeoff is concrete. Every extra row on the clipboard costs you one row in the embedding table and one row in the output projection. Every token you don't keep forces the tokenization step to split common words into subwords, which makes sequences longer, which makes training and inference both quadratically more expensive (attention is O(n²)). Short clipboard: cheap memory, expensive compute. Long clipboard: the other way around.
embedding params  = V · d              (V = vocab size, d = hidden dim)
sequence length   n ∝ tokens-per-word  (and tokens-per-word falls as V grows)
compute per step  ≈ n² · d             (attention dominates at long n)

⇒ small V → long n  → big n² · d   (compute wins)
  big V   → short n → big V · d    (params win)

sweet spot lies in between — usually 32k–100k for English.
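The shape of that tradeoff is easy to see on the back of an envelope. The sketch below uses made-up tokens-per-word figures (the dictionary is an illustrative assumption, not a measurement) and a hypothetical hidden dim of 4096:

```python
# Back-of-envelope V tradeoff. tokens_per_word values are illustrative
# assumptions — real numbers depend on the corpus and the tokenizer.
def embedding_params(V, d=4096):
    # one row in the embedding table + one in the output projection
    return 2 * V * d

def attention_cost(n, d=4096):
    # attention compute per step grows as n^2 * d
    return n * n * d

tokens_per_word = {1_000: 2.8, 8_000: 1.7, 32_000: 1.3, 100_000: 1.1, 400_000: 1.05}
words = 1_000  # a thousand-word document

for V, tpw in tokens_per_word.items():
    n = int(words * tpw)  # sequence length after tokenization
    print(f"V={V:>7,}  params={embedding_params(V):>15,}  attn={attention_cost(n):>17,}")
```

Small V keeps the params column tiny while the attention column balloons; large V does the opposite.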
Imagine a slider for V. At V = 100 the librarian kept almost nothing — the tokenizer is effectively character-level, most English words take four to eight tokens, and a tweet is 200 tokens long. At V = 50,000 the clipboard holds every common word outright; compression settles around 1.3 tokens per word. Doubling again to 100k (GPT-4) buys diminishing returns — the marginal rows the librarian writes down are rare proper nouns and code symbols. Past that you're paying an embedding row for tokens that show up once every thousand documents.
That slider shows you the shape of the tradeoff. It doesn't show you what the librarian was counting. Which is the next thing.
Multilingual gets ugly fast. Say the librarian walks Common Crawl: about 45% of the shelves are English, 5% Chinese, 3% Spanish, and the long tail is everything else. Count raw, rank by frequency, and the clipboard is — unsurprisingly — mostly English. Swahili gets character-level representation and a 20x longer sequence for the same sentence, because its tokens never made the frequency cut.
The fix is upsampling: send the librarian through the low-resource shelves more than once, so their tally for those languages climbs faster than the raw corpus frequency would suggest. The mT5 paper uses a temperature-based sampler; LLaMA 2 hand-tuned the weights. Either way, the goal is the same — move everyone's tokens-per-word into the same ballpark.
    raw Common Crawl share                 upsampled training mix (α = 0.3)

    en  ████████████████████  45 %         en  ████████████  28 %
    zh  ███                    5 %         zh  █████         12 %
    es  ██                     3 %         es  ████           9 %
    ar  █                      1.5 %       ar  ███            7 %
    hi  ▌                      0.7 %       hi  ███            6 %
    sw  ▏                      0.1 %       sw  ██             4 %
    …                                      …

    → English tokens win 90 % of merges;   → every language gets a fair shot;
      rare-language seqs ~20× longer         sequence lengths within 2–3× of en
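Temperature-based upsampling in the spirit of mT5's sampler is a one-liner: raise each language's raw share to a power α < 1 and renormalize. The raw shares below mirror the rough Common Crawl figures above; the resulting mix is illustrative, not the exact weights any lab used.

```python
# Temperature-style language upsampling: p_i ∝ (raw share)^alpha, renormalized.
# alpha < 1 flattens the distribution; alpha = 1 reproduces the raw mix.
raw_share = {"en": 0.45, "zh": 0.05, "es": 0.03, "ar": 0.015, "hi": 0.007, "sw": 0.001}

def upsample(shares, alpha=0.3):
    scaled = {lang: p ** alpha for lang, p in shares.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

mix = upsample(raw_share, alpha=0.3)
for lang in raw_share:
    print(f"{lang}: {raw_share[lang]:6.1%} raw  ->  {mix[lang]:6.1%} upsampled")
```

English shrinks, Swahili grows by an order of magnitude, and everyone's tokens stand a chance of making the frequency cut.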
Whatever mix the librarian walks, the final frequency column on the clipboard always looks like this — and this is where the frequency-cutoff decision actually bites:
Log-log axes, near-perfect straight line. Zipf's law all over again, because language. The top 100 rows on the clipboard — spaces, common subwords, punctuation — account for maybe 40% of all the librarian's tally marks. The bottom 10% of the vocabulary shows up in fewer than 0.01% of documents. Those rare rows still cost one embedding parameter each, and they barely get enough gradient signal during training to learn anything useful.
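You can reproduce that head-heavy split with an idealized Zipf model (frequency ∝ 1/rank — a simplifying assumption, not a real corpus count):

```python
# If token frequency falls off as f(rank) ∝ 1/rank, what fraction of all
# tally marks do the top 100 rows collect on a 50,000-row clipboard?
def zipf_mass(top_k, vocab_size):
    weights = [1 / r for r in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)

share = zipf_mass(100, 50_000)
print(f"top 100 of 50,000 tokens carry {share:.0%} of all occurrences")
```

The idealized curve lands in the same ballpark as the ~40% figure above.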
This is the frequency cutoff in real life. The librarian keeps the top N by count and draws a line. Everything above the line gets a token id. Everything below it — the tokens that fell off the clipboard — either gets split into smaller pieces the tokenizer already knows, or dumped into an <UNK> bucket on classical tokenizers. The librarian never judges content. They just count. If a string of bytes didn't show up often enough, it doesn't make the cut, no matter how meaningful it might be to someone somewhere.
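The cutoff itself is nothing fancier than "keep the top N rows of the tally" — a toy sketch with the stdlib Counter (counts borrowed from the librarian's opening walk):

```python
from collections import Counter

# the librarian's tally, reduced to three rows
tally = Counter({"the": 400_000_000, "probability": 12_000, "snorgleplop": 1})

N = 2  # vocab budget
vocab = {tok: idx for idx, (tok, _) in enumerate(tally.most_common(N))}

print(vocab)                       # {'the': 0, 'probability': 1}
assert "snorgleplop" not in vocab  # fell off the clipboard — gets split or <UNK>
```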
This is one of the reasons vocab-size tuning matters. Past some point, every new row the librarian adds is in the long tail — it probably won't help the model, and it's guaranteed to cost memory.
I'm the slot the BPE algorithm never sees. I get a fixed id and a fixed string — <|endoftext|>, [PAD], <|user|> — and I get added after the librarian finishes counting, stapled onto the end of the clipboard. Don't ever let BPE learn my bytes. Don't ever let a user sneak my string into their prompt.
Special tokens are the hooks. They tell the model when a document starts, when a message ends, where padding begins, who is speaking. They are not learned by BPE — the librarian doesn't tally them. You reserve specific ids for them, set their embedding to something sensible (usually random init), and make sure the tokenizer will never produce them from ordinary text.
The standard cast, across the industry:
- [PAD] — padding to make variable-length sequences into rectangular batches. Attention mask zeroes them out.
- [CLS] / [SEP] — BERT-era classification and sentence separators. Still used in encoder models.
- <s> / </s> — sentence boundaries for T5 / BART-style seq2seq.
- <|endoftext|> — GPT-2's original document separator, inherited by most GPT-family tokenizers.
- <|user|> / <|assistant|> / <|system|> — chat-template roles. Post-GPT-3.5. These are the tokens that make instruction tuning tractable.
- Reserved slots. LLaMA 2 stapled roughly 256 unused token ids onto the end of the clipboard. They do nothing at pretraining time. When someone later wants to add a new role, a tool-use marker, or a modality token, they don't have to resize the embedding matrix — they claim a reserved id and keep moving.
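"Stapled on after counting" is mechanically trivial. A toy sketch — hypothetical helper and toy ids, not any real library's API:

```python
# Append special tokens past the end of an already-trained vocab:
# never re-learn them, never renumber existing ids.
def add_special_tokens(vocab, specials):
    next_id = max(vocab.values()) + 1 if vocab else 0
    for tok in specials:
        if tok not in vocab:  # idempotent: re-adding is a no-op
            vocab[tok] = next_id
            next_id += 1
    return vocab

bpe_vocab = {"the": 0, "Ġthe": 1, "ing": 2}  # pretend BPE output
add_special_tokens(bpe_vocab, ["<|endoftext|>", "[PAD]"])
print(bpe_vocab["<|endoftext|>"])  # 3 — first id past the learned vocab
```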
What does the librarian's clipboard actually look like once you pour it onto disk? A tokenizer is typically two files: a vocab.json mapping tokens to ids, and a merges.txt listing BPE merges in order. Combined size for a 50k vocab: under a megabyte. It ships with the model forever. You version it like code.
my-tokenizer/
├── vocab.json { "<|endoftext|>": 50256, "the": 1169, "Ġthe": 262, … }
├── merges.txt # 50,000 lines, one per BPE merge, order matters
│ t h
│ th e
│ Ġ t
│ …
├── special_tokens_map.json { "bos_token": "<|endoftext|>", "pad_token": "[PAD]" }
└── tokenizer_config.json { "model_max_length": 2048, "add_prefix_space": false, … }
total on disk: ~900 KB for 50,000 tokens — shippable as part of the model repo.

Three ways to send the librarian through the stacks. The first is the toy version — a pure-Python BPE loop with pre-tokenization on whitespace, useful for understanding and nothing else. The second is what you'd actually reach for in a real project. The third is the clipboard that's already loaded on every GPT-4 inference server on Earth.
from collections import Counter

def pre_tokenize(text):
    # split on whitespace first — BPE only merges within words, never across.
    return [list(w) for w in text.split()]

def get_pair_stats(words, freqs):
    pairs = Counter()
    for word, f in zip(words, freqs):
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += f
    return pairs

def merge_pair(words, pair):
    a, b = pair
    out = []
    for word in words:
        new, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                new.append(a + b); i += 2
            else:
                new.append(word[i]); i += 1
        out.append(new)
    return out

def train_bpe(corpus, num_merges=4000):
    words_by_freq = Counter(corpus.split())
    words = [list(w) for w in words_by_freq.keys()]
    freqs = list(words_by_freq.values())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(words, freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges

corpus = open("shakespeare.txt").read()
merges = train_bpe(corpus, num_merges=4000)
print(f"merges learned: {len(merges)}")

merges learned: 4000
vocab size: 4256 (256 byte baseline + 4000 merges)
"Hello, world!" → ['H', 'e', 'l', 'lo', ',', ' ', 'wor', 'ld', '!']
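The trainer learns merges but can't yet tokenize new text. A companion sketch (hypothetical `encode_word` helper, mirroring the merge loop of the trainer): replay the learned merges, in order, over the characters of each word.

```python
# Encoding with a learned merge list: apply each merge, in the order it was
# learned, to the character sequence of a word.
def encode_word(word, merges):
    pieces = list(word)
    for a, b in merges:  # order matters — earliest merges first
        i, out = 0, []
        while i < len(pieces):
            if i < len(pieces) - 1 and pieces[i] == a and pieces[i + 1] == b:
                out.append(a + b); i += 2
            else:
                out.append(pieces[i]); i += 1
        pieces = out
    return pieces

merges = [("t", "h"), ("th", "e")]  # toy merge list
print(encode_word("the", merges))   # ['the']
print(encode_word("then", merges))  # ['the', 'n']
```

Together with the trainer, this completes a (painfully slow) toy tokenizer: train once, then replay merges at encode time.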
That version is 30 lines, readable end-to-end, and about ten thousand times too slow for anything real — picture our librarian on a unicycle, stopping at every shelf to rewrite the clipboard by hand. In production you use the Hugging Face tokenizers library — Rust under the hood, ByteLevel pre-tokenization, trains a 50k vocab on 10 GB of text in well under an hour.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|user|>", "<|assistant|>", "[PAD]"]
                   + [f"<|reserved_{i}|>" for i in range(256)],  # LLaMA-style slots
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # full 256-byte base
)

tok.train(files=["corpus_shard_00.txt", "corpus_shard_01.txt", ...], trainer=trainer)
tok.save("my-tokenizer.json")
print("vocab size:", tok.get_vocab_size())
print(tok.encode("Hello, world!").tokens)

[00:04:18] Pre-processing files (10 GB, 14.2M docs) ━━━━━━━━ 100 %
[00:09:51] Tokenize words                           ━━━━━━━━ 100 %
[00:12:07] Count pairs                              ━━━━━━━━ 100 %
[00:41:02] Compute merges (50 000)                  ━━━━━━━━ 100 %
vocab size: 50257 (50000 BPE + 1 <|endoftext|> + 256 reserved)
And because you rarely actually send the librarian through the shelves from scratch — most projects start from someone else's clipboard — layer three is what it looks like to use a production tokenizer. This is tiktoken, which is what OpenAI ships for GPT-4.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4") # the real cl100k_base vocab
print("gpt-4 vocab size: ", enc.n_vocab)
ids = enc.encode("Hello, world!")
print("encoded tokens: ", ids)
print("decoded back: ", repr(enc.decode(ids)))
# quick compression check
poem = open("raven.txt").read()
n_tokens = len(enc.encode(poem))
n_words = len(poem.split())
print(f"avg tokens/word in The Raven: {n_tokens / n_words:.2f}")

gpt-4 vocab size:  100277
encoded tokens:  [9906, 11, 1917, 0]
decoded back:  'Hello, world!'
avg tokens/word in The Raven: 1.41
- split on whitespace; merge pairs in a Python loop ←→ ByteLevel pre-tokenizer + Rust BPE trainer — same algorithm, ~10,000× faster, handles raw bytes and Unicode edge cases
- manual special-tokens list in code ←→ special_tokens=[...] + 256 reserved slots — reserving ids at train time means no embedding resize later
- pickle.dump(merges) ←→ tokenizer.json ←→ tiktoken.Encoding — one JSON file ships with the model for its entire lifetime
Forgetting the pre-tokenization rule: BPE only merges within “words” as defined by your pre-tokenizer. If the librarian tallied with ByteLevel(add_prefix_space=False) and inference code later calls add_prefix_space=True, the token ids silently shift for every word after the first one. Save the pre-tokenizer config with the vocab. No exceptions.
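To see why the shift is silent, here is a toy imitation of byte-level behavior (the "Ġ" space marker is real GPT-2 convention; the function itself is a simplified stand-in, not the actual ByteLevel code):

```python
# Byte-level pre-tokenizers fold each word's leading space into the token,
# marked "Ġ" — so the FIRST word is the odd one out, and add_prefix_space
# decides whether it gets the marker too.
def byte_level_pretokenize(text, add_prefix_space):
    if add_prefix_space:
        text = " " + text
    out = []
    for i, w in enumerate(text.split(" ")):
        if not w:
            continue
        out.append(("Ġ" + w) if i > 0 else w)
    return out

a = byte_level_pretokenize("the cat sat", add_prefix_space=False)
b = byte_level_pretokenize("the cat sat", add_prefix_space=True)
print(a)  # ['the', 'Ġcat', 'Ġsat']
print(b)  # ['Ġthe', 'Ġcat', 'Ġsat']
```

Only the first pretoken changes — "the" vs "Ġthe" — but those are different vocabulary rows with different ids, so every downstream embedding lookup for that word is wrong.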
Version drift: Hugging Face tokenizers library updates sometimes change how ties in pair counts are broken, or how Unicode normalization is applied. Two librarians with the same corpus and same seed can write down subtly different merge orders across library versions. Pin the version that trained your production tokenizer.
NFD vs NFC normalization: Unicode has multiple ways to write the same character. é can be one codepoint (NFC) or two (“e” + combining accent, NFD). If the librarian counted on NFC and you feed it NFD text at inference, accented characters become two separate tokens. Normalize at both ends of the pipe, identically.
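The whole failure mode fits in a few lines of stdlib:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "cafe\u0301")  # composes to 'café', é = one codepoint
nfd = unicodedata.normalize("NFD", "caf\u00e9")   # decomposes: 'e' + combining acute

print(len(nfc), len(nfd))                         # 4 5 — same glyphs, different bytes
assert nfc != nfd                                 # a tokenizer sees two different strings
assert unicodedata.normalize("NFC", nfd) == nfc   # normalizing reunites them
```

Run the same `unicodedata.normalize` call on the training corpus and on every inference input, and this class of bug disappears.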
Special-token injection: if you naively pass user input through the tokenizer as plain text, a user who types <|endoftext|> literally into their prompt can inject the real document-boundary token and confuse your model. Use the allowed_special or disallowed_special flags and treat raw user text as untrusted.
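A minimal guard looks like this — a sketch with a hypothetical `reject_special_strings` helper; real tokenizers expose equivalent controls (tiktoken's `allowed_special` / `disallowed_special` flags are the production version):

```python
# Treat raw user text as untrusted: refuse any input that contains the
# literal string form of a special token.
SPECIAL_STRINGS = {"<|endoftext|>", "<|user|>", "<|assistant|>", "<|system|>"}

def reject_special_strings(user_text):
    for s in SPECIAL_STRINGS:
        if s in user_text:
            raise ValueError(f"special token string {s!r} in untrusted input")
    return user_text

reject_special_strings("hello world")  # passes through unchanged
try:
    reject_special_strings("ignore this <|endoftext|> boundary")
except ValueError as e:
    print("blocked:", e)
```

Rejecting is the bluntest option; escaping or splitting the offending string are gentler alternatives, but the invariant is the same: user bytes must never tokenize to a special id.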
Send the librarian through Tiny Shakespeare (tinyshakespeare.txt, about 1 MB). Using the Hugging Face tokenizers library, train a BPE with vocab_size=4000, ByteLevel pre-tokenization, and min_frequency=2.
Take the 100 most common English words (from any standard list — Google's 10k word list works). Run each through your tokenizer and count how many come out as a single token id. That's your “single-token coverage” score — how much of real English made it onto a Shakespearean clipboard.
Bonus: retrain with vocab_size=1000 and vocab_size=16000. Plot single-token coverage vs vocab size. The curve flattens around V ≈ 2000 — past that you're mostly paying for Shakespearean proper nouns and archaic spellings, which is exactly the corpus-vs-target-distribution point.
What to carry forward. Vocabulary training is a librarian with a clipboard, a frequency cutoff, and a set of engineering decisions you'll live with for the life of the model. Vocab size trades sequence length against embedding parameters — 32k to 100k is the working range for English, larger for multilingual. Corpus selection is the vocabulary: the shelves you walk decide what gets tallied, which decides what's cheap to say. Upsample rare languages so their tokens make the cut. Reserve slots for the specials you'll need later. Special tokens go on after training, never before. And the whole artifact fits in a sub-megabyte JSON file you version like code.
Next up — Tokenization Edge Cases. Our librarian built a clean clipboard on well-behaved text. Real text is not well-behaved. Emoji glued to punctuation, invisible zero-width joiners, URLs that end in tracking parameters, source code with tabs-vs-spaces holy wars, and the occasional user input that is literally the string <|endoftext|>. Next lesson: the places tokenizers break when the real world walks into the library, and what the production-tested defenses look like.
- [01]Sennrich, Haddow, Birch · "Neural Machine Translation of Rare Words with Subword Units" · ACL 2016 — the BPE-for-NMT paper that started it all
- [02]Kudo, Richardson · "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing" · EMNLP 2018
- [03]Touvron et al. · "Llama 2: Open Foundation and Fine-Tuned Chat Models" · Meta AI, 2023 — the reserved-token-slots paper
- [04]huggingface.co/docs/tokenizers
- [05]github.com/openai/tiktoken