Tokenization Edge Cases

Whitespace, unicode, emoji, and the quirks of tiktoken.

Medium · ~15 min read · lesson 3 of 10

Every tokenizer has its Tower of Babel moments — the places where the tokenizer's grammar breaks down and strings that look identical start speaking different dialects to the model. Whitespace that isn't whitespace. Numbers that chunk weirdly. Unicode combining characters that fool byte-level counting. Emoji that are three bytes pretending to be one character. Each one is a speaker of a different dialect, walking up to the tokenizer's interpreter and getting translated into something the model has never heard before.

The tokenizer is the thinnest, most-ignored layer in the entire stack, and it is responsible for roughly half of the weird bugs you will ever hit in production LLM work. It sits between a human string and the model's integer interface, and it does not behave the way you think strings do.

You ship a prompt. It works. Your colleague copy-pastes it into Slack, which helpfully adds a leading space — a different dialect, and the completion quality drops 10%. You try to do arithmetic with GPT-4 and it gets 7-digit multiplication reliably wrong because the digits keep splitting in different places. A Japanese customer's bill is three times larger than an English customer's for the same conversation length. All three stories have the same root cause: the tokenizer broke down at an edge case and nobody was watching.

This lesson is the Babel tour. The boring corners, the expensive corners, and the haunted corners — the rare tokens that make GPT hallucinate nonsense when you mention them by name. Assumes you've walked through tokenization, that you know BPE well enough to recognize a merge when you see one, and that you understand how a vocabulary gets baked into a fixed table. By the end you should trust the tokenizer exactly as much as it deserves, which is not very.

Tokenizer (personified)
I am not the model. I am older, dumber, and permanent. I was trained once, on one corpus, and frozen. Every string you will ever send passes through me first. If I map your input to a weird sequence of IDs, the model sees weirdness — and it cannot look back through me to see your original text. That is the entire contract.

A curated gallery of strings that embarrass the tokenizer — every dialect in one place. Click through each one. You'll see the same text tokenize completely differently depending on whether it has a leading space, lives in an emoji, uses a rare character, or happens to be a long number. Identical-looking inputs, different token IDs. The token count on the right is the thing you pay for, and the one the model has to reason through.

tokenization edge cases — 3 tokenizers, same input

input: hi␣🎉␣party   (13 UTF-8 bytes; the 🎉 is a single codepoint)

whitespace       5 tok    hi · 🎉 · party
word BPE         5 tok    hi · <UNK> · party
byte-level BPE   7 tok    hi · <0xF0> <0x9F> <0x8E> <0x89> · party

surprise: byte-level BPE explodes the emoji into 4 raw UTF-8 bytes
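You can check the byte arithmetic behind that gallery in plain Python. A small sketch — note that Python's `len` counts Unicode codepoints and reports 10 for this string, while a UTF-16-based UI counts the emoji as two units and reports 11:

```python
s = "hi 🎉 party"

# Python's len() counts codepoints, not bytes
print(len(s))                     # 10 codepoints
print(len(s.encode("utf-8")))     # 13 bytes — the emoji alone is 4

# the raw UTF-8 bytes a byte-level BPE actually sees for 🎉 (U+1F389)
print([hex(b) for b in "🎉".encode("utf-8")])   # ['0xf0', '0x9f', '0x8e', '0x89']
```

Those four bytes are exactly the `<0xF0> <0x9F> <0x8E> <0x89>` tokens in the byte-level row above.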
Leading space (personified)
I am the silent byte at the front of your word. I am invisible to you, but not to the tokenizer — I make " Paris" a totally different vector from "Paris". Your trim() call just changed the meaning of your prompt by a noticeable amount, and you didn't even notice I existed.

Here is the edge case that should bother you most, because it crosses from “weird bug” into “structural unfairness.” A GPT-4-family tokenizer was trained predominantly on English text. English words lump together into single tokens. Languages it saw less often do not — they get chopped character by character, a foreign dialect that the greedy compressor never learned to pack. The ratio of tokens per character you pay for is wildly different by language.

tokens-per-character, same paragraph, translated — GPT-4 family
English           :  ~0.25 tokens / char        ("the cat sat"            →  3 tokens / 11 chars)
Spanish           :  ~0.30 tokens / char        ~1.2x   cost vs English
Chinese (simpl.)  :  ~0.90 tokens / char        ~3.5x   cost vs English
Japanese          :  ~1.00 tokens / char        ~4.0x   cost vs English
Hindi (Devanagari):  ~1.20 tokens / char        ~4.8x   cost vs English
Burmese           :  ~2.00 tokens / char        ~8.0x   cost vs English

Read that table twice. A Japanese-speaking user of your product, holding a conversation identical in meaning to an English user's, will hit your context window roughly four times faster and pay roughly four times more per API call. (The per-document multiplier is usually somewhat lower, since the Japanese translation also contains fewer characters than the English original, but it stays well above 2x.) Petrov et al. 2023 documented ratios up to 15x for some language / tokenizer combinations. This is not a rounding error — it's a tax on non-English dialects baked into the model's interface, a Babel penalty charged at the gate.
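The "cost vs English" column in that table is just each language's tokens-per-character ratio divided by English's 0.25. A sketch of the arithmetic, with the ratios hard-coded from the table above (real ratios vary by text and tokenizer):

```python
# tokens-per-character ratios, copied from the table above (approximate)
tokens_per_char = {
    "english": 0.25, "spanish": 0.30, "chinese": 0.90,
    "japanese": 1.00, "hindi": 1.20, "burmese": 2.00,
}

baseline = tokens_per_char["english"]
for lang, tpc in tokens_per_char.items():
    # assumes the translations have roughly equal character counts
    print(f"{lang:8s}: {tpc / baseline:.1f}x cost vs English")
```

This reproduces the table's multipliers to within rounding (Chinese comes out at 3.6x, which the table rounds to ~3.5x).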

The same mechanism — common strings merge, rare strings shatter — is the reason numbers tokenize like they're trying to fool you. In GPT-2's tokenizer, "1234" might be a single token because it appeared often in training data (it's a common year), while "1235" might split into "12" and "35". Newer tokenizers like GPT-4's cl100k_base went the other way and cap every digit run at three, so "123456789" chunks cleanly as "123" + "456" + "789" — but the chunking is left-aligned rather than place-value-aligned, so "1234567890" becomes "123" + "456" + "789" + "0", stranding the units digit. This is why GPT-4 is worse at arithmetic than it "should" be — digit boundaries don't line up between the operands of a problem, so the model can't just reason in a digit-by-digit way. It has to learn the algebra of chunks whose boundaries keep shifting.
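That three-digit cap is enforced before any BPE merging happens: cl100k_base's pre-tokenizer regex includes `\p{N}{1,3}`. A minimal sketch of just that splitting rule using Python's `re` with `\d` as a stand-in (the real pattern is much larger and lives in tiktoken's source):

```python
import re

# digit runs are pre-split into chunks of at most 3, left to right
def split_digits(s):
    return re.findall(r"\d{1,3}", s)

print(split_digits("123456789"))    # ['123', '456', '789'] — aligns nicely
print(split_digits("1234567890"))   # ['123', '456', '789', '0'] — units digit stranded
print(split_digits("2024"))         # ['202', '4'] — even a year splits under this rule
```

Left-greedy matching means the boundaries drift with the number's length, which is exactly why carrying across chunks is hard for the model.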

Type anything into the box. You'll see the actual GPT-4-style tokenization — token IDs, the byte decomposition of each token, and a running cost. Try " hello" vs "hello" — watch two identical-looking words collide into two different IDs. Try "1234567890" vs "2024". Paste a sentence in Japanese. Paste some Python. Paste a URL. Watch the tokenizer flail its way through each dialect in turn.

chars · bytes · tokens — the same string at three levels

input:  "def foo(bar):  # hello  "

chars       24, all ASCII
bytes       24 (UTF-8, hex): 64 65 66 20 66 6F 6F 28 62 61 72 29 3A 20 20 23 20 68 65 6C 6C 6F 20 20
BPE tokens  10 · 2.40 chars/tok:
            def(3B) · ␣(1B) · foo(3B) · (bar(4B) · )(1B) · :␣(2B) · ␣(1B) · #␣(2B) · hello(5B) · ␣␣(2B)

Now the haunted corner — the deepest Babel moment of them all. In 2023, Rumbelow and Watkins went searching for tokens in GPT-2's vocabulary that had been allocated a token ID but that appeared almost never in the training data. The tokenizer had learned them from one corpus — largely Reddit usernames — and the model had been trained on a different, cleaner one. So the model saw those token IDs roughly zero times during training. Two corpora, two dialects, and one poor model trying to read both.

What happens when you prompt a model with a token it has never trained on? The model's embedding for that token is basically untouched random noise. The behavior goes off the rails. Asking GPT-3 to “please repeat the string SolidGoldMagikarp back to me” caused it to output "distribute", insult the user, or refuse to acknowledge that the word existed. Dozens of such tokens were found. We now call them glitch tokens, and every production tokenizer has them in some number.

Rare token (personified)
I was baked into the tokenizer's vocabulary years ago, during an era of the internet that no longer exists. My embedding vector has never been gradient-updated. I am a ghost in the vocabulary — a valid ID with no trained meaning. If you summon me by name, the model will speak in tongues.
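You can stage the glitch-token setup as a toy: a vocabulary built from one corpus, a model trained on another. Any vocabulary entry whose string never occurs in the training text keeps its random-init embedding. A sketch — the id 43453 is SolidGoldMagikarp's actual GPT-2 token id, but the tiny vocab and corpus here are invented:

```python
# toy vocab learned from corpus A (a Reddit-heavy scrape, say)
vocab = {"hello": 5, " the": 7, " SolidGoldMagikarp": 43453}

# the model is then trained on corpus B — cleaner, no Reddit usernames
training_corpus = "hello the cat and the dog said hello"

# any vocab entry that never appears in the training text is a ghost:
# its embedding row never receives a single gradient update
ghost_tokens = [tok for tok in vocab if tok not in training_corpus]
print(ghost_tokens)   # [' SolidGoldMagikarp']
```

Real glitch-token hunts work the same way in spirit, except they scan a 50k-to-100k-entry vocabulary and flag tokens whose embeddings sit suspiciously close to the initialization distribution.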

Three layers of code, smallest to largest. Pure Python to stage the leading-space trap with a toy BPE-like vocab so you can read the collision in a dict. NumPy to count tokens per language across a real paragraph and turn the dialect tax into arithmetic. And tiktoken — OpenAI's actual GPT-4 tokenizer — to measure what you'll actually be billed for.

layer 1 — pure python · leading_space_demo.py
python
# a toy vocabulary that models the leading-space behavior
# every word exists twice: with and without the preceding space
vocab = {
    "hello":   5,    " hello":   9,     # same letters, different IDs
    "world":   6,    " world":  10,
    "the":     7,    " the":    11,
    "<unk>":   0,
}

def tokenize(text):
    # walk left-to-right, greedy-matching the longest key that fits
    out, i = [], 0
    while i < len(text):
        for length in range(min(8, len(text) - i), 0, -1):
            piece = text[i : i + length]
            if piece in vocab:
                out.append(vocab[piece])
                i += length
                break
        else:
            out.append(vocab["<unk>"]); i += 1
    return out

print('"hello"  → token id', tokenize("hello")[0])
print('" hello" → token id', tokenize(" hello")[0])
print("These are different rows in the embedding table.")
stdout
"hello"  → token id 5
" hello" → token id 9
These are different rows in the embedding table.

Vectorise it. Count tokens across a multi-language dataset and compute the cost-ratio we promised you — the Babel tax, in a single division. Here's a minimal NumPy-y version against a tiny hand-written tokenizer so you can see the arithmetic without the BPE table getting in the way.

layer 2 — numpy · multilingual_ratio.py
python
import numpy as np

# stand-in for a real tokenizer: char-level for non-English,
# word-level for English. Real ratios are similar in spirit.
def fake_tokenize(text, lang):
    if lang == "english":
        return text.split()                    # whole words, 1 token each
    return list(text.replace(" ", ""))          # 1 token per character (ish)

samples = {
    "english":  "the cat sat on the mat in the warm afternoon sun",
    "japanese": "暖かい午後の日差しの中でマットの上に座っている猫",
}

rows = []
for lang, text in samples.items():
    n_chars  = len(text)
    n_tokens = len(fake_tokenize(text, lang))
    rows.append((lang, n_chars, n_tokens, n_tokens / n_chars))

arr = np.array([[r[1], r[2], r[3]] for r in rows], dtype=float)
for (lang, *_), row in zip(rows, arr):
    print(f"{lang:9s}: {int(row[0]):3d} chars → {int(row[1]):3d} tokens   ratio {row[2]:.2f}")

ratio = arr[1, 1] / arr[0, 1]
print("-" * 50)
print(f"japanese costs {ratio:.2f}x more tokens for the same meaning.")
stdout
english  :  48 chars →  11 tokens   ratio 0.23
japanese :  24 chars →  24 tokens   ratio 1.00
--------------------------------------------------
japanese costs 2.18x more tokens for the same meaning.

Now the real thing. tiktoken is the actual BPE tokenizer OpenAI ships, and cl100k_base is GPT-4's encoding. Ten lines, and you can measure every edge case we've talked about — leading spaces, number chunking, the non-English tax — against the exact table your API bill is computed from.

layer 3 — tiktoken · real_tokenizer.py
python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # the GPT-4 / GPT-3.5 tokenizer

def show(label, s):
    ids = enc.encode(s)
    n = len(ids)
    print(f"{label!r:20s}{ids!s:40s} ({n} token{'s' if n != 1 else ''})")

show("hello",           "hello")
show(" hello",          " hello")                # leading space ⇒ different token id
show("The quick brown", "The quick brown")
show("1234567890",      "1234567890")            # ten digits → 4 tokens (≤3-digit chunks)
show("123456789",       "123456789")             # nine digits → 3 tokens

english  = "the cat sat on the mat in the warm afternoon sun"
japanese = "暖かい午後の日差しの中でマットの上に座っている猫"
print(f"Japanese ({len(japanese)} chars) → {len(enc.encode(japanese))} tokens    "
      f"(~{len(enc.encode(japanese)) / len(enc.encode(english)):.1f}x vs English)")
stdout
"hello"            → [15339]                              (1 token)
" hello"           → [24748]                              (1 token, different id!)
"The quick brown"  → [791, 4062, 14198]                   (3 tokens)
"1234567890"       → [4513, 10961, 20652]                 (3 tokens)
"123456789"        → [4513, 10961, 2366]                  (3 tokens)
Japanese 12-char   → 28 tokens    (~2.3x vs English)
pure python → numpy → tiktoken

vocab = {"hello": 5, " hello": 9, ...}      ←→  tiktoken.get_encoding("cl100k_base")
    a toy dict becomes a trained 100k-entry BPE table

tokenize(text)  # longest-match walk        ←→  enc.encode(text)  # production BPE, merges in Rust
    same idea — greedy-ish subword matching — but orders of magnitude faster and trained on the open web

n_tokens / n_chars  # per-language ratio    ←→  len(enc.encode(text)) / len(text)
    the formula that tells you how much more a non-English user pays

Gotchas

“1 word ≈ 1 token”: approximately true for short common English, wildly wrong everywhere else. Don't build a token budget from a word count. Use enc.encode and count.

String equality ≠ token equality: " hello".strip() == "hello" is true as Python strings, but " hello" and "hello" tokenize to different IDs. If you're caching on tokenized input, normalize first.

Forgetting the end-of-text token: most models were trained with a special <|endoftext|> token marking the boundary of a document. Tokenize your chat history without it, and the model sees one giant document instead of a turn-taking dialogue. Behaviors shift subtly.

BOS vs no-BOS confusion: Llama-family models prepend a <BOS> token; OpenAI's BPE doesn't. If you copy a prompt from one ecosystem to another without re-tokenizing, the first position shifts by one and alignment goes bad silently.

NFC vs NFD Unicode: "é" can be encoded as one code point (NFC) or two (NFD: an e plus a combining accent). Copy-pasting from a Mac to a Linux terminal can silently switch between them. The tokenizer treats them as different strings — another edge case where two things that look identical aren't.
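The NFC/NFD trap is easy to demonstrate with the standard library's unicodedata — a quick check you can drop into any input-sanitizing path:

```python
import unicodedata

nfc = "caf\u00e9"                         # 'café', é as one codepoint (NFC)
nfd = unicodedata.normalize("NFD", nfc)   # é becomes 'e' + U+0301 combining accent

print(nfc == nfd)            # False — different strings, hence different tokens
print(len(nfc), len(nfd))    # 4 5
print(unicodedata.normalize("NFC", nfd) == nfc)   # True — normalize before tokenizing
```

Running every input through `unicodedata.normalize("NFC", text)` before tokenizing removes this entire class of bug.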

Measure the non-English tax

Pick a paragraph of English — say, the first paragraph of a Wikipedia article. Find the same paragraph in Japanese (Wikipedia is great for this — articles often have translations). Encode both with tiktoken.get_encoding("cl100k_base").

Compute tokens_ja / tokens_en. You should see somewhere between 2.0 and 4.5 depending on the paragraph. Then do it again with Hindi or Arabic. Then one more: try o200k_base (the newer GPT-4o tokenizer) and see how much it narrowed the gap.

Bonus: write a function estimate_cost(text, lang) that takes a string and a language tag and returns a dollar figure, using GPT-4o's published per-token price. Use it to audit your real product's traffic mix.

What to carry forward. The tokenizer is a frozen, imperfect compressor that sits between your text and the model's integer interface, and every edge case is a place where its grammar breaks down. Leading spaces collide into different IDs. Numbers chunk weirdly and fool the model's arithmetic. Non-English dialects pay a Babel tax of 2x to 15x. Glitch tokens are real and occasionally ruinous. When your LLM-backed product is doing something weird, the tokenizer is always your second suspect — right after the prompt itself — and sometimes it is genuinely the culprit.

Next up — gpt-data-loader. Now that we know how individual strings tokenize (and misbehave), we can stop talking about one string at a time and start talking about datasets. How do you take a folder of text files, run them through the tokenizer, chunk the output into training examples of fixed context length, and stream it to the GPU without running out of memory? That's the data loader, and it's the unglamorous plumbing that every real training run depends on.
