Intro to Natural Language Processing
Tokens, vocabs, and the classical NLP pipeline.
A neural net is a matrix multiply with ambitions. It adores numbers and cannot read. If you hand it the sentence “The cat sat on the mat.” it will stare at you, politely, the way a calculator stares at a poem. It does not know what a “c” is. It does not know what a word is. It does not, in fact, know what a sentence is. It knows tensors.
So before any RNN, any transformer, any MLP perched on top of word embeddings, the text has to be converted. Not gently translated — converted, the way a warehouse converts raw cotton into thread. That conversion happens on the assembly line that turns raw text into a clean tensor. Messy paragraphs come in one end at the loading dock. Three stations work on them in sequence. A row of integers rolls off the other end, ready to be handed to the first neuron in the model. That assembly line is what this lesson is about.
Natural Language Processing is, before anything else, that line. Every transformer, every sentiment classifier, every chatbot you have ever used runs the same conveyor: raw text in, integer IDs out, and those integers are what the network actually sees. This lesson walks the line end to end — cleaning, tokenizing, vectorizing — and explains why each station looks the way it does. By the end, when you read “we used a 50k BPE vocabulary” in a paper, it will mean something concrete: a particular station configured a particular way.
┌──────────────────┐
│ raw text │ "The cat sat on the mat."
└────────┬─────────┘
│ preprocess (lowercase? strip punct? unicode normalize?)
▼
┌──────────────────┐
│ normalized │ "the cat sat on the mat"
└────────┬─────────┘
│ tokenize (split into units: chars / words / subwords)
▼
┌──────────────────┐
│ tokens │ ["the", "cat", "sat", "on", "the", "mat"]
└────────┬─────────┘
│ vocab lookup (hashmap: str → int)
▼
┌──────────────────┐
│ ids │ [5, 241, 908, 17, 5, 612]
└────────┬─────────┘
│ embed (id → dense vector, learnable)
▼
┌──────────────────┐
│ vectors │ tensor of shape [seq_len, d_model]
└────────┬─────────┘
│
▼
model

Four arrows on the conveyor, four distinct engineering decisions. Skip the cleaning station and the model has to waste parameters learning that Cat and cat are the same animal. Pick the wrong tokenizer at the second station and half of Twitter shows up as <UNK> <UNK> <UNK> — unrecognizable crates falling off the belt. Get the vocab wrong at the third and your embedding table is 300MB for a model that only needs 30MB of parameters. Every production NLP system lives and dies on the boring middle of the line.
The loading dock is dull — a string is a string — and the third station is a hashmap, which we'll get to. The interesting machinery is the middle one: the tokenizer station, where the text gets chopped into reusable pieces. Different tokenizers are different blades mounted on the same conveyor. The fastest way to feel what they do is to watch a few of them chew on the same sentence, comparing how each blade slices the input and how big its vocabulary is.
Three numbers to watch. The vocab size (how many distinct tokens the blade knows about), the token count (how many pieces fall off the conveyor for your sentence), and — implicitly — the tradeoff between them. They are inversely related. Tiny vocab, tiny pieces, long sequence. Huge vocab, fat pieces, short sequence, but most entries in the table never get used and each one costs embedding parameters regardless.
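You can see the inverse relationship on a single sentence with two blades and no libraries (a toy comparison; the subword blade sits between the two rows):

```python
sentence = "the cat sat on the mat"

char_tokens = list(sentence)     # character blade: tiny pieces
word_tokens = sentence.split()   # word blade: fat pieces

# bigger pieces -> shorter sequence, but a bigger vocabulary to pay for
print(f"char: vocab {len(set(char_tokens)):>2}, tokens {len(char_tokens)}")
print(f"word: vocab {len(set(word_tokens)):>2}, tokens {len(word_tokens)}")
```

On one sentence the word vocabulary looks small; the 50k–100k figure only materializes at corpus scale, while the character vocabulary stays put at roughly a hundred entries.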
- Character-level. Vocab: about 100 (the printable ASCII range, plus a handful of Unicode oddities). No out-of-vocabulary problem — every word decomposes into known characters. The catch: sequences become long. A 100-word paragraph is ~500 characters on the belt, and transformer compute scales quadratically with sequence length.
- Word-level. Vocab: 50k–100k. Natural unit for English readers. The catch: every new word — misspellings, names, new slang, foreign loanwords — gets stamped <UNK> at the station and rolls on unchanged. Word-level blades have a generalization ceiling hard-coded into them on day one.
- Subword (BPE, WordPiece, SentencePiece). Vocab: 32k–50k. The middle path. Common words are single tokens (the, cat); rare words get decomposed into meaningful pieces (un + fathom + able). OOV essentially disappears because the blade can always fall back to characters. This is what every modern LLM uses. GPT-4 uses a ~100k BPE. Llama uses a ~32k SentencePiece.
I am the second station on the line. My one job: take your string of Unicode and slice it into pieces small enough to fit in a vocabulary, big enough to mean something. Pick me poorly and your model spends a third of its parameters memorizing that Tokenization is one word. Pick me well and it learns that Token, ization, and ##s are reusable building blocks for half the English lexicon.
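To make those reusable building blocks concrete, here is a minimal sketch of greedy longest-match segmentation in the WordPiece style. The vocabulary and the helper are hand-picked for illustration, not any model's real inventory:

```python
def wordpiece(word, vocab):
    """Greedy longest-match segmentation, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1  # no match: try a shorter prefix
        if piece is None:
            return ["<UNK>"]  # not even a single character matched
        pieces.append(piece)
        start = end
    return pieces

# toy vocabulary, chosen so the demo words segment cleanly
vocab = {"token", "##ization", "##s", "un", "##fathom", "##able"}
print(wordpiece("tokenizations", vocab))  # ['token', '##ization', '##s']
print(wordpiece("unfathomable", vocab))   # ['un', '##fathom', '##able']
```

The greedy longest-match loop is why common whole words survive as single tokens while rare words shatter into familiar fragments.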
To see why the word-level blade jams, we need to talk about the raw material. Text is not a tidy uniform stream; it is a wildly uneven one, and the unevenness is mathematically lawful. If you rank words by frequency in a large corpus, the count of the n-th most common word is roughly inversely proportional to its rank. This is Zipf's law, noted by the linguist George Zipf in 1949, and it's the reason the assembly line needs a cleverer station than “split on spaces.”
f(n) ∝ 1 / nˢ

where n = rank of the word (1 = most common), f = frequency count, and s ≈ 1 for natural language.

Take the log of both sides and the relationship becomes linear: log f = −s · log n + c. On a log-log plot, word frequency against rank is a straight line with slope about −1. It holds for basically every natural language ever measured, across corpora of books, newspapers, web text, and tweets.
That straight line on the log-log axis is the long tail, and it's the entire reason the subword blade exists. The top few thousand words (the, of, and, to…) make up about 80% of all tokens in typical English text. The remaining 20% of tokens come from a vocabulary of hundreds of thousands of words, most of which appear fewer than five times in a corpus of a million words. Your assembly line has to process all of them.
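The head-versus-tail arithmetic can be checked against an idealized Zipf distribution (s = 1 over a toy 100k-word vocabulary; the exact 80% figure depends on the corpus):

```python
import numpy as np

# idealized Zipf distribution: p(n) proportional to 1/n, 100k-word vocabulary
V, head = 100_000, 1_000
weights = 1.0 / np.arange(1, V + 1)
probs = weights / weights.sum()

# fraction of all tokens contributed by the top 1,000 words
head_share = probs[:head].sum()
print(f"top {head:,} of {V:,} words cover {head_share:.0%} of tokens")
```

Under a pure s = 1 curve the head covers roughly 60%; real corpora, where s runs slightly above 1 and function words dominate, concentrate more mass in the head, toward the ~80% quoted above.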
A word-level blade is forced into an uncomfortable choice: either add every rare word to the vocab (embedding table explodes, and the rare embeddings never learn anything because they see almost no gradient) or truncate the vocab at some frequency cutoff and dump everything below it into one bucket. That bucket is <UNK>, and it is where language goes to die.
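A sketch of that uncomfortable choice on a made-up corpus (both the corpus and the cutoff are toy values): truncate the vocabulary at a frequency cutoff and measure how much of the text collapses into the bucket.

```python
from collections import Counter

corpus = ("the cat sat on the mat the dog sat on the log "
          "a quokka pondered ontology").split()

counts = Counter(corpus)
cutoff = 2  # keep only words seen at least twice
vocab = {w for w, c in counts.items() if c >= cutoff}

# fraction of running text that gets stamped <UNK>
unk_rate = sum(1 for w in corpus if w not in vocab) / len(corpus)
print(f"vocab size {len(vocab)}, <UNK> rate {unk_rate:.0%}")
```

Half the tokens die in the bucket even on this tiny corpus, and Zipf's tail guarantees a real corpus behaves the same way at scale.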
I am the stand-in for every word you never taught your model. “Anthropic.” “cryptocurrency.” “Pikachu.” All me. I am one embedding vector, and I am supposed to represent the entire infinite set of words your tokenizer never saw during training. I do a bad job of it. This is why subword tokenizers mostly put me out of work.
Beyond <UNK>, every model family reserves a handful of special tokens — extra crates the station always stamps in, whose roles have nothing to do with language per se. They're structural markers the model learns to attend to, riding the belt next to the ordinary tokens.
- [CLS] — BERT's “classification” token. Prepended to every input; its final hidden state is used as a sentence embedding.
- [SEP] — BERT's separator between two sentences in a pair task.
- [PAD] — padding token. Added to shorter sequences so that a batch has uniform length. The attention mask tells the model to ignore it.
- <s>, </s> — beginning-of-sequence and end-of-sequence markers, used by models in the Llama and T5 families.
- <|endoftext|> — GPT-family document separator.
These live in the vocabulary alongside ordinary tokens and consume IDs like any other. When you read a paper and see a vocab size of 50,257, that includes the specials.
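That 50,257 is GPT-2's vocabulary size, and its composition is a tidy sanity check. The breakdown below is the commonly cited one — 256 byte-level base tokens, 50,000 learned merges, one special:

```python
byte_base = 256   # one base token per possible byte value
merges = 50_000   # learned BPE merge rules, each minting one new token
specials = 1      # <|endoftext|>

vocab_size = byte_base + merges + specials
print(vocab_size)  # 50257
```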
The third station — vectorize — is almost anticlimactic after the tokenizer. It's two lookup tables kept in sync:

- token2id: a hashmap from string to integer. Constant-time lookup as each token rolls past on the conveyor. “cat” → 241.
- id2token: a list where index i holds the string for id i. Constant-time lookup when you're running the line in reverse to decode. 241 → “cat”.
That's it. The “vocabulary” is just those two structures, saved to disk as JSON or a binary. Every tokenizer library you will ever use — Hugging Face's tokenizers, SentencePiece, torchtext — is, at its core, wrapping these two lookups plus whatever splitting algorithm the tokenizer station upstream decided on.
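Since id2token is derivable from token2id, persisting the vocabulary can be a one-table affair. A minimal sketch with a toy four-entry vocab:

```python
import json

token2id = {"<PAD>": 0, "<UNK>": 1, "cat": 2, "mat": 3}

# serialize the forward table; a str -> int map round-trips through JSON losslessly
restored = json.loads(json.dumps(token2id))

# rebuild the reverse lookup from the forward one
id2token = [None] * len(restored)
for tok, i in restored.items():
    id2token[i] = tok

print(id2token)  # ['<PAD>', '<UNK>', 'cat', 'mat']
```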
Three implementations of the same assembly line, climbing in sophistication. Pure Python so you can see every bolt. NumPy to watch the long tail emerge in real numbers. And a real tokenizer library — the industrial version of the line, which is what actually ships.
import string

def preprocess(text):
    text = text.lower()
    # strip punctuation, keep word characters and whitespace
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

def tokenize(text):
    return preprocess(text).split()  # whitespace split — the naive word tokenizer

def build_vocab(corpus, specials=("<PAD>", "<UNK>")):
    tokens = set()
    for sent in corpus:
        tokens.update(tokenize(sent))
    # specials get the lowest ids by convention
    id2token = list(specials) + sorted(tokens)
    token2id = {tok: i for i, tok in enumerate(id2token)}
    return token2id, id2token

def encode(text, token2id):
    unk = token2id["<UNK>"]
    return [token2id.get(tok, unk) for tok in tokenize(text)]

def decode(ids, id2token):
    return " ".join(id2token[i] for i in ids)

corpus = ["The cat sat on the mat.", "The dog sat on the log."]
token2id, id2token = build_vocab(corpus)
ids = encode("The cat sat on the mat.", token2id)
print(f"vocab size: {len(id2token)}")
print(f"ids: {ids}")
print(f"decoded: {decode(ids, id2token)}")

vocab size: 9
ids: [8, 2, 7, 6, 8, 5]
decoded: the cat sat on the mat
That's the whole line in thirty lines of Python. preprocess is the loading dock, tokenize is the station, encode is the lookup. Integer IDs out, ready to be embedded into a tensor. Now the frequency analysis that reveals Zipf's law — we run the tokenizer station over a real corpus and count what falls off the belt. NumPy makes the fit a three-line affair.
import numpy as np
from collections import Counter

def load_corpus(path):
    with open(path) as f:
        return f.read().lower().split()

tokens = load_corpus("wikipedia_sample.txt")
counts = Counter(tokens)

# sort tokens by frequency, descending
ranked = counts.most_common()
ranks = np.arange(1, len(ranked) + 1)      # 1, 2, 3, ...
freqs = np.array([c for _, c in ranked])   # aligned counts

print("top-10 tokens:")
for i, (tok, c) in enumerate(ranked[:10], 1):
    print(f'rank {i}: "{tok}" count={c}')

# fit a line in log-log space: log f = slope · log n + intercept
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"slope of log-log fit: {slope:.2f} (Zipf predicts -1)")

# the long tail in one number: fraction of vocabulary entries seen only once
hapax = (freqs == 1).sum() / len(freqs)
print(f"hapax legomena (words seen exactly once): {hapax:.1%}")

top-10 tokens:
rank 1: "the" count=2134
rank 2: "of" count=1012
rank 3: "and" count=987
...
slope of log-log fit: -1.02 (Zipf predicts -1)
Run this on any sizeable English corpus and the slope comes out between −0.9 and −1.1. Shakespeare, Wikipedia, Reddit, scientific papers — they all obey the same law. The fraction of words appearing exactly once (the hapax legomena) is typically 40–60%. Half your vocabulary is words the model sees a single time. No amount of cleverness at the loading dock fixes that — the tokenizer station itself has to get smarter.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Byte-level BPE — what GPT-2/3/4 use. No <UNK> needed:
# every byte is in the base vocabulary, so any Unicode string encodes.
tokenizer = Tokenizer(BPE(unk_token=None))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=30_000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    initial_alphabet=ByteLevel.alphabet(),  # seed all 256 byte tokens into the vocab
)
tokenizer.train(files=["wikipedia_sample.txt"], trainer=trainer)

print(f"trained vocab size: {tokenizer.get_vocab_size()}")
enc = tokenizer.encode("The cat sat on the mat.")
print(f"encode: {enc.ids}")
print(f"tokens: {enc.tokens}")
print(f"decode: {tokenizer.decode(enc.ids)}")

trained vocab size: 30000
encode: [464, 3797, 3332, 319, 262, 2603, 13]
tokens: ['The', ' cat', ' sat', ' on', ' the', ' mat', '.']
decode: The cat sat on the mat.
- text.lower().split() ←→ Tokenizer(BPE(...)).encode(text) — a one-line naive splitter becomes a learned segmentation
- token2id.get(tok, unk_id) ←→ byte-level base vocab — production BPE has no OOV; it always falls back to bytes
- Counter(tokens).most_common() ←→ tokenizer.train(files=...) — frequency counting is how BPE decides which merges to add
Lowercasing: helpful for topic classification, fatal for Named Entity Recognition. “Apple” the company and “apple” the fruit are different entities, and the loading dock shouldn't be throwing away that distinction before the tokenizer sees it. Modern tokenizers are case-sensitive by default for exactly this reason.
Unicode normalization: the string "café" can be encoded as five codepoints (é as one character) or six (e + combining acute accent). Run the wrong normalization form (NFC vs NFD) and identical-looking strings hash to different vocab entries. Always normalize at the loading dock, before the tokenizer ever touches the text.
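A quick check with Python's standard unicodedata module makes the hazard concrete:

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single codepoint (NFC)
decomposed = "cafe\u0301"  # 'e' plus a combining acute accent (NFD)

print(composed == decomposed)          # False: same glyphs, different codepoints
print(len(composed), len(decomposed))  # 4 5

# normalize before tokenizing and the two spellings collide, as they should
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```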
HTML, emoji, zero-width characters: web-scraped text is full of <br> tags, emoji modifiers, zero-width joiners (U+200D), and right-to-left marks. These look invisible to a human and disastrous to a tokenizer that wasn't trained on them. Strip or normalize deliberately; do not trust your input.
Bytes vs characters: Python 3 str is a sequence of Unicode codepoints. Python 3 bytes is a sequence of 8-bit values. Byte-level BPE (GPT-style) tokenizes bytes, not codepoints — which is why it can handle any text ever written without an <UNK>. If you mix the two up, you get garbled multi-byte characters, and the error propagates silently down the line into your embedding table.
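The codepoint/byte split is visible in a few lines of Python:

```python
s = "caf\u00e9"  # "café"

print(len(s))                   # 4 codepoints
print(len(s.encode("utf-8")))   # 5 bytes: 'é' encodes as two
print(list(s.encode("utf-8")))  # [99, 97, 102, 195, 169]
```

A byte-level BPE sees those five byte values, each already in its base vocabulary, so nothing is ever unknown.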
Grab the plain-text version of a Wikipedia article (try the one on Tokenization itself — it's on-theme). Feed it to three assembly lines, each with a different blade mounted on the tokenizer station:

- A character-level tokenizer (every unique character gets an id).
- A word-level tokenizer using text.lower().split().
- A BPE tokenizer trained on the page itself with vocab_size=2000 via tokenizers.Tokenizer.
For each, report three numbers: vocab size, token count after encoding the page, and the compression ratio (characters ÷ tokens). Character-level gives ratio ≈ 1. Word-level gives something like 4–6. BPE sits in between — typically 2.5–3.5 — and that's the sweet spot modern LLMs exploit.
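The ratio itself is one division; here is a sketch you can reuse for all three lines (naive word splitting stands in for your tokenizer of choice):

```python
def compression_ratio(text, tokens):
    # characters per token: how much text each token "absorbs"
    return len(text) / len(tokens)

text = "The cat sat on the mat."
char_tokens = list(text)
word_tokens = text.lower().replace(".", "").split()

print(f"char-level: {compression_ratio(text, char_tokens):.2f}")  # 1.00
print(f"word-level: {compression_ratio(text, word_tokens):.2f}")  # 3.83
```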
Bonus: print the 20 most frequent BPE tokens. You will see function words (the, of), common suffixes (ing, tion), and punctuation. That top-20 is Zipf's law in action.
What to carry forward. Every NLP model in existence runs the same assembly line on its inputs: loading dock (preprocess), tokenizer station (chop), vectorizer station (look up integer IDs), then embed and hand the tensor to the network. The model never touches text — only what the assembly line hands it. The tokenizer blade governs vocab size, sequence length, and OOV behavior all at once, and the math of Zipf's law is why nobody mounts a pure word-level blade anymore. Subword blades (BPE, WordPiece, SentencePiece) are the de facto standard because they gracefully span the full range from the to arbitrary Unicode garbage without a single crate rolling off marked <UNK>.
Next up — Word Embeddings. The assembly line stops at an integer. 241. That's a great ID and a terrible representation — because 241 and 242 are adjacent integers that probably belong to completely unrelated words. Turning a word into a number is easy; turning a word into a number that means something is the next problem, and it's what every embedding table in every model you've ever heard of is there to solve. We'll swap the bare integer coming off the conveyor for a dense vector whose geometry encodes meaning, and watch what that changes.
- [01] George K. Zipf — Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949 — the original observation.
- [02] Sennrich, Haddow, Birch — Neural Machine Translation of Rare Words with Subword Units. ACL 2016 — BPE for NLP.
- [03] Schuster, Nakajima — Japanese and Korean Voice Search. ICASSP 2012 — the WordPiece paper.
- [04] Kudo, Richardson — SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. EMNLP 2018.
- [05] Zhang, Lipton, Li, Smola — Dive into Deep Learning.