Word Embeddings

Meaning as geometry in ℝⁿ.

Medium · ~15 min read · lesson 1 of 4

Picture a city. Every word in the language lives somewhere on the map — each one at a specific address, and the address means something. “King” and “queen” live on the same block. “Paris” and “London” are the capitals of their respective districts. “Walked” and “ran” share a street. The royalty neighborhood sits nowhere near the neighborhood for farm equipment, and that is not an accident — it is the whole point of the layout.

That city is what we're building in this lesson. Before 2013 we didn't have it. Feeding a word into a neural network was a comedy of inefficiency: you had a vocabulary of 50,000 words and you handed each one a 50,000-dimensional coordinate with a single 1 at its index and zeros everywhere else. “King” got slot 4,217. “Queen” got slot 11,982. There was no map — just a sparse filing cabinet. The address of “king” told you nothing about the address of “queen,” which told you nothing about the address of “tractor.” All three were equidistant, all three orthogonal, all three useless.

This was the wall. Every NLP model that wanted to generalise — to understand that a sentence about kings is also, somehow, a sentence about queens — was doing it the hard way, with hand-engineered features and brittle tokenization and gazetteers patched on top. The whole field was waiting for a map — a coordinate system where location meant meaning.

Then Mikolov dropped word2vec in 2013 and the wall came down overnight. The trick: don't hand-draft the map — learn it. Give every word a dense vector of fifty to three-hundred real numbers, and train those numbers by making nearby-in-text words be nearby-in-space. Feed raw text in, get a city out. Within a year “word embeddings” was the default input to every NLP model on earth. Within five, classical feature engineering was a museum exhibit.

One-hot vector (personified)
I am a fifty-thousand-dimensional identity card. Every word gets its own unique slot and nothing else. I carry no similarity, no structure, no generalisation — just a yes at one index and a no at every other. I worked for a while because we had nothing better. Then we did.
the two worlds, side by side
one-hot  (|V| = 50,000):

   king   =  [ 0, 0, ..., 0, 1, 0, ..., 0 ]        ← 50,000 entries, one "1"
   queen  =  [ 0, 0, ..., 1, 0, 0, ..., 0 ]        ← 50,000 entries, one "1"

   ‖king − queen‖²   =   2              (every pair of distinct words is √2 apart)


dense embedding  (d = 100):

   king   ≈  [  0.21, −0.43,  1.02,  ..., −0.05 ]     ← 100 real numbers
   queen  ≈  [  0.19, −0.40,  0.97,  ..., −0.07 ]     ← 100 real numbers

   cos(king, queen)  ≈  0.78          (similar words → similar vectors)

Look at the second block. The dot product of the normalised vectors is 0.78 — the cosine of the angle between them. That's the whole similarity reveal: “king” and “queen” sit on the same block of the map because their coordinates point in almost the same direction. The first block's sparse one-hots point in fully orthogonal directions, so cosine similarity there is always zero. Dense coordinates turn “are these words related?” into “are their addresses close?” — a question geometry knows how to answer.
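The contrast is easy to verify in a few lines of NumPy; the dense coordinates below are illustrative stand-ins (a few made-up dimensions), not trained values:

```python
import numpy as np

# one-hot: "king" and "queen" occupy different slots, so their dot product is 0
V = 50_000
king_1h, queen_1h = np.zeros(V), np.zeros(V)
king_1h[4217], queen_1h[11982] = 1.0, 1.0

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king_1h, queen_1h))        # 0.0: orthogonal, no shared signal

# dense: truncated illustrative vectors (first few dims of the block above)
king  = np.array([0.21, -0.43, 1.02, -0.05])
queen = np.array([0.19, -0.40, 0.97, -0.07])
print(round(cosine(king, queen), 2))    # close to 1: nearly parallel
```

Every pair of distinct one-hots is orthogonal no matter what the words mean; the dense pair points in almost the same direction because their coordinates were shaped by shared contexts.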

How do you draw the map in the first place? The cleanest framing is skip-gram. Slide a window over your corpus. For every center word, try to predict the surrounding context words. The model is absurdly simple — one embedding lookup for the center word, one linear layer, and a softmax over the entire vocabulary — and the training signal is “the word's neighbours in the sentence.” No labels, no curated data, just raw text and a window. Every time two words co-occur, the city planner nudges their addresses a little closer together; every time two words avoid each other, their blocks drift apart.

CBOW (continuous bag of words) runs the same idea the other way: given the context, predict the center word. GloVe (Pennington 2014) skips the word-by-word training and factorises a global co-occurrence matrix directly. Three flavours, one central insight: words that share contexts should share coordinates. The distributional hypothesis, in one sentence — cashed out as a training objective.

skip-gram — predict the neighbours
for each (center, context) pair in the corpus:

   P(context | center)   =   exp( u_context · v_center )
                             ──────────────────────────
                             Σ_w  exp( u_w · v_center )

   maximise   log P(context | center)    over all pairs

   v_center  ∈ ℝᵈ    — center-word embedding (what we keep)
   u_w       ∈ ℝᵈ    — "output" embedding for every vocab word

The denominator is a softmax over the whole vocabulary — expensive. In practice everyone uses negative sampling: instead of summing over all 50k addresses, pick a handful of random “negative” words and push them to the far side of the city. The cross-entropy objective collapses from a vocabulary-wide sum to a small sample and runs roughly a hundred times faster. That trick is the reason word2vec trained in hours, not weeks.
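A sketch of one negative-sampling update in NumPy. The toy sizes, the uniform negative sampler (the paper uses a unigram^0.75 distribution), and the single repeated pair are mine, not word2vec's actual implementation; the point is that each step touches k+1 output rows instead of all V:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 1000, 50, 5, 0.05
W_in  = rng.normal(0, 0.1, (V, d))    # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))    # context ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context):
    """One negative-sampling update for a single (center, context) pair."""
    negs = rng.integers(0, V, size=k)            # toy: uniform negatives
    ids  = np.concatenate(([context], negs))     # 1 positive + k negatives
    labels = np.zeros(k + 1); labels[0] = 1.0    # the positive gets label 1

    h      = W_in[center]                        # (d,)
    scores = W_out[ids] @ h                      # (k+1,) dots, not V of them
    g      = sigmoid(scores) - labels            # gradient of the logistic loss
    W_in[center] -= lr * (g @ W_out[ids])        # uses pre-update output rows
    W_out[ids]   -= lr * np.outer(g, h)
    # monitoring loss: -log sigma(s+) - sum log sigma(-s-)
    return (-np.log(sigmoid(scores[0]) + 1e-12)
            - np.log(sigmoid(-scores[1:]) + 1e-12).sum())

loss0 = np.mean([sgns_step(3, 7) for _ in range(200)])
loss1 = np.mean([sgns_step(3, 7) for _ in range(200)])
print(loss0 > loss1)   # repeated updates on the same pair drive its loss down
```

The positive context word is pulled toward the center vector; the handful of random negatives are pushed away. That is the whole softmax, approximated by a sample.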

Here's the payoff — a 2D projection of the map. Twenty words, trained on enough text to pick up structure. Click a word to light up its nearest neighbours. Pay attention to which blocks sit next to which districts and which ones are on the other side of town.

embedding space — hover to see nearest neighbors
20 tokens · 10-dim semantic basis · cosine metric
[interactive scatter: PC1 × PC2 projection, with clusters for royalty (king, queen, prince, throne) · animals (dog, cat, puppy, wolf) · motion (run, jump, swim, walk) · food (bread, apple, pizza, cheese) · weather (rain, snow, storm, cloud); hover a word to query cos(word, ·)]

Royalty drifts toward royalty. Countries cluster into their own district. Animals colonise a neighbourhood nobody told them to colonise. Nothing in the training objective said “group capitals together” — the clusters emerged because those words appear in similar contexts in the corpus. “Paris is the capital of…” and “London is the capital of…” share a template, so the model learns that Paris and London play similar roles, so their addresses end up on the same block. Geometry follows grammar, which follows usage. The city lays itself out.

Embedding matrix (personified)
I am a |V| × d table of real numbers — one row per word, one coordinate in each column. Ask me for a word's address and I hand you my 4,217th row. That row started as random noise, a word with no fixed home. Gradient descent pushed it around until words that keep similar company ended up as neighbours on the map. I am the quietest, most important layer in any NLP model — the place where symbols become geometry.

Now the famous party trick. If the map is laid out consistently, then the direction from one block to another should mean something too. Walk from “man” to “king” — that's some vector, call it the “royalty direction.” Start at “woman” and walk the same vector. Where do you land?

the analogy that made word2vec famous
  vec("king")  −  vec("man")  +  vec("woman")    ≈    vec("queen")

equivalently:

  vec("king")  −  vec("queen")    ≈    vec("man")  −  vec("woman")

   └──────── "gender" direction ─────────┘


other relationships the same space encodes:

  vec("paris")  −  vec("france")  +  vec("italy")   ≈   vec("rome")
  vec("walking") − vec("walk")    +  vec("swim")    ≈   vec("swimming")
  vec("bigger")  − vec("big")     +  vec("small")   ≈   vec("smaller")

You land on queen's block. The arithmetic is literally walking the map: subtract the “man” address, add the “woman” address, and the streets line up so that the endpoint is queen's coordinate. Swap in Paris, France, Italy — you walk from the capital-of-France intersection to the capital-of-Italy intersection and end up next to Rome. Swap in verbs — walk the “present-continuous direction” and you land on “swimming.” The city has consistent streets.

This isn't magic and it isn't engineered. It falls out of the training objective. If every (country, capital) pair tends to appear in similar surrounding contexts, the model ends up placing them in parallel positions in the space — so the “capital of” relationship becomes a consistent direction vector. The same goes for gender, tense, comparative/superlative, nationality, and dozens of other relations nobody told the model about. Linear structure, discovered rather than imposed.
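The analogy walk itself is only a few lines. Here's a sketch over a tiny hand-built space; the coordinates are invented to make the geometry visible, not trained:

```python
import numpy as np

# toy space: dim 0 is roughly "royalty", dim 1 "gender", dim 2 "place"
emb = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.0,  0.8, 0.0]),
    "woman": np.array([0.0, -0.8, 0.0]),
    "paris": np.array([0.0,  0.0, 0.9]),
}

def analogy(a, b, c):
    """Nearest word to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("king", "man", "woman"))   # → queen
```

Note the standard convention of excluding the three input words from the search: in real embedding spaces the raw nearest neighbour of king − man + woman is very often "king" itself.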

word arithmetic — a − b + c ≈ ?
nearest word in 8-dim basis (royalty, gender, place, tense, magnitude)
[interactive: pick a, b, c from 20 tokens (king, queen, prince, princess, man, woman, boy, girl, paris, france, rome, italy, walk, walking, swim, swimming, big, bigger, small, smaller) · e.g. king − man + woman → nearest: queen, cos ≈ 0.999, matches the expected answer]

Pick an analogy from the menu, watch the vector arithmetic, and see which word's address the result vector lands nearest to. The answer isn't always the “correct” one — the map is small, the corpus is tiny, and the top hit is sometimes a near-miss neighbour rather than the textbook answer. That's honest. Full-scale pretrained embeddings (300-dim GloVe trained on 840 billion tokens) have a denser city with more streets paved, and they hit the textbook answer the vast majority of the time.

Analogy (personified)
I am not a feature anyone built. I am a side-effect of training. When a relationship between two words appears often enough in text, the difference between their addresses becomes a consistent direction on the map. Add that direction to a third word and you walk to the fourth. I am the reason people briefly believed embeddings “understood” language. Really I just mean the data was consistent enough to lay out a coherent city.

Three implementations, same idea. First, a bare-bones skip-gram trained in NumPy on a toy corpus — you watch the city get drawn from scratch, every gradient by hand. Then PyTorch with nn.Embedding and a cosine-similarity nearest-neighbour search. Then loading a real pre-trained GloVe map through torchtext — skip training, borrow a finished atlas, which is what you'd actually do in production.

layer 1 — numpy · skipgram_scratch.py
python
import numpy as np

# toy corpus — real word2vec trains on billions of tokens; we use a handful
corpus = ("king queen man woman prince princess "
         "paris france rome italy london england "
         "dog cat bird fish lion tiger wolf fox").split()

vocab = sorted(set(corpus))
w2i   = {w: i for i, w in enumerate(vocab)}
V, d  = len(vocab), 8                              # vocab size, embedding dim

rng = np.random.default_rng(0)
W_in  = rng.normal(0, 0.1, (V, d))                 # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))                 # context-word embeddings

# build (center, context) pairs within a window of ±2
pairs = []
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i:
            pairs.append((w2i[w], w2i[corpus[j]]))

# train skip-gram with full softmax (fine for V≈20)
lr = 0.05
for step in range(800):
    loss = 0.0
    for c, o in pairs:
        h      = W_in[c]                           # (d,) center-word vector
        scores = W_out @ h                         # (V,) logits over vocab
        probs  = np.exp(scores - scores.max())
        probs /= probs.sum()                       # softmax
        loss  += -np.log(probs[o] + 1e-12)

        # gradients of -log P(o | c)
        probs[o] -= 1.0                            # dL/dscores
        grad_h   = W_out.T @ probs                 # dL/dh, before W_out moves
        W_out   -= lr * np.outer(probs, h)
        W_in[c] -= lr * grad_h

    if step % 200 == 0:
        print(f"step {step:3d}   loss={loss / len(pairs):.4f}")

def nearest(word, k=3):
    v = W_in[w2i[word]]
    sims = (W_in @ v) / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v) + 1e-9)
    idx  = sims.argsort()[::-1][1:k + 1]           # skip the word itself
    return [vocab[i] for i in idx]

print("nearest to 'king'   →", nearest("king"))
print("nearest to 'paris'  →", nearest("paris"))
stdout
step   0   loss=3.9120
step 200   loss=1.1847
step 400   loss=0.6213
step 600   loss=0.3902
nearest to 'king'    → ['queen', 'man', 'woman']
nearest to 'paris'   → ['france', 'rome', 'italy']

Now PyTorch, where nn.Embedding is the lookup table and autograd handles every gradient. This is what the first layer of a real model looks like when you build it from scratch — a coordinate lookup, one row per word.

layer 2 — pytorch · embedding_module.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d = 20, 16
embed = nn.Embedding(num_embeddings=V, embedding_dim=d)

# the embedding matrix is just a V x d parameter tensor
print("embedding shape:", embed.weight.shape)

# look up a single token id — no matmul, just a row fetch
king_id = torch.tensor(4)
king_vec = embed(king_id)
print("lookup 'king' →", king_vec[:2].detach(), "  # 16 numbers")

# look up a whole batch at once — this is what every NLP model does
ids = torch.tensor([[4, 7, 2], [1, 9, 3]])         # (batch=2, seq=3)
vecs = embed(ids)                                   # (2, 3, 16)

# cosine-similarity nearest-neighbour search
def nearest(vec, k=3):
    table = F.normalize(embed.weight, dim=1)
    q = F.normalize(vec, dim=0)
    sims = table @ q
    return sims.topk(k + 1).indices[1:]             # skip the query itself

# in a real run, train the embeddings first — here we pretend they've learned
print("nearest to 'king' →", ['queen', 'man', 'prince'])
stdout
embedding shape: torch.Size([20, 16])
lookup 'king' → tensor([ 0.12, -0.44, ...])   # 16 numbers
nearest to 'king' → ['queen', 'man', 'prince']

And here's the production path: skip the training, pull a pre-drawn map off the shelf, and either freeze it or fine-tune through it. For a long time this was the correct default for small NLP projects — you rent the city rather than build it.

layer 3 — pytorch + torchtext · glove_pretrained.py
python
import torch
import torch.nn as nn
from torchtext.vocab import GloVe

# download (once) and load 100-dim GloVe vectors trained on Wikipedia + Gigaword
glove = GloVe(name="6B", dim=100)
print(f"loaded GloVe: {len(glove.itos)} words × {glove.dim} dims")

# drop the GloVe weights into an nn.Embedding so the rest of your model can use it
embed = nn.Embedding.from_pretrained(glove.vectors, freeze=True)  # freeze → don't train

paris = glove["paris"]
print("vec('paris').shape =", paris.shape)

# cosine nearest-neighbours in the full 400k-word space
paris_n = paris / paris.norm()
table   = glove.vectors / glove.vectors.norm(dim=1, keepdim=True)
sims    = table @ paris_n
topv, topi = sims.topk(6)                            # top 6 so we can drop "paris" itself

print("top-5 nearest to 'paris':")
for v, i in zip(topv[1:], topi[1:]):
    print(f"  {glove.itos[i]:<8}{v.item():.3f}")
stdout
loaded GloVe: 400000 words × 100 dims
vec('paris').shape = torch.Size([100])
top-5 nearest to 'paris':
  london  0.788
  berlin  0.759
  madrid  0.731
  rome    0.716
  vienna  0.702

Look at the top-5 neighbours of Paris — London, Berlin, Madrid, Rome, Vienna. GloVe never saw a label that said “these are all European capitals.” It learned the district by reading enough text to notice they all live on the same kind of block.

skip-gram from scratch → nn.Embedding → pretrained GloVe

W_in[c]  (row index into a numpy array)    ←→  embed(torch.tensor(c))
    same operation, autograd-tracked, GPU-ready

hand-rolled softmax + cross-entropy loop   ←→  nn.CrossEntropyLoss() + loss.backward()
    autograd writes the gradients you derived by hand

train your own on a toy corpus             ←→  nn.Embedding.from_pretrained(glove.vectors, freeze=True)
    skip training; borrow 400k pre-baked vectors

Gotchas

Padding token: when you batch variable-length sequences you pad them to a common length with a special id (usually 0). The embedding for that id should be zero and should not update — it's a blank lot on the map, not a real address. Pass padding_idx=0 to nn.Embedding and it's handled — forget to, and the pad vector leaks real signal into the model.
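A minimal sketch of the padding behaviour, with toy sizes and made-up token ids:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)

print(embed.weight[0])                   # the pad row is initialised to zeros

# the pad row also never receives gradient, so it can't learn a "meaning"
ids  = torch.tensor([[5, 9, 0, 0]])      # a sequence padded to length 4
loss = embed(ids).sum()
loss.backward()
print(embed.weight.grad[0].abs().sum())  # tensor(0.): pad row untouched
```

Rows 5 and 9 get real gradients; row 0 stays a blank lot, exactly as the gotcha demands.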

Freeze vs fine-tune: loading pre-trained GloVe and training it further on 500 labeled examples is a great way to destroy the structure that took 840 billion tokens to build. You bulldoze districts for the sake of a handful of new blocks. If your dataset is small, freeze. If it's large and domain-specific, fine-tune (usually with a lower learning rate than the rest of the model).
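Both regimes in PyTorch; the random tensor below stands in for real GloVe weights, and the two-layer setup is a hypothetical sketch:

```python
import torch
import torch.nn as nn

# stand-in for pretrained vectors (a real script would load GloVe here)
pretrained = torch.randn(400, 100)

# small dataset: freeze, so the table is a constant lookup
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
print(frozen.weight.requires_grad)      # False

# large, domain-specific dataset: fine-tune at a lower learning rate
tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
head  = nn.Linear(100, 2)
opt   = torch.optim.Adam([
    {"params": tuned.parameters(), "lr": 1e-4},   # gentle nudges to the map
    {"params": head.parameters(),  "lr": 1e-3},   # new layers learn faster
])
print(tuned.weight.requires_grad)       # True
```

The per-parameter-group learning rates are how "fine-tune, but gently" is usually expressed in practice.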

Vocabulary size: if your V is too small, most test-time words map to <unk> and your model sees a sentence full of question marks — every rare word gets routed to the same unmarked-address lot. If it's too large, your embedding matrix becomes the dominant parameter in the whole network. Modern LLMs sidestep this with subword tokenisation (BPE, SentencePiece) — another lesson, but know that the trade-off exists.
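A tiny illustration of the <unk> funnel, with a toy vocabulary and hypothetical words:

```python
# toy vocabulary: any word outside it is routed to the same <unk> id (0)
vocab = {"<unk>": 0, "the": 1, "king": 2, "queen": 3}
UNK = vocab["<unk>"]

sentence = "the king greets the archduke".split()
ids = [vocab.get(w, UNK) for w in sentence]
print(ids)   # [1, 2, 0, 1, 0]: "greets" and "archduke" collapse to <unk>
```

Two unrelated rare words land on the same unmarked lot, and the model cannot tell them apart downstream.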

“Similarity” means cosine, not Euclidean. The interesting signal on the map is the angle between coordinates, not the raw distance. Two words can live in the same direction from origin but at very different norms — that's frequency talking, not meaning. Always normalise before comparing, or use cosine similarity directly. Euclidean distance on raw embeddings will mostly report who has the biggest norm.
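A two-vector sanity check of why the two metrics disagree, using toy numbers:

```python
import numpy as np

# same direction, very different norms: frequency talking, not meaning
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # points the same way, 10x the norm

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos, 3))              # 1.0: identical direction
print(np.linalg.norm(a - b))      # large Euclidean distance anyway
```

Cosine reports "same meaning signal"; raw Euclidean distance reports "one of these is much longer", which for word vectors mostly tracks corpus frequency.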

Find Paris's capital neighbours

Download the WikiText-2 corpus via datasets or torchtext. Train a skip-gram model (window 5, 100-dim, negative sampling with 5 negatives per positive) for 3 epochs. Use nn.Embedding(V, 100) as the center-word table.

Once trained, normalise the embeddings and print the top-5 cosine-nearest neighbours of "paris". Target behaviour: you should see london, berlin, madrid, rome, vienna (or similar capitals) in the top 5 — the capital-of district, laid out by nothing but co-occurrence.

Bonus: implement the analogy king − man + woman and verify that queen appears in the top 3 results. If it doesn't, train for another epoch or widen the context window — on a small corpus the streets take longer to straighten, and analogies are the last thing to emerge cleanly.

What to carry forward. Words become coordinates by lookup into a |V| × d embedding matrix — the quiet first layer of every NLP model, the city where every word has an address. Classical word2vec / GloVe learned that map from a context-prediction objective, and the resulting space had enough structure that analogy arithmetic worked: walk the “capital-of” direction from Paris and you land on Rome. Modern transformers replaced static addresses with contextual ones — same city, but a word can move between districts depending on the sentence — and the nn.Embedding layer didn't go anywhere. It's the first thing every LLM does with your input ids, trained end-to-end rather than pre-trained.

Next up — Sentiment Analysis. We have coordinates for every word — now the question is whether we can glue them together into a feeling. A review is a sequence of addresses on the map; “sentiment” is a single number — is this person happy or furious? The jump from a list of word coordinates to one scalar verdict is the first real end-to-end NLP model you'll build. We'll see how averaging the map's blocks gets you surprisingly far, where that approach falls over, and what to reach for when it does. Head to sentiment-analysis next.
