Ship the model — make it fast, cheap, and production-ready
Why lower precision is an (almost) free lunch.
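To make the "free lunch" concrete, here is a minimal round trip through per-tensor absmax int8 quantization in NumPy; the tensor shape and the absmax scheme are illustrative choices, not the method of any particular library. Memory drops 4× from fp32 while the reconstruction error stays small.

```python
import numpy as np

# Post-training absmax quantization of one fp32 weight matrix to int8.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)

scale = np.abs(w).max() / 127.0                       # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale             # what inference sees

print(f"memory: {w.nbytes / 2**20:.0f} MiB fp32 -> {w_int8.nbytes / 2**20:.0f} MiB int8")
print(f"mean abs error: {np.abs(w - w_deq).mean():.6f}")
```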
Post-training quantization and quantization-aware training (QAT), in detail.
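One way to see how the two differ: post-training quantization rounds the finished weights, while QAT inserts "fake quantization" into the forward pass so the network learns to live with the rounding error. A minimal PyTorch sketch of that fake-quant op with a straight-through estimator, assuming per-tensor symmetric int8:

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Quantize-dequantize the weights in the forward pass so training
    # already sees the rounding error it will face after deployment.
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: the forward value is w_q, but the
    # gradient w.r.t. w is 1, as if no rounding had happened.
    return w + (w_q - w).detach()

# Usage inside a layer's forward pass (sketch):
# y = x @ fake_quantize(self.weight).t() + self.bias
```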
A small model drafts, the big model verifies — 2-3× faster.
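A sketch of that draft-and-verify loop, in its simplest greedy-matching form rather than the full rejection-sampling scheme; `draft_lm` and `target_lm` are hypothetical callables that return the argmax next-token id for a token sequence.

```python
def speculative_decode(target_lm, draft_lm, prompt, k=4, max_new_tokens=64):
    """Greedy speculative decoding sketch (not any specific library's API)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_lm(tokens + draft))
        # 2. The big model checks the proposals; in a real system all k+1
        #    verification positions are scored in a single forward pass.
        accepted = []
        for i in range(k):
            if target_lm(tokens + accepted) == draft[i]:
                accepted.append(draft[i])
            else:
                break
        tokens += accepted
        # 3. The big model always emits the next token itself, so the
        #    output is identical to plain greedy decoding with it alone.
        tokens.append(target_lm(tokens))
    return tokens[:target_len]
```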
The throughput trick that makes production LLMs economical.
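The trick referred to here is presumably continuous (in-flight) batching: finished requests leave the batch and queued ones join at every decode step, so the GPU never sits idle waiting for the longest sequence in a static batch. A toy scheduler loop under that assumption, with a hypothetical `step_fn` standing in for the model:

```python
from collections import deque

def serve(step_fn, requests, max_batch=8):
    """Toy continuous-batching loop (illustrative, not a real server).

    step_fn(batch) is a hypothetical function that advances every active
    request by one token and returns the subset that just finished.
    """
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit queued requests into free slots at every step, instead of
        # waiting for the whole batch to drain as static batching does.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        finished = step_fn(active)            # one decode step for everyone
        done.extend(r for r in active if r in finished)
        active = [r for r in active if r not in finished]
    return done
```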
vLLM's virtual-memory-inspired KV cache.
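That mechanism is PagedAttention: keys and values live in fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to those blocks, much like an OS page table. A toy sketch of the bookkeeping only (class and method names are made up for illustration; the actual key/value tensors inside each block are omitted):

```python
class PagedKVCacheSketch:
    """Toy block-table allocation in the spirit of vLLM's PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.block_table = {}                 # seq_id -> [physical block ids]
        self.length = {}                      # seq_id -> tokens stored so far

    def append_token(self, seq_id) -> None:
        # A new physical block is claimed only when the previous one fills
        # up, so no sequence reserves cache memory it is not yet using.
        n = self.length.get(seq_id, 0)
        if n % self.block_size == 0:
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def release(self, seq_id) -> None:
        # Finished sequences return their blocks to the shared pool, which
        # is what lets many requests share one GPU without fragmentation.
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)
```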