Ship the model — make it fast, cheap, and production-ready
Why lower precision is an (almost) free lunch.
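To make the "free lunch" concrete, here is a minimal round trip through per-tensor absmax int8 quantization in NumPy; the tensor shape and the absmax scheme are illustrative choices, not the method of any particular library. Memory drops 4× from fp32 while the reconstruction error stays small.

```python
import numpy as np

# Post-training absmax quantization of one fp32 weight matrix to int8.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)

scale = np.abs(w).max() / 127.0                       # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale             # what inference sees

print(f"memory: {w.nbytes / 2**20:.0f} MiB fp32 -> {w_int8.nbytes / 2**20:.0f} MiB int8")
print(f"mean abs error: {np.abs(w - w_deq).mean():.6f}")
```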
Post-training quantization and quantization-aware training (QAT), in detail.
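One way to see how the two differ: post-training quantization rounds the finished weights, while QAT inserts "fake quantization" into the forward pass so the network learns to live with the rounding error. A minimal PyTorch sketch of that fake-quant op with a straight-through estimator, assuming per-tensor symmetric int8:

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Quantize-dequantize the weights in the forward pass so training
    # already sees the rounding error it will face after deployment.
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: the forward value is w_q, but the
    # gradient w.r.t. w is 1, as if no rounding had happened.
    return w + (w_q - w).detach()

# Usage inside a layer's forward pass (sketch):
# y = x @ fake_quantize(self.weight).t() + self.bias
```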
A small model drafts, the big model verifies — 2-3× faster.
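A sketch of that draft-and-verify loop, in its simplest greedy-matching form rather than the full rejection-sampling scheme; `draft_lm` and `target_lm` are hypothetical callables that return the argmax next-token id for a token sequence.

```python
def speculative_decode(target_lm, draft_lm, prompt, k=4, max_new_tokens=64):
    """Greedy speculative decoding sketch (not any specific library's API)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_lm(tokens + draft))
        # 2. The big model checks the proposals; in a real system all k+1
        #    verification positions are scored in a single forward pass.
        accepted = []
        for i in range(k):
            if target_lm(tokens + accepted) == draft[i]:
                accepted.append(draft[i])
            else:
                break
        tokens += accepted
        # 3. The big model always emits the next token itself, so the
        #    output is identical to plain greedy decoding with it alone.
        tokens.append(target_lm(tokens))
    return tokens[:target_len]
```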
The throughput trick that makes production LLMs economical.
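The trick referred to here is presumably continuous (in-flight) batching: finished requests leave the batch and queued ones join at every decode step, so the GPU never sits idle waiting for the longest sequence in a static batch. A toy scheduler loop under that assumption, with a hypothetical `step_fn` standing in for the model:

```python
from collections import deque

def serve(step_fn, requests, max_batch=8):
    """Toy continuous-batching loop (illustrative, not a real server).

    step_fn(batch) is a hypothetical function that advances every active
    request by one token and returns the subset that just finished.
    """
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit queued requests into free slots at every step, instead of
        # waiting for the whole batch to drain as static batching does.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        finished = step_fn(active)            # one decode step for everyone
        done.extend(r for r in active if r in finished)
        active = [r for r in active if r not in finished]
    return done
```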
vLLM's virtual-memory-inspired KV cache.
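That mechanism is PagedAttention: keys and values live in fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to those blocks, much like an OS page table. A toy sketch of the bookkeeping only (class and method names are made up for illustration; the actual key/value tensors inside each block are omitted):

```python
class PagedKVCacheSketch:
    """Toy block-table allocation in the spirit of vLLM's PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.block_table = {}                 # seq_id -> [physical block ids]
        self.length = {}                      # seq_id -> tokens stored so far

    def append_token(self, seq_id) -> None:
        # A new physical block is claimed only when the previous one fills
        # up, so no sequence reserves cache memory it is not yet using.
        n = self.length.get(seq_id, 0)
        if n % self.block_size == 0:
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def release(self, seq_id) -> None:
        # Finished sequences return their blocks to the shared pool, which
        # is what lets many requests share one GPU without fragmentation.
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)
```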