Quantization Basics
Why lower precision is an (almost) free lunch.
A trained model is a symphony score. Every weight is a note written out in fp32 — four bytes of painstaking notation, every rest and accidental pinned to the nanosecond. It took a small orchestra two weeks and a fortune in GPU time to write this score down. Now you want to ship it.
A 7-billion-parameter full score weighs 7 × 10⁹ × 4 = 28 GB before you've loaded a single KV cache or activation buffer. It doesn't fit on one GPU. It doesn't fit on a MacBook. It barely fits in a conversation about deployment cost.
Here's the inconvenient question: does the audience actually hear the difference between nanosecond-precise notation and millisecond-precise notation? The notes you trained so carefully almost all sit inside a narrow band — say [-0.5, 0.5] — and fp32 is wasting most of its dynamic range on orders of magnitude you'll never touch. What if we snapped each note to the nearest of 256 keys on a much smaller keyboard and stored a single float that says how wide each key is? That's 8 bits per note. The file is a quarter the size. Most listeners won't notice.
This is quantization: the JPEG for weights, the lo-fi mp3 of neural networks. You start with a pristine fp32 score and end with a compressed file that plays almost the same music. You'll learn the quantize/dequantize equations, the precision ladder from fp32 down to int4, the difference between symmetric and asymmetric, per-tensor and per-channel, and why a few outlier notes can single-handedly ruin the recording.
I am the art of saying “close enough.” Your 32-bit weights pretend to have seven decimal digits of precision — a score marked with accidentals no ear on earth can tell apart. I ask: could you get by with two? Usually, the answer is yes, and the file just got four times smaller.
Here is the whole idea in two equations, and two new characters. The scale is the tuning fork: it decides how far apart two adjacent keys on our little keyboard are in float space — how many floats does one integer step cover? The zero-point is middle C: the integer value that corresponds to the float 0.0, the note everything else is measured against. Together they define a simple affine map between floats and integers — the sheet music for our compressed score.
```
quantize(x)   = clamp( round(x / s) + z, q_min, q_max )
dequantize(q) = s · (q − z)

s = (x_max − x_min) / (q_max − q_min)   // scale
z = q_min − round(x_min / s)            // zero-point
```

For int8 the keyboard has 256 keys, labelled [-128, 127]. If your float range is [-0.5, 0.5], the scale — the spacing between adjacent keys — is 1/255 ≈ 0.00392. Every note gets rounded to the nearest marked key. The rounding error is at most s/2: that's the compression tax, the hiss you pay for the smaller file.
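That s/2 bound is easy to check directly. A quick pure-Python round trip over the whole [-0.5, 0.5] range, assuming s = 1/255 and z = 0 as in the example:

```python
# Round-trip a dense grid of floats through the affine map and confirm
# the sawtooth error never exceeds half a scale step.
s, z = 1 / 255, 0
xs = [i / 1000 - 0.5 for i in range(1001)]      # 1001 points across [-0.5, 0.5]

def roundtrip(x):
    q = max(-128, min(127, round(x / s) + z))   # quantize + clamp
    return s * (q - z)                          # dequantize

worst = max(abs(x - roundtrip(x)) for x in xs)
print(worst <= s / 2 + 1e-12)                   # True: error stays within s/2
```

The bound is tight: at the edges of the range the error is exactly s/2, which is where the sawtooth in the widget peaks.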
The widget below lets you feel this. Drag the range, watch the continuous float axis get chopped into 256 int8 buckets, and see the quantize then dequantize error curve — a little sawtooth, bounded by half a scale step.
Two things to notice. First, the error is uniform — no note is more than half a key off. Second, the scale determines the fidelity. A tight float range gives a tiny scale and low error — a well-tempered scale, notes almost indistinguishable from the original. A wide float range spreads the same 256 keys over a larger span and the error grows. This is why outliers are so dangerous: one huge weight — a single rogue fortissimo — stretches the range, inflates the scale, and now all your normal notes are being rounded with too coarse a step.
I am the tuning fork. I pack your floats into a fixed number of integer bins. Wider float range, wider bins, coarser rounding. Bring me outliers and I'll ruin every other note to accommodate them. Bring me a tight distribution and I'll compress it almost losslessly.
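The tuning fork's complaint is easy to verify with a sketch: absmax int8 on a synthetic weight vector with one planted outlier, then the same with the outlier clipped first. The clip threshold of 0.5 here is an illustrative choice, not a library default:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, 1000)
w[0] = 5.0                                     # one rogue fortissimo

def absmax_roundtrip(x, clip_at=None):
    # Symmetric absmax int8: one scale derived from the largest |weight|.
    if clip_at is not None:
        x = np.clip(x, -clip_at, clip_at)      # tame the outlier first
    s = np.abs(x).max() / 127
    return s * np.round(x / s)

# Measure the error on the *normal* weights only (index 0 is the outlier).
err_raw  = np.abs(w[1:] - absmax_roundtrip(w)[1:]).max()
err_clip = np.abs(w[1:] - absmax_roundtrip(w, clip_at=0.5)[1:]).max()
print(f"scale set by outlier : {err_raw:.5f}")   # step 5/127, coarse for everyone
print(f"outlier clipped      : {err_clip:.5f}")  # step 0.5/127, roughly 10x finer
```

One weight at 5.0 sets the step to 5/127 for all thousand weights; clipping it first shrinks the step by an order of magnitude at the cost of misrepresenting a single value.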
Quantization isn't a single switch — it's a ladder of audio codecs. At the top, the uncompressed studio master; at the bottom, a 4-bit ringtone. Each rung trades precision for memory and speed. Here's the whole ladder for a 7B model, in memory and multiply-per-cycle terms:
```
dtype   bits   bytes/weight   7B weights   relative speed (A100)
─────   ────   ────────────   ──────────   ─────────────────────
fp32     32        4           28 GB        1.0 × (baseline)
fp16     16        2           14 GB        ≈ 2 ×
bf16     16        2           14 GB        ≈ 2 ×
int8      8        1            7 GB        ≈ 2–4 × (tensor cores)
int4      4        0.5          3.5 GB      ≈ 4–8 × (with kernels)
```
fp16 and bf16 both use 16 bits but split them differently. fp16 has a 5-bit exponent and 10-bit mantissa — good precision, narrow range (overflow at ~65504). bf16 has an 8-bit exponent and 7-bit mantissa — same dynamic range as fp32, less precision. For training and inference of large models, bf16 is the modern default: the extra range prevents the overflow headaches that plagued fp16 mixed-precision training circa 2018.
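You can feel both halves of that trade-off in NumPy, which implements IEEE fp16 natively (bf16 needs torch or the ml_dtypes package, so it appears only in a comment here):

```python
import numpy as np

# fp16: 5-bit exponent, so the largest finite value is 65504.
print(np.float16(65504.0))              # 65504.0, still finite
print(np.float16(65520.0))              # inf: rounds past the ceiling
# bf16 would store ~65520 with no trouble (8-bit exponent, fp32-sized range),
# but its 7-bit mantissa would round it to a nearby coarser value.

# fp16: 10-bit mantissa, so roughly 3 decimal digits of precision.
print(np.float16(1.0) + np.float16(0.0001))   # 1.0: the tiny addend vanishes
```

This is exactly the 2018 mixed-precision headache in miniature: fp16 loses gradients to overflow at the top of its range, while bf16 trades mantissa bits for the same dynamic range as fp32.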
int8 is the workhorse mp3 of deployment. Four times smaller than the fp32 master, with < 1% quantization error on typical weight distributions. Both NVIDIA tensor cores and modern CPU AVX have dedicated int8 math units — you get a speedup on top of the memory savings.
int4 is the aggressive end of the compression dial — the lo-fi ringtone. Half as big as int8, quantization error in the 2–5% range. You don't get here with a naive quantize step — you need calibration, clever block-wise scales, or quantization-aware training. This is the regime of QLoRA, GGUF Q4_K_M, AWQ, and GPTQ.
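The simplest of those tricks, block-wise scales, can be sketched in a few lines. This is a toy illustration of the idea, not the actual AWQ or GPTQ algorithm; the group size of 32 is an assumption borrowed from common 4-bit formats:

```python
import numpy as np

def int4_roundtrip(w, group):
    # Symmetric absmax int4 with one float scale per `group` consecutive weights.
    q_max = 2 ** (4 - 1) - 1                       # 7: int4 keys run [-8, 7]
    blocks = w.reshape(-1, group)
    s = np.abs(blocks).max(axis=1, keepdims=True) / q_max
    q = np.round(blocks / s).clip(-q_max - 1, q_max)
    return (s * q).reshape(-1)                     # dequantized reconstruction

rng = np.random.default_rng(1)
w = rng.normal(0, 0.1, 1024)
err_tensor = np.abs(w - int4_roundtrip(w, group=1024)).mean()  # one scale overall
err_block  = np.abs(w - int4_roundtrip(w, group=32)).mean()    # one scale per 32
print(f"per-tensor int4 mean error : {err_tensor:.5f}")
print(f"block-wise int4 mean error : {err_block:.5f}")         # noticeably smaller
```

Each block of 32 gets its own tuning fork, so a locally wide block only coarsens itself. The price is storage for one extra float per block, a few tenths of a bit per weight.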
Slide through the rungs. This is the “what you can hear vs what you can't” part of the lesson — the whole economic argument for quantization compressed into one slider. At fp16/bf16 the quality gap vs the fp32 master is barely measurable; the audiophile with the $4000 headphones can't tell. At int8, a well-calibrated model loses < 1% accuracy on most benchmarks — radio-quality mp3, fine for a commute. At int4 the story depends on the codec: naive round-to-nearest sounds like it's playing underwater, but AWQ and GPTQ are the FLAC-of-lossy — they listen to which notes matter and spend their bits there. Same file size, far cleaner sound.
I am middle C — the integer that means “zero.” Symmetric quantization pins me at 0 and calls it a day: the score is centered around me and the keys fan out above and below. Asymmetric quantization lets me slide — if your data lives in [0, 6], I'll sit at −128 and let the integer range shift to match. I am the offset that turns a range into a range-with-direction.
Time to implement it. We'll quantize a tensor three ways — pure Python on a list of floats, NumPy per-channel, and PyTorch with the first-class torch.ao.quantization hooks. Same affine map, three progressively more production-shaped implementations. Same sheet music, three different recording studios.
```python
def quantize(x, s, z, q_min=-128, q_max=127):
    q = round(x / s) + z
    return max(q_min, min(q_max, q))    # clamp into the int8 range

def dequantize(q, s, z):
    return s * (q - z)                  # inverse affine map

# Symmetric int8 on a small weight vector.
weights = [-0.42, -0.11, 0.0, 0.23, 0.51]
x_max = max(abs(w) for w in weights)    # 0.51
s = x_max / 127                         # scale
z = 0                                   # symmetric — zero-point pinned
qs = [quantize(w, s, z) for w in weights]
ds = [dequantize(q, s, z) for q in qs]
err = max(abs(w - d) for w, d in zip(weights, ds))

print("weights        =", weights)
print("quantized int8 =", qs)
print("dequantized    =", [round(d, 4) for d in ds])
print("max abs error  =", round(err, 4))
```

```
weights        = [-0.42, -0.11, 0.0, 0.23, 0.51]
quantized int8 = [-105, -27, 0, 57, 127]
dequantized    = [-0.4217, -0.1084, 0.0, 0.2289, 0.51]
max abs error  = 0.0017
```
Vectorise it. Real models quantize per output channel — one scale per row of the weight matrix, one tuning fork per orchestral section — which means we need per-channel max, per-channel scale, and broadcasting. NumPy makes this five lines.
```python
import numpy as np

def quantize_per_channel(W, bits=8):
    # W: (out_features, in_features) — one scale per row (output channel).
    q_max = 2 ** (bits - 1) - 1                        # 127 for int8
    x_max = np.max(np.abs(W), axis=1, keepdims=True)   # (out, 1)
    s = x_max / q_max                                  # scale per row
    q = np.round(W / s).clip(-q_max - 1, q_max).astype(np.int8)
    return q, s

def dequantize_per_channel(q, s):
    return s * q.astype(np.float32)                    # broadcasts back

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32) * 0.3   # small, zero-centered
q, s = quantize_per_channel(W)
W_hat = dequantize_per_channel(q, s)

print("W shape            :", W.shape)
print("per-channel scales :", np.round(s.ravel(), 5))
print("max abs error      :", round(np.max(np.abs(W - W_hat)), 4))
```

```
W shape            : (4, 8)
per-channel scales : [0.00698 0.00372 0.00516 0.00291]
max abs error      : 0.0035
```
- for w in weights: quantize(w, s, z) ←→ np.round(W / s).clip(-128, 127).astype(int8) — vectorised, one scale per row via broadcasting
- x_max = max(abs(w) for w in weights) ←→ np.max(np.abs(W), axis=1, keepdims=True) — per-channel max along the in-features axis
- single scalar scale ←→ s.shape == (out_features, 1) — per-channel: one scale per output row
PyTorch ships the whole pipeline behind torch.ao.quantization. You get QuantStub / DeQuantStub modules, observer objects that collect min/max during calibration, and a convert() call that swaps every float module for its int8 twin. This is what production code looks like — you rarely roll your own.
```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # float → int8 at the boundary
        self.fc1 = nn.Linear(128, 256)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = tq.DeQuantStub()  # int8 → float on the way out

    def forward(self, x):
        x = self.quant(x)
        x = self.act(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

# 1) Train or load an fp32 model.
model_fp32 = MLP().eval()

# 2) Attach a per-channel int8 config and insert observers.
model_fp32.qconfig = tq.get_default_qconfig("fbgemm")  # per-channel weights, per-tensor acts
tq.prepare(model_fp32, inplace=True)

# 3) Calibrate: run a handful of real batches so observers learn the activation ranges.
with torch.no_grad():
    for _ in range(32):
        model_fp32(torch.randn(64, 128))

# 4) Convert: swap every float module for its int8 equivalent.
model_int8 = tq.convert(model_fp32, inplace=False)

# Compare the int8 model against its own fp32 parent on a fresh batch.
x = torch.randn(16, 128)
with torch.no_grad():
    y_fp = model_fp32(x)   # the prepared model still runs fp32 math; observers pass through
    y_int = model_int8(x)
print("max |Δy| :", float((y_fp - y_int).abs().max()))
```

On a run like this the int8 state dict comes out roughly 4× smaller (weights packed as int8, biases kept in fp32), and the printed drift on a single batch should be small, typically well under 1% of the output's magnitude.
- manual np.max(abs(W)) calibration ←→ tq.prepare() inserts observers — PyTorch tracks min/max automatically during a calibration pass
- W_q, s = quantize_per_channel(W) ←→ tq.convert() swaps modules — one call replaces every nn.Linear with its int8 twin
- dequant = s * q ←→ DeQuantStub at the model boundary — internal ops stay int8; only inputs/outputs round-trip to float
Outliers dominate the scale. One weight at 5.0 in a distribution that's otherwise in [-0.5, 0.5] stretches the scale by 10×. Every other note now has 10× the quantization error — one fortissimo has forced the whole recording into a coarser mp3. Fix: clip outliers before quantizing, or use per-channel scales so the outlier only ruins one channel.
Clamping range mismatched to the data. If your observer saw activations in [0, 6] during calibration and production hits [0, 20], everything above 6 saturates. Calibrate on data that looks like production.
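What that saturation looks like numerically, in a minimal sketch assuming an asymmetric int8 observer fitted to [0, 6]:

```python
import numpy as np

s, z = 6.0 / 255, -128                 # asymmetric int8 calibrated on [0, 6]
x = np.array([1.0, 5.0, 8.0, 20.0])    # production activations exceed the range
q = np.clip(np.round(x / s) + z, -128, 127)
x_hat = s * (q - z)
print(x_hat)   # ≈ [0.99, 4.99, 6.0, 6.0]: both 8.0 and 20.0 flatten to 6.0
```

Everything inside the calibrated range round-trips with sub-percent error; everything above it collapses onto the ceiling, and the model downstream can no longer tell 8 from 20.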
Accumulate in int32, not int8. A matmul is a sum of products. Two int8s multiply into a value that needs int16, and summing a thousand of those overflows int16, let alone int8, almost immediately. Every real int8 kernel accumulates into int32, then requantizes the result. If you hand-roll this and forget, your output will be nonsense.
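A minimal NumPy demonstration of the accumulator problem (int16 stands in for the too-narrow accumulator, since a thousand int8 products overflow it just as surely as int8):

```python
import numpy as np

a = np.full(1024, 100, dtype=np.int8)             # activations
w = np.full(1024, 100, dtype=np.int8)             # weights
prods = a.astype(np.int32) * w.astype(np.int32)   # each product is 10_000

ok  = prods.sum(dtype=np.int64)                   # wide accumulator: 10_240_000
bad = prods.astype(np.int16).sum(dtype=np.int16)  # narrow accumulator wraps around
print("wide accumulator  :", ok)
print("int16 accumulator :", bad)                 # garbage after modular wraparound
```

NumPy wraps integer overflow silently, which is exactly what a hand-rolled kernel would do: no error, just a wrong dot product.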
Symmetric quantization on activations. ReLU outputs are all ≥ 0. Using a symmetric range wastes half your integer codes on negative values that never appear. Use asymmetric for activations, symmetric for weights.
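The wasted-codes argument in numbers, a minimal sketch assuming ReLU activations that span [0, 6]:

```python
import numpy as np

acts = np.linspace(0.0, 6.0, 1001)     # ReLU outputs: all non-negative

# Symmetric: the range must cover [-6, 6], so half the 256 codes go unused.
s_sym = 6.0 / 127
err_sym = np.abs(acts - s_sym * np.round(acts / s_sym)).max()

# Asymmetric: range [0, 6], zero-point at -128, all 256 codes in play.
s_asym = 6.0 / 255
z = -128
q = np.clip(np.round(acts / s_asym) + z, -128, 127)
err_asym = np.abs(acts - s_asym * (q - z)).max()

print(f"symmetric  max error: {err_sym:.5f}")   # step 6/127
print(f"asymmetric max error: {err_asym:.5f}")  # step 6/255, about half
```

Same 8 bits, half the error, just by letting middle C slide to the bottom of the range.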
Train a small MLP on MNIST in PyTorch — two hidden layers of 256, one fp32 epoch, any optimiser you like. Record the test accuracy; call it acc_fp32. That's the studio master.
Now quantize the trained model to int8 per-channel using torch.ao.quantization with a calibration pass of 512 training examples. Re-measure accuracy on the test set; call it acc_int8. That's the mp3. On MNIST you should see a drop of well under 0.5 percentage points — the audience can't tell.
Bonus: swap per-channel for per-tensor and re-run. The accuracy hit is usually small on MNIST but noticeably larger — you've just felt, numerically, why the default in every modern library is per-channel.
Extra bonus: measure model file size with torch.save() on both. Expect the int8 state dict to be roughly four times smaller, which is the entire economic argument for quantization in one line of ls -lh.
What to carry forward. Quantization is the JPEG for weights — an affine map from floats to integers, governed by a scale (the tuning fork) and a zero-point (middle C for the integer keyboard). fp16/bf16 is nearly free; int8 with per-channel scales is the production sweet spot, the 192-kbps mp3 of neural nets; int4 needs smarter codecs (AWQ / GPTQ / GGUF) to preserve accuracy. Outliers dominate the scale, so real systems clip, per-channel, or calibrate away from the tails.
Next up — INT8 & INT4 Quantization. We sketched the ladder; next lesson we climb it, going ever lower. The exact math of int8 GEMMs (int32 accumulators, the requantize step), the specific tricks AWQ and GPTQ use to survive at 4 bits, and a working pipeline you can run on a downloaded Llama-3 checkpoint on your laptop. Same affine map, much sharper teeth — the lo-fi mp3 taken down to a ringtone that still sounds like music.
- [01] Krishnamoorthi — Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper · 2018
- [02] Jacob, Kligys, Chen, Zhu, Tang, Howard, Adam, Kalenichenko — Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference · CVPR 2018
- [03] Dettmers, Lewis, Belkada, Zettlemoyer — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (bitsandbytes) · arXiv 2208.07339 · 2022
- [04] torch.ao.quantization — PyTorch quantization API documentation
- [05] Lin, Tang, Tang, Yang, Chen, Wang, Wang, Xiao, Dang, Han — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · 2023
- [06] Frantar, Ashkboos, Hoefler, Alistarh — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · 2022