INT8 & INT4 Quantization

Post-training quantization and QAT, in detail.

Hard · ~15 min read · lesson 2 of 5

The question on the table is simple and slightly reckless: how low can you go? Picture a climber on a cliff face, roped in at the top and working their way down. At every ledge they ask the same question — do I drop another rung, or do I stop here? Each rung down is one bit dropped from the weight. The view gets cheaper — smaller file, less memory, more throughput — but the holds get sketchier. Step too low and the rock crumbles. That's this lesson.

An fp16 Llama-70B weight file is 140 GB. That does not fit on your GPU, or your friend's GPU, or probably any GPU you've ever met. Drop to int8 and it's 70 GB — tight but plausible on a single 80 GB A100. Drop to int4 and it's 35 GB, which fits across a pair of consumer 4090s — or on one 48 GB card — with room for a KV cache and a browser tab. Two rungs lower on the cliff, same model still talking.

Quantization is the art of storing weights (and sometimes activations) in fewer bits than the training pipeline used. Fewer bits means less memory, less bandwidth, and — on hardware that knows what to do with integers — more throughput. The catch is that 8-bit integers have 256 possible values, 4-bit integers have 16, and a trained transformer has weights that really, really want to be 16-bit floats. Squeeze them wrong and the model produces word salad.
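Stripped of the tricks, the mechanical core is two lines: pick a scale, round, multiply back. A minimal symmetric per-tensor int8 round-trip in numpy (illustrative, not a production kernel):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale          # what the matmul actually sees

w = np.random.default_rng(0).normal(0, 0.02, 4096).astype(np.float32)
q, s = quantize_int8(w)

err = np.abs(w - dequantize(q, s)).max()
print(f"{w.nbytes} B fp32 -> {q.nbytes} B int8, max error {err:.1e}, step {s:.1e}")
```

The worst-case round-trip error is half a quantization step, which is exactly why a single huge weight hurts everyone: it inflates the step for the whole tensor.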

This lesson is about squeezing them right. We'll cover the two families of quantization (PTQ, cheap and fast; QAT, expensive and best). Then the three algorithms that actually show up in production today: LLM.int8() for the outlier problem, GPTQ for Hessian-based error compensation, and AWQ for activation-aware scaling. By the end, turning a 140 GB model into a 35 GB model should feel routine rather than alarming.

Quantization (personified)
I round your weights to the nearest integer. That's it. The interesting part is that if you round naively — uniform steps, no per-channel scales, no outlier protection — your 70B model becomes a random n-gram generator. Everything below is how to round without breaking things.

Two philosophies. You can either quantize after training (treat it as a compression step applied to a finished model) or quantize during training (make the network learn with the quantization error already baked in).

  • Post-Training Quantization (PTQ) takes a trained fp16 checkpoint, runs a tiny calibration pass over ~256 samples to estimate activation ranges, and writes out an int8 or int4 version. No gradient updates. Minutes to hours. This is what LLM.int8(), GPTQ, and AWQ all are — PTQ methods with increasingly clever tricks.
  • Quantization-Aware Training (QAT) inserts fake-quantize ops into the forward pass during training. The network sees the rounding error on every step and learns weights that survive it. Best quality, roughly the cost of another training run. Rare for 70B-scale LLMs because nobody wants to re-pretrain.

For large language models PTQ won by default — the pretraining bill is too big to redo. The rest of this lesson is PTQ.
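Before leaving QAT behind, the fake-quantize op is worth seeing once. Below is a toy QAT step with a straight-through estimator: the forward pass snaps a weight to the int4 grid, the backward pass pretends the rounding has gradient 1 and updates a latent full-precision weight. Everything here (the scalar least-squares model, the grid step of 0.1, the learning rate) is invented for illustration:

```python
import numpy as np

def fake_quant(w, scale):
    # forward: snap to the int4 grid; backward (STE): pretend the gradient is 1
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=512)
y = 0.31 * x                      # target weight 0.31 -- not on the int4 grid
w, scale, lr = 0.0, 0.1, 0.05     # grid points are multiples of 0.1

for _ in range(80):
    w_q  = fake_quant(w, scale)            # forward pass uses the rounded weight
    grad = np.mean(2 * (w_q * x - y) * x)  # dL/dw_q for L = mean((w_q*x - y)^2)
    w   -= lr * grad                       # STE: apply it to the latent fp weight

print(f"latent w = {w:.3f}, deployed = {fake_quant(w, scale):.1f}")
```

The latent weight drifts toward the target while the deployed weight stays pinned to the nearest grid point: the network learns around the rounding rather than being surprised by it at export time.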

Back to the cliff. The climber stepped down from fp16 to int8 — roughly a shoulder-width drop to a solid ledge. Most models land here with negligible loss and nobody thinks about it again. That's the good news. The bad news: it took a specific trick to make that ledge stable, and the trick comes from Dettmers 2022. If you look at the activation tensors inside a 6B+ transformer, a handful of hidden dimensions have values roughly 20× larger than everything else. They look like spikes poking out of a flat plain. Uniformly quantizing that tensor to int8 means the 256 levels get stretched to cover the spikes — and the flat plain, which is 99% of the information, gets squashed into three or four levels. The model is now rounding the signal, not the noise.

The fix is called mixed-precision decomposition. Identify the outlier columns (say, any column whose max absolute activation exceeds 6.0), pull them out, run them in fp16. Run the other 99% of columns in int8. Concatenate the results at the end.

LLM.int8() — outlier decomposition
Y  =  X · Wᵀ

     =  X_fp16[:, O]  ·  W_fp16[:, O]ᵀ          ← outlier columns, full precision
        +
        dequant( X_int8[:, R] · W_int8[:, R]ᵀ, s_X · s_W )   ← the rest, int8

where  O = { i :  max |X[:, i]|  >  6.0 }
       R = { 0, …, d−1 } \ O
       s_X, s_W  =  per-row (X) and per-column (W) scaling factors
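A numpy sketch of the decomposition, under invented assumptions (two synthetic outlier channels, threshold 6.0, and W stored as (d_in, d_out) so the outlier columns of X index rows of W):

```python
import numpy as np

def quantize_int8(a, axis):
    s = np.abs(a).max(axis=axis, keepdims=True) / 127
    return np.round(a / s), s

rng = np.random.default_rng(0)
N, d = 64, 512
X = rng.normal(size=(N, d)).astype(np.float32)
X[:, [3, 8]] *= 20.0                           # two synthetic outlier channels
W = (rng.normal(size=(d, d)) * 0.05).astype(np.float32)

O = np.abs(X).max(axis=0) > 6.0                # outlier columns of X = rows of W
R = ~O

# int8 path: per-row scales on X, per-column scales on W
Xq, sx = quantize_int8(X[:, R], axis=1)
Wq, sw = quantize_int8(W[R, :], axis=0)
Y = X[:, O] @ W[O, :] + (Xq @ Wq) * (sx * sw)  # fp path + dequantized int8 path
err = np.linalg.norm(Y - X @ W) / np.linalg.norm(X @ W)

# naive per-tensor int8 for comparison: the outliers set the scale for everyone
sxn = np.abs(X).max() / 127
swn = np.abs(W).max() / 127
Yn = (np.round(X / sxn) @ np.round(W / swn)) * (sxn * swn)
err_naive = np.linalg.norm(Yn - X @ W) / np.linalg.norm(X @ W)

print(f"{int(O.sum())} outlier cols, decomposed err {err:.4f} vs naive {err_naive:.4f}")
```

Routing two columns around the int8 path keeps the relative error small; forcing them through it stretches the grid over the spikes and the naive error is markedly worse at the same bit budget.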
Interactive demo: outliers and quantization — toggle the strategy and watch the MSE collapse. A toy 200×12 weight matrix with outlier channels 3 and 8; the view compares original weights against their dequantized round-trip, with channel magnitudes (max |w| per column) shown per strategy.

MSE by strategy:
  • naive per-tensor — 9.68e-5. One scale for every weight: a single 4.0 outlier forces scale ≈ 4/127, which quantizes the 99% of small weights to just a handful of codes. MSE is dominated by lost precision.
  • per-channel — 1.59e-5
  • mixed-precision — 1.53e-6
  • SmoothQuant — 1.59e-5

Drag the threshold. At 6.0 you route ~0.1% of the columns to fp16 and the model is within 1% perplexity of the fp16 baseline. Lower the threshold and more columns go to fp16 — higher quality, less compression. Raise it and the int8 path has to absorb the spikes, which corrupts the quantization grid for everything else. The sweet spot is narrow and worth finding.

LLM.int8() is the default load_in_8bit=True flag in bitsandbytes. It is the reason you can run a 175B model on a 4-way A100 node with no retraining, no calibration, and no perplexity you'd notice.

Outlier channel (personified)
I'm one hidden dimension out of 12,288 and I run 20× hotter than my peers. Transformers love me — I carry position-y, attention-sink-y, BOS-ish information the rest of the network relies on. Round me to int8 and you flatten me. Round me to fp16 and we all live. Give me the floats.

LLM.int8() keeps the int8 ledge stable. It does not help when you want to keep dropping — at 4 bits, the quantization grid is so coarse that even well-behaved weights lose information the network actually needed. This is the fingertip crack: the climber can hang here, but they need to place every hold deliberately. You need to compensate for the rounding error as you go.

GPTQ (Frantar et al. 2023) does this one layer at a time. Quantize column j of the weight matrix. That introduces an error in the layer's output. Distribute that error across the remaining unquantized columns by nudging them slightly — using the layer's Hessian to figure out which nudges hurt the output least. Then quantize the next column. By the time every column is quantized, the accumulated error has been partially absorbed by weights that hadn't been touched yet.

GPTQ — Hessian-based update for the remaining columns
Per-layer objective   argmin_Ŵ   ‖ X W  −  X Ŵ ‖²

Hessian            H     =  2 · Xᵀ X

For each column j (in order):
  1. quantize:      ŵ_j   =  quant(w_j)
  2. error:         e_j   =  (w_j  −  ŵ_j) / [H⁻¹]_{j,j}
  3. update rest:   w_k  ←  w_k  −  e_j · [H⁻¹]_{j,k}     for k > j
  4. mark j done, move on.
Interactive demo: GPTQ column-by-column — Hessian-ordered, error-compensated. A toy d = 12 int4 weight matrix; columns are quantized in descending sensitivity order (1 / [H⁻¹]_{j,j}, most sensitive first), and each step applies the compensation update to the not-yet-quantized columns while a running MSE and a columns-done counter track progress.

Step through column-by-column. Notice how every time you quantize a column, the still-unquantized columns (to the right) shift slightly. Those shifts are the error compensation — the algorithm is pre-baking future quantization errors into the current weights so the layer's overall output doesn't drift. At the end the weights are all int4 and the layer output barely moved.

Two things to internalize. First, GPTQ is per-layer — the Hessian is estimated from calibration data passed through up to that layer, not the whole network. Second, inverting the Hessian is the expensive part, but you only do it once per layer, and there's a Cholesky trick that makes it tractable for d ≈ 12,000 dimensions. Frantar's implementation quantizes a 175B model in about four GPU-hours.
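The Cholesky point is easy to demo: factor the SPD Hessian once, then recover its inverse from two triangular solves. Real implementations go further (they work with a Cholesky factor of H⁻¹ and process columns in blocks); this only shows the factor-then-solve idea:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
X = rng.normal(size=(1024, d))
H = 2 * X.T @ X + 1e-2 * np.eye(d)     # SPD Hessian, as in the GPTQ objective

Hinv_dense = np.linalg.inv(H)          # what the layer-1 sketch does: heavy

# factor once (H = L L^T), then two triangular solves: same O(d^3) asymptotics,
# roughly half the flops and better numerical behaviour on an SPD matrix
L = np.linalg.cholesky(H)
Hinv_chol = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(d)))

print(np.allclose(Hinv_dense, Hinv_chol))
```

On a well-conditioned Hessian the two routes agree to machine precision; the win is cost and stability, not a different answer.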

GPTQ (personified)
I'm a greedy algorithm with regret. Every column I quantize, I apologise to the columns I haven't touched yet by adjusting them. By the time I finish the matrix, the error I introduced early has been absorbed by edits I made later. The Hessian tells me how to apologise most efficiently.

Time for the bit-budget reveal — the whole cliff face laid out, rung by rung. Here's what each drop actually costs and buys.

  • fp16 / bf16 — the top of the cliff. 2 bytes per weight, no quality loss because this is what the model was trained in. 16 bits; 65,536 possible values per weight.
  • int8 — one rung down, a wide solid ledge. 2× smaller than fp16, nearly lossless with LLM.int8(). 256 possible values per weight. Every modern GPU has native int8 Tensor Cores. The no-brainer default.
  • int4 — another rung lower, a fingertip crack. 4× smaller than fp16. 16 possible values per weight. Needs GPTQ or AWQ to keep quality; typical perplexity hit is 0.1–0.3 on WikiText. The standard for consumer and local inference.
  • int3, int2, binary — free-solo territory. 8 values, 4 values, 2 values. Quality degrades sharply, and consumer hardware doesn't have a fast kernel for 3-bit anything. You'll see papers but not production deployments. It works — if you know what you're doing, and if you accept that the bottom of the cliff is a different sport.

In 2024–2026 serving reality: fp16 is the training/baseline, int8 is standard for production inference, and int4 (GPTQ, AWQ, or GGUF) is common for anything that has to run on a laptop or a single consumer card. GGUF in particular — the format used by llama.cpp — has a zoo of mixed-precision schemes (Q4_K_M, Q5_K_S, Q8_0, etc.) that pack different bit widths for different tensors in the same file.
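The bit-budget arithmetic is worth having as a one-liner. Weights only, ignoring KV cache and runtime overhead:

```python
def vram_gb(params_b, bits):
    """Weight memory only -- KV cache, activations, and CUDA context are extra."""
    return params_b * 1e9 * bits / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("int3", 3)]:
    print(f"Llama-70B @ {name:4s}: {vram_gb(70, bits):6.2f} GB")

# group-wise scales cost extra: one fp16 scale per 128 weights at group_size=128
print(f"Llama-70B @ int4 g128: {vram_gb(70, 4 + 16 / 128):6.2f} GB")
```

Note the group-size line: per-group fp16 scales push "int4" to roughly 4.125 bits per weight, which is why a 4-bit checkpoint on disk is always a little bigger than params/2 bytes.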

All of these PTQ methods — LLM.int8(), GPTQ, AWQ — need calibration data. Not to train anything, just to estimate activation ranges and (for GPTQ) the layer Hessian. Typical size is 128–512 samples. The dataset matters more than the size: calibrate a code model on Wikipedia and you'll get a model that quantizes its most confident domain badly.

In practice: use a subset of the pretraining mix, or a held-out chunk of your target inference distribution. C4 and WikiText are the standard defaults for research; for production, whatever you'll actually be serving is better.

Three layers. A numpy GPTQ sketch on a single linear layer, to see the update rule working on something tiny. Then bitsandbytes for LLM.int8() in one flag. Then auto-gptq for real int4 quantization of a transformer.

layer 1 — numpy · gptq_sketch.py
python
import numpy as np

def quantize_int4(w, scale):
    q = np.clip(np.round(w / scale), -8, 7)       # 4-bit signed, [-8, 7]
    return q * scale                              # dequantized fp value

# Single linear layer — X @ W, shapes (N, d) × (d, d)
np.random.seed(0)
N, d = 256, 64
X = np.random.randn(N, d).astype(np.float32)
W = np.random.randn(d, d).astype(np.float32) * 0.1

# Per-column scale (standard int4 choice).
scale = np.max(np.abs(W), axis=0) / 7

# --- naive int4: just round every column independently ---------------
W_naive = np.stack([quantize_int4(W[:, j], scale[j]) for j in range(d)], axis=1)

# --- GPTQ: quantize column-by-column, compensate with Hessian -------
H    = X.T @ X + 1e-3 * np.eye(d)                 # (d, d), SPD
Hinv = np.linalg.inv(H)
W_q  = W.copy()
for j in range(d):
    w_j     = W_q[:, j].copy()
    w_q_j   = quantize_int4(w_j, scale[j])
    err     = (w_j - w_q_j) / Hinv[j, j]
    W_q[:, j] = w_q_j
    # distribute error to remaining columns via Hessian row
    W_q[:, j+1:] -= np.outer(err, Hinv[j, j+1:])

def frob(A): return np.linalg.norm(A)
print(f"fp16 output norm : {frob(X @ W):.4f}")
print(f"int4 naive  norm : {frob(X @ W_naive):.4f}   "
      f"(error = {frob(X @ (W - W_naive)):.4f})")
print(f"int4 gptq   norm : {frob(X @ W_q):.4f}   "
      f"(error = {frob(X @ (W - W_q)):.4f})")
stdout
fp16 output norm : 4.8213
int4 naive  norm : 5.1740   (error = 0.3527)
int4 gptq   norm : 4.8301   (error = 0.0088)

Naive int4 rounding blows the output norm off by ~7%. GPTQ's column-by-column compensation brings it back to within 0.2%. Same bit budget, same weights, much better model — purely because we spread the rounding error across columns that hadn't been rounded yet.

Now the production version. bitsandbytes takes any HuggingFace model and converts it to LLM.int8() in one flag. No calibration code, no custom kernels.

layer 2 — pytorch + bitsandbytes · load_int8.py
python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

# LLM.int8() — outlier decomposition handled internally.
# 7B model drops from ~13 GB to ~7 GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,                  # the whole trick
    device_map="auto",                  # split across GPUs if needed
)
tok = AutoTokenizer.from_pretrained(model_id)

x = tok("Quantization is", return_tensors="pt").to(model.device)
out = model.generate(**x, max_new_tokens=20)
print(tok.decode(out[0]))

And int4 via auto-gptq. One calibration pass over 128 samples, then save a quantized checkpoint that loads like any other HuggingFace model.

layer 3 — pytorch + auto-gptq · quantize_int4.py
python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

model_id = "meta-llama/Llama-2-7b-hf"
out_dir  = "llama2-7b-gptq-int4"

qcfg = BaseQuantizeConfig(
    bits=4,                            # int4
    group_size=128,                    # per-group scales — standard choice
    desc_act=False,                    # column-order heuristic; False is faster
)

tok   = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, qcfg)

# ~128 calibration samples; C4 is the community default for GPTQ.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
stream = iter(c4)                      # one iterator, so each sample is distinct
calib = [tok(next(stream)["text"][:2048], return_tensors="pt")
         for _ in range(128)]

model.quantize(calib)                  # Hessian + column-wise compensation inside
model.save_quantized(out_dir)          # 7B model → ~3.5 GB on disk
tok.save_pretrained(out_dir)
numpy sketch → bitsandbytes → auto-gptq, the same ideas line by line:

  • W_q[:, j+1:] -= np.outer(err, Hinv[j, j+1:])  ↔  model.quantize(calib) — the loop lives inside; you supply data, it does the math.
  • manual per-column scale  ↔  group_size=128 — production uses per-group scales: finer than per-tensor, cheaper than per-column.
  • np.linalg.inv(H)  ↔  internal Cholesky on a block-diagonal H — naive inversion is O(d³); auto-gptq uses a Cholesky trick to scale.

One more return to the cliff, because this is where naive int4 gets people killed. You've dropped a rung past the int8 ledge onto the fingertip crack, and the holds look fine. Then a single outlier weight flicks your boot off. Here's the failure mode, step by step.

Int4 has 16 buckets. Your per-tensor scale is set by the maximum absolute weight. In a well-behaved matrix that maximum is maybe 0.3, every bucket covers ~0.04, and most weights land with a quantization error well under a percent. Now add one outlier weight at 5.0. The maximum just jumped 16×. Every bucket now covers ~0.6. Your well-behaved weights, the ones doing the actual work, are all being rounded to either 0 or ±0.6. The signal is gone. The model is now a random text generator. One bit too low, one bad hold, and the climber is at the bottom of the cliff.
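That failure mode takes two assignments and a subtraction to reproduce: a clean Gaussian weight vector versus the same vector with one planted 5.0 outlier (both synthetic):

```python
import numpy as np

def int4_roundtrip(w):
    scale = np.abs(w).max() / 7          # per-tensor scale set by the max weight
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, 4096)             # well-behaved: max |w| around 0.35
mse_clean = np.mean((w - int4_roundtrip(w)) ** 2)

w_out = w.copy()
w_out[0] = 5.0                           # one outlier weight
mse_out = np.mean((w_out - int4_roundtrip(w_out)) ** 2)

print(f"clean   mse: {mse_clean:.2e}")
print(f"outlier mse: {mse_out:.2e}  ({mse_out / mse_clean:.0f}x worse)")
```

The outlier itself quantizes perfectly (it sits at the top of the grid); the damage is entirely to everyone else, whose buckets just got 16× wider.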

The three techniques from this lesson are three different ways to not fall. LLM.int8() pulls the outlier columns out and runs them in fp16 — the climber leaves their fingertips on the crack but clips the pro onto a higher anchor. GPTQ keeps rounding naively, then nudges the remaining weights to compensate — lower the boot, but shift your weight to a different hold on the way down. AWQ pre-scales the salient channels before they get rounded so they land on more of the int4 grid — chalk up before you even touch the rock. All three exist because naive int4 on a raw transformer doesn't work.
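A toy version of the AWQ move, with a hand-picked scale (real AWQ searches per-channel scales from activation statistics; the salient channels and the 4.0 here are invented). Scaling salient rows of W up before rounding is mathematically free, because the inverse scale folds into X: X W == (X / sᵀ)(s W).

```python
import numpy as np

def int4_percol(w):
    s = np.abs(w).max(axis=0, keepdims=True) / 7   # per-output-column scale
    return np.clip(np.round(w / s), -8, 7) * s

rng = np.random.default_rng(0)
N, d = 128, 256
X = rng.normal(size=(N, d))
X[:, :8] *= 20.0                      # salient input channels: big activations
W = rng.normal(size=(d, d)) * 0.1
Y_ref = X @ W

err_plain = np.linalg.norm(X @ int4_percol(W) - Y_ref)

# AWQ idea: scale salient rows of W up before rounding, fold 1/s into X
s = np.ones((d, 1)); s[:8] = 4.0      # hand-picked; AWQ searches for this
err_awq = np.linalg.norm((X / s.T) @ int4_percol(s * W) - Y_ref)

print(f"plain int4 output error: {err_plain:.2f}, awq-scaled: {err_awq:.2f}")
```

The scaled rows land on more of the int4 grid, so the channels the activations actually amplify carry less rounding error, at the cost of a slightly coarser grid for everyone else.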

Gotchas

Calibration data mismatch: calibrate a code model on Wikipedia and its perplexity on code will crater. The activation ranges estimated during calibration are the only thing the quantizer knows about your data. Use samples from the actual inference domain.

Quantizing activations without outlier handling: works fine for CNNs and small transformers. For 6B+ LLMs the outlier channels break the grid and the model degenerates into repetition loops. If you're rolling your own int8 activation quantization, you must do mixed-precision decomposition.

Expecting speed gains without hardware support: int4 on a GPU that has no native int4 kernel is often slower than fp16, because every matmul dequantizes first. Know what your hardware actually accelerates: Ampere and newer do int8 natively; Hopper and newer do fp8; int4 speedups come from bespoke kernels (Marlin, ExLlama) not the standard libraries.

Not re-quantizing after fine-tuning: if you QLoRA-finetune an int4 base and then merge the adapters, the merged weights are fp16 again. Ship the int4 + adapter separately, or re-quantize after merging.

Int4 a 7B model and measure the damage

Quantize Llama-7B to int4 with GPTQ using the auto-gptq snippet above. Calibrate on 128 samples of C4. Save the checkpoint.

Load both the fp16 model and the int4 model. Compute perplexity on the first 40k tokens of WikiText-2 test split for each. Report:

  • fp16 perplexity
  • int4 perplexity (and the delta)
  • VRAM used at inference time, fp16 vs int4
  • tokens/second for a 256-token generation, fp16 vs int4

Expect a perplexity delta somewhere around +0.1 to +0.3, a VRAM drop of ~4×, and tokens/sec that depends heavily on whether you're using a kernel like Marlin or ExLlama. If your perplexity delta is above +1.0, your calibration set is probably mismatched — try a different subset.

What to carry forward. Quantization is rounding, and rounding is a problem when the thing being rounded has a long-tailed distribution. LLM.int8() handles that by routing the tails to fp16. GPTQ handles that by distributing rounding error across columns that haven't been touched yet, weighted by a Hessian that tells it which distribution hurts output least. AWQ handles that by scaling salient weights before quantizing so they land on more of the grid. All three need calibration data from the distribution you actually care about. All three are PTQ — no retraining required — which is why they won at the scale of modern LLMs. Int8 is the ledge, int4 is the crack, lower is a sport.

Next up — Speculative Decoding. Quantization makes the model smaller so the same GPU can hold more of it. Speculative decoding makes the model faster by having a tiny draft model propose several tokens at a time while the big model verifies them in parallel. A 2–3× throughput win with zero quality loss, because the big model still has the final say on every token. Same goal (“serve this model cheaper”), completely different mechanism — no more rungs to drop; instead, a faster climb.
