Outliers As A Quantization Bottleneck And Failure Mode
Sources: 1 • Confidence: Medium • Updated: 2026-03-27 10:09
Key takeaways
- The root cause of quantization-relevant outlier weights is not conclusively known.
- The accuracy impact of quantization can be evaluated using perplexity and KL divergence alongside benchmark runs such as GPQA, as demonstrated with llama.cpp tooling on Qwen 3.5 9B across quantization levels.
- Sam Rose published an interactive essay explaining how quantization of large language models works.
- One practical outlier-handling approach in quantization is to leave outlier values unquantized so they do not degrade an entire quantization block.
- Some individual outlier weights are critical enough that removing a single one can cause a model to produce nonsensical output.
Sections
Outliers As A Quantization Bottleneck And Failure Mode
- The root cause of quantization-relevant outlier weights is not conclusively known.
- Some individual outlier weights are critical enough that removing a single one can cause a model to produce nonsensical output.
- Rare, large-magnitude outlier weights can disproportionately degrade quantization outcomes relative to the bulk of small-magnitude weights.
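The block-degradation mechanism above can be sketched numerically. This is a minimal illustration using simple absmax round-to-nearest quantization (an assumption for clarity, not the specific scheme used by llama.cpp): a single outlier stretches the block's shared scale, so every other weight in the block loses precision.

```python
import numpy as np

def quantize_block_absmax(block: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize a block with one shared absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    scale = np.abs(block).max() / qmax      # one scale for the whole block
    q = np.round(block / scale).clip(-qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, size=32)        # typical small-magnitude weights
outlier_block = block.copy()
outlier_block[0] = 1.5                      # one outlier stretches the scale

err_plain = np.abs(quantize_block_absmax(block) - block).mean()
# Measure the error only on the *non-outlier* weights: they are the casualties.
err_outlier = np.abs(
    quantize_block_absmax(outlier_block)[1:] - outlier_block[1:]
).mean()
print(f"mean error without outlier: {err_plain:.5f}")
print(f"mean error of the other weights with an outlier present: {err_outlier:.5f}")
```

With the outlier present, the shared scale grows roughly 30x, and most of the small weights round to zero.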
Measurement Approach For Quantization Tradeoffs
- The accuracy impact of quantization can be evaluated using perplexity and KL divergence alongside benchmark runs such as GPQA, as demonstrated with llama.cpp tooling on Qwen 3.5 9B across quantization levels.
- Moving from 16-bit to 8-bit quantization is reported to produce almost no model quality penalty.
- Moving from 16-bit to 4-bit quantization is reported to be more noticeable, but may retain roughly 90% of original quality, depending on the metric.
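The two metrics above can be sketched from raw logits. This is a toy illustration (the helper names and toy data are assumptions; real measurements would use llama.cpp's perplexity tooling over an actual corpus): perplexity scores the quantized model against held-out tokens, while KL divergence compares its per-token distribution directly against the full-precision model's.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """exp(mean negative log-likelihood of the true next tokens)."""
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

def mean_kl(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean per-token KL(P_ref || P_quant); 0 means identical outputs."""
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

# Toy data: 5 positions over a 10-token vocabulary.
rng = np.random.default_rng(1)
fp16_logits = rng.normal(size=(5, 10))
quant_logits = fp16_logits + rng.normal(scale=0.05, size=(5, 10))  # small noise
targets = rng.integers(0, 10, size=5)
print(f"PPL(quant) = {perplexity(quant_logits, targets):.3f}")
print(f"mean KL    = {mean_kl(fp16_logits, quant_logits):.5f}")
```

KL divergence is useful precisely because it does not need the quantized model to be "right", only to match the full-precision reference, so it isolates quantization damage from task difficulty.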
New Quantization Explainer Resource
- Sam Rose published an interactive essay explaining how quantization of large language models works.
- The essay contains a notably clear visual explanation of how floating-point numbers are represented in binary.
Outlier-Preserving Quantization Schemes
- One practical outlier-handling approach in quantization is to leave outlier values unquantized so they do not degrade an entire quantization block.
- Another practical outlier-handling approach is to store outlier positions and values separately from the main quantized representation to protect block quality.
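The second approach above can be sketched as follows. This is a generic illustration (the threshold and layout are assumptions, not any specific published scheme): outliers are pulled out and stored as full-precision (index, value) pairs, so the block's shared scale is set only by the well-behaved bulk.

```python
import numpy as np

def split_quantize(block: np.ndarray, bits: int = 4, threshold: float = 0.5):
    """Quantize the bulk of a block; store outliers separately, unquantized."""
    qmax = 2 ** (bits - 1) - 1
    outlier_idx = np.flatnonzero(np.abs(block) > threshold)
    outlier_vals = block[outlier_idx]              # kept in full precision
    bulk = block.copy()
    bulk[outlier_idx] = 0.0                        # exclude from the scale
    bulk_max = np.abs(bulk).max()
    scale = bulk_max / qmax if bulk_max > 0 else 1.0
    q = np.round(bulk / scale).clip(-qmax, qmax).astype(np.int8)
    return q, scale, outlier_idx, outlier_vals

def dequantize(q, scale, outlier_idx, outlier_vals):
    out = q.astype(np.float32) * scale
    out[outlier_idx] = outlier_vals                # restore exact outliers
    return out

rng = np.random.default_rng(2)
block = rng.normal(0, 0.02, size=32)
block[3] = 1.5                                     # plant one outlier
q, scale, idx, vals = split_quantize(block)
recon = dequantize(q, scale, idx, vals)
print(f"max reconstruction error: {np.abs(recon - block).max():.5f}")
```

The tradeoff is extra storage and an irregular memory access pattern for the side table, which is one axis of the quality-latency-memory question raised under Unknowns below.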
Unknowns
- Under what architectures, layers, and training regimes do quantization-relevant outlier weights arise, and what mechanisms produce them?
- How reproducible is the claimed catastrophic behavior from removing a single critical outlier weight across models and tasks?
- What are the exact boundary conditions for the reported minimal quality loss from 16-bit to 8-bit quantization (model type, tasks, safety behaviors, and sampling settings)?
- What does 'roughly 90% quality' at 4-bit mean across different quality axes (task accuracy, human preference, safety), and how does it vary with metric choice?
- Which outlier-handling technique (exempting outliers vs separate storage) yields better quality-latency-memory tradeoffs under different deployment constraints?