Quality/Performance Knobs: Expert Count And Hybrid Quantization
Sources: 1 • Confidence: High • Updated: 2026-03-25 17:54
Key takeaways
- The post expresses uncertainty about output-quality impacts, noting that the claim that 2-bit quantization is indistinguishable from 4-bit rests on only thinly described evaluations.
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, even though the model occupies about 209GB on disk (about 120GB after quantization).
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
Sections
Quality/Performance Knobs: Expert Count And Hybrid Quantization
- The post expresses uncertainty about output-quality impacts, noting that the claim that 2-bit quantization is indistinguishable from 4-bit rests on only thinly described evaluations.
- A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
- In the described setup, expert weights are quantized to 2-bit while non-expert components such as embeddings and routing matrices remain at original precision, totaling about 5.5GB resident in memory during inference.
- In the described implementation, the number of experts used per token was reduced from Qwen 3.5's usual 10 to 4.
- The post claims that the largest quality drop occurred at 3 experts per token (relative to higher expert counts).
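The hybrid scheme's resident footprint can be sanity-checked with back-of-envelope arithmetic. The parameter counts below are illustrative assumptions, not figures from the post (the post states only that about 5.5GB stays resident):

```python
def resident_bytes(non_expert_params, active_expert_params,
                   non_expert_bits=16, expert_bits=2):
    """Estimate bytes resident in RAM for a hybrid-precision MoE:
    non-expert components (embeddings, routing matrices) stay at
    original precision, while only the currently active 2-bit
    experts are held (the rest stream from SSD on demand)."""
    bits = non_expert_params * non_expert_bits + active_expert_params * expert_bits
    return bits // 8

# Assumed (hypothetical) split: ~2.5B shared/non-expert params at
# 16-bit plus ~4B params of active experts at 2-bit.
print(resident_bytes(2_500_000_000, 4_000_000_000) / 1e9)  # 6.0 (GB)
```

The result lands in the same ballpark as the roughly 5.5GB the post reports, which is the point of the exercise: the resident set is dominated by the full-precision non-expert weights, not the 2-bit experts.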
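Reducing experts-per-token is, in a typical MoE router, just a smaller top-k over the gate scores. A minimal NumPy sketch of that mechanism (function and variable names are assumed for illustration, not taken from the repo):

```python
import numpy as np

def route(gate_logits, k=4):
    """Select the top-k experts per token and renormalize their
    mixture weights. Qwen-style routers typically use k=10; the
    variant described in the post drops this to k=4."""
    topk = np.argsort(gate_logits, axis=-1)[:, -k:]           # expert ids, (tokens, k)
    scores = np.take_along_axis(gate_logits, topk, axis=-1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
    w /= w.sum(axis=-1, keepdims=True)                        # weights over kept experts
    return topk, w

logits = np.random.randn(3, 64)             # 3 tokens, 64 experts
experts, weights = route(logits, k=4)
print(experts.shape)                        # (3, 4)
```

Fewer selected experts means fewer weight matrices to page in per token, which is why this knob trades quality directly against I/O and compute.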
Flash-Streamed Limited-Memory Inference For Oversized Models
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, even though the model occupies about 209GB on disk (about 120GB after quantization).
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
- Dan Woods reportedly used techniques described in Apple's 2023 paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory."
- The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model that reduces transfer volume and favors larger contiguous reads.
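The on-demand paging idea above can be sketched as an LRU cache over expert weights, with requests for consecutive experts coalesced into one contiguous read. This is a deliberate simplification of the Apple paper's cost model, and the flat one-file layout is an assumption made for illustration:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of MoE expert weights paged in from flash on demand.
    Assumes experts sit back-to-back in a single file, so runs of
    consecutive ids can be fetched with one contiguous read (larger
    sequential reads amortize SSD latency better than many small ones)."""

    def __init__(self, path, expert_bytes, capacity):
        self.f = open(path, "rb")
        self.expert_bytes = expert_bytes
        self.capacity = capacity          # should exceed experts-per-token
        self.cache = OrderedDict()        # expert id -> weight bytes

    def fetch(self, expert_ids):
        for i in expert_ids:              # refresh LRU order on hits
            if i in self.cache:
                self.cache.move_to_end(i)
        missing = sorted(i for i in set(expert_ids) if i not in self.cache)
        run = []                          # coalesce consecutive ids into one read
        for i in missing + [None]:
            if run and (i is None or i != run[-1] + 1):
                self.f.seek(run[0] * self.expert_bytes)
                blob = self.f.read(len(run) * self.expert_bytes)
                for j, eid in enumerate(run):
                    self._put(eid, blob[j * self.expert_bytes:(j + 1) * self.expert_bytes])
                run = []
            if i is not None:
                run.append(i)
        return {i: self.cache[i] for i in expert_ids}

    def _put(self, eid, weights):
        self.cache[eid] = weights
        self.cache.move_to_end(eid)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

# Toy file: 8 "experts" of 4 bytes each.
with open("experts.bin", "wb") as fh:
    fh.write(bytes(range(32)))
cache = ExpertCache("experts.bin", expert_bytes=4, capacity=4)
w = cache.fetch([2, 3, 7])   # experts 2 and 3 arrive via one contiguous read
print(w[2])                  # b'\x08\t\n\x0b'
```

The real implementation adds the paper's cost model (deciding what to keep resident versus re-read) and GPU-side buffers, but the hot path is the same shape: look up, coalesce, read, evict.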
AI-Assisted Performance Engineering Workflow And Reproducibility Artifacts
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- The repository danveloper/flash-moe reportedly contains the resulting code and a PDF paper (mostly written by Claude Opus 4.6) documenting the experiment.
Unknowns
- What exact benchmarking protocol produced the reported tokens-per-second (prompt length, generation length, warm vs cold cache, first-token latency, inclusion/exclusion of paging and compilation overhead)?
- What is the task-level quality impact of reducing experts-per-token (including comparisons across 3/4/6/10 experts) using standardized evaluations and disclosed prompts/datasets?
- How does 2-bit vs 4-bit (and hybrid precision choices) affect not just general text quality but structured reliability (tool calling, JSON/schema adherence, function selection) across a broad suite of tools and schemas?
- How much of the memory footprint and performance comes from the Apple-style flash paging cost model versus MoE sparsity alone, and what are the ablation results?
- What hardware dependencies exist (SSD throughput/latency, thermals, power limits) and how well does the approach generalize across different machines with different storage and memory hierarchies?
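One way to make the tool-calling question above measurable is to score each model output on whether it parses as JSON, names a known tool, and supplies every required argument. A minimal harness sketch; the tool registry and example outputs are invented for illustration:

```python
import json

def tool_call_ok(output, tools):
    """Return True iff `output` is valid JSON naming a registered tool
    and supplying all of that tool's required arguments."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    spec = tools.get(call.get("name"))
    if spec is None:
        return False                       # unknown or missing tool name
    return set(spec["required"]) <= set(call.get("arguments", {}))

# Hypothetical registry with one tool requiring a "city" argument.
tools = {"get_weather": {"required": ["city"]}}
print(tool_call_ok('{"name": "get_weather", "arguments": {"city": "Oslo"}}', tools))  # True
print(tool_call_ok('{"name": "get_weather", "arguments": {}}', tools))                # False
```

Running a check like this over many tools and schemas at each quantization level (2-bit, 4-bit, hybrid) would turn the post's anecdotal "2-bit broke tool calling" into a reportable failure rate.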