Rosa Del Mar

Daily Brief

Issue 77 2026-03-18

Quality/Performance Knobs: Expert Count And Hybrid Quantization

General
Sources: 1 • Confidence: High • Updated: 2026-03-25 17:54

Key takeaways

  • The post expresses uncertainty about output-quality impact, noting that the claim that 2-bit quantization is indistinguishable from 4-bit is supported only by thinly described evaluations.
  • Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
  • Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.

Sections

Quality/Performance Knobs: Expert Count And Hybrid Quantization

  • The post expresses uncertainty about output-quality impact, noting that the claim that 2-bit quantization is indistinguishable from 4-bit is supported only by thinly described evaluations.
  • A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
  • In the described setup, expert weights are quantized to 2-bit while non-expert components such as embeddings and routing matrices remain at original precision, totaling about 5.5GB resident in memory during inference.
  • In the described implementation, the number of experts used per token was reduced from Qwen 3.5's usual 10 to 4.
  • The post claims that the largest quality drop occurred at 3 experts per token (relative to higher expert counts).
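The two knobs described above can be sketched in a few lines of NumPy: uniform 2-bit group quantization applied only to expert weights (router and embeddings stay at full precision), and top-k routing reduced from the usual 10 experts to 4. This is a minimal illustration under those assumptions; `quantize_2bit` and `route_top_k` are hypothetical names, not taken from the flash-moe repository.

```python
import numpy as np

def quantize_2bit(w, group=32):
    """Uniform 2-bit group quantization: each group of `group` weights
    shares a float min and scale; values are stored as 0..3 codes."""
    flat = w.reshape(-1, group)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0                       # 2 bits -> 4 levels
    codes = np.round((flat - lo) / np.where(scale == 0, 1, scale))
    return codes.astype(np.uint8), scale, lo

def dequantize_2bit(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape).astype(np.float32)

def route_top_k(router_logits, k=4):
    """Pick k experts per token (down from the model's usual 10) and
    renormalize their gate weights with a softmax over the survivors."""
    idx = np.argsort(router_logits)[..., -k:]     # top-k expert ids
    gates = np.take_along_axis(router_logits, idx, -1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    return idx, gates / gates.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)   # one toy expert
codes, scale, lo = quantize_2bit(w)
w_hat = dequantize_2bit(codes, scale, lo, w.shape)
idx, gates = route_top_k(rng.standard_normal(8), k=4)  # 8 toy experts
```

Storing 2-bit codes plus per-group scales is where the roughly 4x memory saving over full precision comes from; the reconstruction error is bounded by half a quantization step per group, which is exactly the coarseness the post worries may break structured outputs like tool calls.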

Flash-Streamed Limited-Memory Inference For Oversized Models

  • Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
  • Dan Woods reportedly used techniques described in Apple's 2023 paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory."
  • The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model that reduces transfer volume and favors larger contiguous reads.
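The streaming pattern the bullets describe can be sketched as a memory-mapped expert file plus an LRU cache: only routed experts are pulled from flash, each fetch is one contiguous read, and cold experts are evicted to keep residency bounded. The file layout, sizes, and `ExpertStore` class below are illustrative assumptions, not the actual implementation from the post or Apple's paper.

```python
import numpy as np
from collections import OrderedDict

EXPERTS, D = 16, 64          # toy sizes; the real model has far more

class ExpertStore:
    def __init__(self, path, max_resident=4):
        # One contiguous (EXPERTS, D, D) array on disk: reading
        # weights[i] touches a single contiguous byte range, the
        # access pattern the cited cost model favors.
        self.weights = np.memmap(path, dtype=np.float32,
                                 mode="r", shape=(EXPERTS, D, D))
        self.cache = OrderedDict()        # expert id -> in-RAM copy
        self.max_resident = max_resident

    def get(self, i):
        if i in self.cache:               # hit: mark most recently used
            self.cache.move_to_end(i)
            return self.cache[i]
        if len(self.cache) >= self.max_resident:
            self.cache.popitem(last=False)    # evict LRU expert
        w = np.array(self.weights[i])         # one contiguous read
        self.cache[i] = w
        return w

# Build a toy weight file, then stream only the experts the router picks.
path = "experts.bin"
np.arange(EXPERTS * D * D, dtype=np.float32).tofile(path)
store = ExpertStore(path, max_resident=4)
total = sum(store.get(i).sum() for i in [3, 7, 3, 11, 3])  # 3 stays hot
```

Because routing is sparse, repeated hits on hot experts are served from RAM while the long tail stays on flash, which is why resident memory can be far smaller than the model's on-disk footprint.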

AI-Assisted Performance Engineering Workflow And Reproducibility Artifacts

  • Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • The repository danveloper/flash-moe reportedly contains the resulting code and a PDF paper (mostly written by Claude Opus 4.6) documenting the experiment.

Unknowns

  • What exact benchmarking protocol produced the reported tokens-per-second (prompt length, generation length, warm vs cold cache, first-token latency, inclusion/exclusion of paging and compilation overhead)?
  • What is the task-level quality impact of reducing experts-per-token (including comparisons across 3/4/6/10 experts) using standardized evaluations and disclosed prompts/datasets?
  • How does 2-bit vs 4-bit (and hybrid precision choices) affect not just general text quality but structured reliability (tool calling, JSON/schema adherence, function selection) across a broad suite of tools and schemas?
  • How much of the memory footprint and performance comes from the Apple-style flash paging cost model versus MoE sparsity alone, and what are the ablation results?
  • What hardware dependencies exist (SSD throughput/latency, thermals, power limits) and how well does the approach generalize across different machines with different storage and memory hierarchies?

Investor overlay

Read-throughs

  • MoE sparsity plus flash-streamed weights could enable running oversized models on consumer devices, creating demand for inference stacks optimized for SSD-to-DRAM behavior and contiguous reads.
  • Aggressive quantization can cause functional regressions such as broken tool calling, implying demand for higher-reliability quantization schemes and evaluation suites focused on structured outputs.
  • AI-assisted systems-optimization workflows may accelerate performance engineering, increasing the value of tooling and frameworks that support rapid experiment loops and reproducible artifacts.

What would confirm

  • Reproducible benchmarks detailing prompt and generation lengths, cache state, first-token latency, and inclusion of paging and compilation overhead, showing similar tokens-per-second figures across machines.
  • Standardized evaluations showing expert-count and hybrid-precision tradeoffs, including tool calling, JSON or schema adherence, and function selection, with 4-bit mitigating regressions seen at 2-bit.
  • Ablation results separating gains from MoE sparsity versus the flash-paging cost model, along with hardware sensitivity across SSD throughput, thermals, and power limits.

What would kill

  • Independent replication fails to reproduce throughput once paging, cache warmup, or compilation overhead are included, or performance collapses on different storage and memory hierarchies.
  • Broader structured-reliability testing shows 4-bit and hybrid precision still fail tool calling or schema adherence, making the approach unusable for real workflows despite good text quality.
  • Ablations show most gains are specific to one platform's cost model or storage behavior, with limited generalization beyond the reported machine configuration.

Sources