Rosa Del Mar

Daily Brief

Issue 77 2026-03-18

Quality/Performance Knobs: Expert Count And Hybrid Quantization

General
Sources: 1 • Confidence: High • Updated: 2026-03-25 17:54

Key takeaways

  • The post expresses uncertainty about output-quality impact, noting that the claim that 2-bit quantization is indistinguishable from 4-bit is supported only by thinly described evaluations.
  • Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
  • Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.

Sections

Quality/Performance Knobs: Expert Count And Hybrid Quantization

  • The post expresses uncertainty about output-quality impact, noting that the claim that 2-bit quantization is indistinguishable from 4-bit is supported only by thinly described evaluations.
  • A switch to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while 4-bit handled tool calling well.
  • In the described setup, expert weights are quantized to 2-bit while non-expert components such as embeddings and routing matrices remain at original precision, totaling about 5.5GB resident in memory during inference.
  • In the described implementation, the number of experts used per token was reduced from Qwen 3.5's usual 10 to 4.
  • The post claims that the largest quality drop occurred at 3 experts per token (relative to higher expert counts).
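The two knobs described above can be sketched in a few lines of NumPy: uniform 2-bit group quantization applied only to expert weights (router and embeddings stay at full precision), and top-k routing reduced from the usual 10 experts to 4. This is a minimal illustration under those assumptions; `quantize_2bit` and `route_top_k` are hypothetical names, not taken from the flash-moe repository.

```python
import numpy as np

def quantize_2bit(w, group=32):
    """Uniform 2-bit group quantization: each group of `group` weights
    shares a float min and scale; values are stored as 0..3 codes."""
    flat = w.reshape(-1, group)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0                       # 2 bits -> 4 levels
    codes = np.round((flat - lo) / np.where(scale == 0, 1, scale))
    return codes.astype(np.uint8), scale, lo

def dequantize_2bit(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape).astype(np.float32)

def route_top_k(router_logits, k=4):
    """Pick k experts per token (down from the model's usual 10) and
    renormalize their gate weights with a softmax over the survivors."""
    idx = np.argsort(router_logits)[..., -k:]     # top-k expert ids
    gates = np.take_along_axis(router_logits, idx, -1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    return idx, gates / gates.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)   # one toy expert
codes, scale, lo = quantize_2bit(w)
w_hat = dequantize_2bit(codes, scale, lo, w.shape)
idx, gates = route_top_k(rng.standard_normal(8), k=4)  # 8 toy experts
```

Storing 2-bit codes plus per-group scales is where the roughly 4x memory saving over full precision comes from; the reconstruction error is bounded by half a quantization step per group, which is exactly the coarseness the post worries may break structured outputs like tool calls.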

Flash-Streamed Limited-Memory Inference For Oversized Models

  • Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
  • Dan Woods reportedly used techniques described in Apple's 2023 paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory."
  • The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model that reduces transfer volume and favors larger contiguous reads.
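The streaming pattern the bullets describe can be sketched as a memory-mapped expert file plus an LRU cache: only routed experts are pulled from flash, each fetch is one contiguous read, and cold experts are evicted to keep residency bounded. The file layout, sizes, and `ExpertStore` class below are illustrative assumptions, not the actual implementation from the post or Apple's paper.

```python
import numpy as np
from collections import OrderedDict

EXPERTS, D = 16, 64          # toy sizes; the real model has far more

class ExpertStore:
    def __init__(self, path, max_resident=4):
        # One contiguous (EXPERTS, D, D) array on disk: reading
        # weights[i] touches a single contiguous byte range, the
        # access pattern the cited cost model favors.
        self.weights = np.memmap(path, dtype=np.float32,
                                 mode="r", shape=(EXPERTS, D, D))
        self.cache = OrderedDict()        # expert id -> in-RAM copy
        self.max_resident = max_resident

    def get(self, i):
        if i in self.cache:               # hit: mark most recently used
            self.cache.move_to_end(i)
            return self.cache[i]
        if len(self.cache) >= self.max_resident:
            self.cache.popitem(last=False)    # evict LRU expert
        w = np.array(self.weights[i])         # one contiguous read
        self.cache[i] = w
        return w

# Build a toy weight file, then stream only the experts the router picks.
path = "experts.bin"
np.arange(EXPERTS * D * D, dtype=np.float32).tofile(path)
store = ExpertStore(path, max_resident=4)
total = sum(store.get(i).sum() for i in [3, 7, 3, 11, 3])  # 3 stays hot
```

Because routing is sparse, repeated hits on hot experts are served from RAM while the long tail stays on flash, which is why resident memory can be far smaller than the model's on-disk footprint.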

AI-Assisted Performance Engineering Workflow And Reproducibility Artifacts

  • Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • The repository danveloper/flash-moe reportedly contains the resulting code and a PDF paper (mostly written by Claude Opus 4.6) documenting the experiment.

Unknowns

  • What exact benchmarking protocol produced the reported tokens-per-second (prompt length, generation length, warm vs cold cache, first-token latency, inclusion/exclusion of paging and compilation overhead)?
  • What is the task-level quality impact of reducing experts-per-token (including comparisons across 3/4/6/10 experts) using standardized evaluations and disclosed prompts/datasets?
  • How does 2-bit vs 4-bit (and hybrid precision choices) affect not just general text quality but structured reliability (tool calling, JSON/schema adherence, function selection) across a broad suite of tools and schemas?
  • How much of the memory footprint and performance comes from the Apple-style flash paging cost model versus MoE sparsity alone, and what are the ablation results?
  • What hardware dependencies exist (SSD throughput/latency, thermals, power limits) and how well does the approach generalize across different machines with different storage and memory hierarchies?

Investor overlay

Read-throughs

  • MoE sparsity plus flash-streamed weights could enable running oversized models on consumer devices, creating demand for inference stacks optimized for SSD-to-DRAM behavior and contiguous reads.
  • Aggressive quantization can cause functional regressions such as broken tool calling, implying demand for higher-reliability quantization schemes and evaluation suites focused on structured outputs.
  • AI-assisted systems-optimization workflows may accelerate performance engineering, increasing the value of tooling and frameworks that support rapid experiment loops and reproducible artifacts.

What would confirm

  • Reproducible benchmarks detailing prompt and generation lengths, cache state, first-token latency, and inclusion of paging and compilation overhead, showing similar tokens-per-second figures across machines.
  • Standardized evaluations showing expert-count and hybrid-precision tradeoffs, including tool calling, JSON or schema adherence, and function selection, with 4-bit mitigating regressions seen at 2-bit.
  • Ablation results separating gains from MoE sparsity versus the flash-paging cost model, along with hardware sensitivity across SSD throughput, thermals, and power limits.

What would kill

  • Independent replication fails to reproduce throughput once paging, cache warmup, or compilation overhead are included, or performance collapses on different storage and memory hierarchies.
  • Broader structured-reliability testing shows 4-bit and hybrid precision still fail tool calling or schema adherence, making the approach unusable for real workflows despite good text quality.
  • Ablations show most gains are specific to one platform's cost model or storage behavior, with limited generalization beyond the reported machine configuration.

Sources