Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
Sources: 1 • Confidence: High • Updated: 2026-04-13 03:51
Key takeaways
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB after quantization).
- The corpus asserts that uncertainty remains about the impact on output quality, because the evaluations supporting the claim that 2-bit output is indistinguishable from 4-bit are only thinly described.
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- In the described work, moving from 2-bit to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while the 4-bit version handled tool calling well.
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token, enabling expert weights to be streamed from SSD into memory on demand rather than kept resident in RAM.
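A back-of-envelope sketch of why the MoE property matters here. All figures below are illustrative assumptions, not from the source: "A17B" is read as roughly 17B active parameters per token, and caching is ignored entirely:

```python
# Illustrative back-of-envelope estimate; the per-token parameter
# count and bit width are assumptions for scale, not sourced figures.
BITS_PER_PARAM = 2            # expert weights quantized to 2-bit
ACTIVE_PARAMS = 17e9          # "A17B" read as ~17B active params per token

def bytes_per_token(active_params: float, bits: int) -> float:
    """Bytes of expert weights touched per token if nothing were cached."""
    return active_params * bits / 8

gb_per_token = bytes_per_token(ACTIVE_PARAMS, BITS_PER_PARAM) / 1e9
print(f"{gb_per_token:.2f} GB/token")  # ~4.25 GB/token with zero reuse
```

Even at 2-bit, re-reading every active weight from SSD each token would far exceed SSD bandwidth at 5.5 tokens/sec, which suggests expert reuse across tokens (hot experts staying resident) is doing much of the work.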
Sections
Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB after quantization).
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token, enabling expert weights to be streamed from SSD into memory on demand rather than kept resident in RAM.
- Dan Woods reportedly used techniques described in Apple's 2023 paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory."
- The cited Apple method runs models that exceed DRAM capacity by storing parameters in flash and loading them into DRAM on demand, guided by a cost model that reduces transfer volume and favors larger, contiguous reads.
- In the described setup, expert weights are quantized to 2-bit while non-expert components such as embeddings and routing matrices remain at original precision; about 5.5GB stays resident in memory during inference.
- In the described implementation, the number of experts used per token was reduced from the model's default of 10 to 4.
Quality Sensitivity To Aggressive Quantization And Expert-Routing Choices (Tool Calling As Canary)
- The corpus asserts that uncertainty remains about the impact on output quality, because the evaluations supporting the claim that 2-bit output is indistinguishable from 4-bit are only thinly described.
- In the described work, moving from 2-bit to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while the 4-bit version handled tool calling well.
- In the described implementation, the number of experts used per token was reduced from the model's default of 10 to 4.
- It is claimed that the largest quality drop from reducing the expert count occurred at 3 experts per token.
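Varying experts-per-token as described amounts to changing k in the router's top-k selection. A minimal sketch of that mechanism (not the Qwen router implementation), using numpy:

```python
import numpy as np

def route(router_logits: np.ndarray, k: int = 4):
    """Pick the top-k experts and renormalize their gate weights.

    Dropping k from 10 to 4 cuts the expert weights that must be
    resident (or streamed from SSD) per token, at some quality cost.
    """
    # Indices of the k highest-logit experts (descending order).
    top = np.argsort(router_logits)[..., ::-1][..., :k]
    gates = np.take_along_axis(router_logits, top, axis=-1)
    # Softmax over only the kept experts, so gate weights sum to 1.
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top, gates
```

The quality-vs-k trade-off reported above (worst at k=3) would show up downstream of this selection, in how well the renormalized mixture approximates the full 10-expert output.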
AI-Assisted Systems Optimization Workflow
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- The repository danveloper/flash-moe reportedly contains the resulting code and a PDF paper documenting the experiment, with the PDF mostly written by Claude Opus 4.6.
Unknowns
- What exact benchmark methodology and runtime settings produced the reported >5.5 tokens/sec result (including prompt length, caching behavior, sampling parameters, batch size, and whether the model was warmed up)?
- What are the precise quantization formats and implementation details used (including how 2-bit/4-bit were applied, per-layer or per-tensor choices, and how non-expert components were kept at higher precision)?
- What standardized evaluation suite, prompts, datasets, and scoring were used to assess quality, including the claim about where the largest quality drop occurs when varying experts-per-token?
- How robust is tool calling across multiple tools and schema complexities for 2-bit vs 4-bit (and across different expert counts), and what is the failure mode when it breaks?
- What are the latency distributions and I/O characteristics (e.g., SSD read sizes, contiguity, stalls) during inference, and how sensitive are results to SSD performance?