Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:17
Key takeaways
- A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, even though the model occupies about 209GB on disk (about 120GB after quantization).
- Claude Code was reportedly used in an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
- The corpus expresses uncertainty about the output-quality impact, noting that the claim that 2-bit output is indistinguishable from 4-bit output is supported by only thinly described evaluations.
- The move to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling, while 4-bit handled it well.
- Because the model is a Mixture-of-Experts architecture, only a subset of expert weights is needed per token, and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
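The streaming idea in the last takeaway can be sketched with a memory-mapped weight file: the OS pages in only the expert slices the router actually touches. This is an illustrative sketch, not the reported implementation; the file layout, shapes, and averaging-based combination are all hypothetical stand-ins.

```python
# Illustrative sketch of on-demand expert streaming (NOT the actual
# implementation): expert weights live in a single file on SSD and are
# memory-mapped, so only the experts selected for the current token are
# paged into RAM. Shapes and file layout here are hypothetical.
import os
import tempfile

import numpy as np

NUM_EXPERTS, D_IN, D_OUT = 8, 16, 16  # toy sizes for illustration

# Create a toy weight file standing in for the on-disk checkpoint.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.arange(NUM_EXPERTS * D_IN * D_OUT, dtype=np.float32).tofile(path)

# mmap the file: nothing is read until a slice is touched, so resident
# memory stays bounded by the experts actually used, not the full model.
experts = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(NUM_EXPERTS, D_IN, D_OUT))

def run_selected_experts(x, expert_ids):
    """Apply only the router-selected experts; the OS pages their
    weights in from SSD on first touch and may evict them later."""
    outputs = [x @ np.asarray(experts[e]) for e in expert_ids]
    return np.mean(outputs, axis=0)  # toy combination rule

x = np.ones(D_IN, dtype=np.float32)
y = run_selected_experts(x, expert_ids=[1, 5])  # 2 of 8 experts touched
```

The cited Apple flash-streaming paper goes further by choosing what to transfer with a cost model and favoring large contiguous reads; this sketch shows only the lazy-paging core of the idea.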
Sections
Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
- A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, even though the model occupies about 209GB on disk (about 120GB after quantization).
- Because the model is a Mixture-of-Experts architecture, only a subset of expert weights is needed per token, and those experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
- The implementation reportedly used techniques described in Apple’s 2023 paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory.”
- The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model to reduce transfer volume and favor larger contiguous reads.
- In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
Engineering Levers and Workflow: Hybrid Quantization, Expert-Count Knob, and AI-Assisted Iteration
- Claude Code was reportedly used in an autoresearch-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
- In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
- The implementation reportedly reduced the number of experts used per token from the model's usual 10 to 4, with the largest quality drop claimed to occur at 3 experts.
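The expert-count knob in the last bullet can be sketched with standard top-k MoE routing: the router scores every expert and only the k best are activated (and thus streamed), so lowering k directly cuts per-token I/O. The routing function and shapes below are generic illustrations, not the actual Qwen router.

```python
# Hypothetical illustration of the expert-count knob: a standard MoE
# router keeps the top-k experts by score, and k is simply reduced
# (e.g., 10 -> 4) to shrink how many experts must be streamed per token.
import numpy as np

def route(logits, k):
    """Return indices and renormalized softmax weights of the top-k experts."""
    top = np.argsort(logits)[-k:][::-1]            # best k expert ids
    scores = np.exp(logits[top] - logits[top].max())
    return top, scores / scores.sum()              # weights sum to 1

rng = np.random.default_rng(0)
logits = rng.normal(size=64)          # router scores over 64 experts (toy)
ids10, w10 = route(logits, k=10)      # the model's usual expert count
ids4, w4 = route(logits, k=4)         # reduced count: fewer experts
                                      # fetched from SSD per token
```

Because the top-4 set is always a subset of the top-10 set, the knob trades quality for bandwidth without changing the router itself.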
Quality and Functionality Risk: Evaluation Thinness and Tool-Calling Regressions from Aggressive Quantization
- The corpus expresses uncertainty about the output-quality impact, noting that the claim that 2-bit output is indistinguishable from 4-bit output is supported by only thinly described evaluations.
- The move to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling, while 4-bit handled it well.
Unknowns
- What exact benchmark setup produced the reported >5.5 tokens/second (prompt length, generation length, sampling parameters, batch size, and whether timing included I/O warm-up and streaming overhead)?
- What is the end-to-end latency profile (time-to-first-token, tail latency, and variability) when streaming experts from SSD, especially under concurrent system load?
- How does output quality change under the hybrid-quantization and expert-count modifications, measured with fully specified evaluation datasets, prompts, and metrics?
- What are the precise conditions under which 2-bit quantization breaks tool calling, and what tool-calling performance is achieved at 4-bit across a standardized suite of tools and schemas?
- What parts of the implementation are essential versus incidental (e.g., the specific I/O cost model, weight layout choices, caching strategy, and Metal/MLX kernel details)?