Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
Sources: 1 • Confidence: High • Updated: 2026-04-13 03:51
Key takeaways
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB after quantization).
- The corpus asserts that uncertainty remains about the impact on output quality, because the evaluations supporting the claim that 2-bit output is indistinguishable from 4-bit are only thinly described.
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- In the described work, moving from 2-bit to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while the 4-bit version handled tool calling well.
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token, enabling expert weights to be streamed from SSD into memory on demand rather than kept resident in RAM.
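A back-of-envelope sketch of why the MoE property matters here. All figures below are illustrative assumptions, not from the source: "A17B" is read as roughly 17B active parameters per token, and caching is ignored entirely:

```python
# Illustrative back-of-envelope estimate; the per-token parameter
# count and bit width are assumptions for scale, not sourced figures.
BITS_PER_PARAM = 2            # expert weights quantized to 2-bit
ACTIVE_PARAMS = 17e9          # "A17B" read as ~17B active params per token

def bytes_per_token(active_params: float, bits: int) -> float:
    """Bytes of expert weights touched per token if nothing were cached."""
    return active_params * bits / 8

gb_per_token = bytes_per_token(ACTIVE_PARAMS, BITS_PER_PARAM) / 1e9
print(f"{gb_per_token:.2f} GB/token")  # ~4.25 GB/token with zero reuse
```

Even at 2-bit, re-reading every active weight from SSD each token would far exceed SSD bandwidth at 5.5 tokens/sec, which suggests expert reuse across tokens (hot experts staying resident) is doing much of the work.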
Sections
Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models
- Dan Woods reportedly ran a custom Qwen3.5-397B-A17B variant at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB after quantization).
- Because Qwen3.5-397B-A17B is a Mixture-of-Experts model, only a subset of expert weights is needed per token, enabling expert weights to be streamed from SSD into memory on demand rather than kept resident in RAM.
- Dan Woods reportedly used techniques described in Apple's 2023 paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory."
- The cited Apple method runs models that exceed DRAM capacity by storing parameters in flash and loading them into DRAM on demand, guided by a cost model that reduces transfer volume and favors larger, contiguous reads.
- In the described setup, expert weights are quantized to 2-bit while non-expert components such as embeddings and routing matrices remain at original precision; about 5.5GB stays resident in memory during inference.
- In the described implementation, the number of experts used per token was reduced from the model's default of 10 to 4.
Quality Sensitivity To Aggressive Quantization And Expert-Routing Choices (Tool Calling As Canary)
- The corpus asserts that uncertainty remains about the impact on output quality, because the evaluations supporting the claim that 2-bit output is indistinguishable from 4-bit are only thinly described.
- In the described work, moving from 2-bit to 4-bit quantization was motivated by a finding that the 2-bit version broke tool calling while the 4-bit version handled tool calling well.
- In the described implementation, the number of experts used per token was reduced from the model's default of 10 to 4.
- It is claimed that the largest quality drop from reducing the expert count occurred at 3 experts per token.
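Varying experts-per-token as described amounts to changing k in the router's top-k selection. A minimal sketch of that mechanism (not the Qwen router implementation), using numpy:

```python
import numpy as np

def route(router_logits: np.ndarray, k: int = 4):
    """Pick the top-k experts and renormalize their gate weights.

    Dropping k from 10 to 4 cuts the expert weights that must be
    resident (or streamed from SSD) per token, at some quality cost.
    """
    # Indices of the k highest-logit experts (descending order).
    top = np.argsort(router_logits)[..., ::-1][..., :k]
    gates = np.take_along_axis(router_logits, top, axis=-1)
    # Softmax over only the kept experts, so gate weights sum to 1.
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top, gates
```

The quality-vs-k trade-off reported above (worst at k=3) would show up downstream of this selection, in how well the renormalized mixture approximates the full 10-expert output.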
AI-Assisted Systems Optimization Workflow
- Dan Woods reportedly used Claude Code with an autoresearch-style workflow to run about 90 experiments and generate MLX Objective-C and Metal code optimized for efficiency.
- The repository danveloper/flash-moe reportedly contains the resulting code and a PDF paper documenting the experiment, with the PDF mostly written by Claude Opus 4.6.
Unknowns
- What exact benchmark methodology and runtime settings produced the reported >5.5 tokens/sec result (including prompt length, caching behavior, sampling parameters, batch size, and whether the model was warmed up)?
- What are the precise quantization formats and implementation details used (including how 2-bit/4-bit were applied, per-layer or per-tensor choices, and how non-expert components were kept at higher precision)?
- What standardized evaluation suite, prompts, datasets, and scoring were used to assess quality, including the claim about where the largest quality drop occurs when varying experts-per-token?
- How robust is tool calling across multiple tools and schema complexities for 2-bit vs 4-bit (and across different expert counts), and what is the failure mode when it breaks?
- What are the latency distributions and I/O characteristics (e.g., SSD read sizes, contiguity, stalls) during inference, and how sensitive are results to SSD performance?