Rosa Del Mar

Daily Brief

Issue 77 2026-03-18

Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models

6 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:17

Key takeaways

  • A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Claude Code was reportedly used in an auto-research-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • The corpus expresses uncertainty about the output-quality impact, noting that a claim that 2-bit is indistinguishable from 4-bit is supported by only thinly described evaluations.
  • A change to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling while 4-bit handled tool calling well.
  • Because the model is Mixture-of-Experts, only a small subset of expert weights is needed for each token, so the selected experts can be streamed from SSD into memory on demand rather than kept resident in RAM.

Sections

Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models

  • A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Because the model is Mixture-of-Experts, only a small subset of expert weights is needed for each token, so the selected experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
  • The implementation reportedly used techniques described in Apple’s 2023 paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory.”
  • The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model to reduce transfer volume and favor larger contiguous reads.
  • In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
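The streaming idea in the bullets above can be sketched in miniature: keep a small, recently-used set of experts resident in RAM and read the rest from disk only when a token's router selects them. The `ExpertCache` class and the simulated SSD below are hypothetical illustrations of the general technique, not the implementation described in the source.

```python
from collections import OrderedDict

# Hypothetical sketch: stream only the experts a token's router selects,
# keeping a small LRU-resident set in RAM instead of all experts.
class ExpertCache:
    def __init__(self, load_fn, capacity):
        self.load_fn = load_fn      # reads one expert's weights from "SSD"
        self.capacity = capacity    # max experts resident in RAM at once
        self.cache = OrderedDict()
        self.loads = 0              # count of SSD reads (cache misses)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            return self.cache[expert_id]
        self.loads += 1
        weights = self.load_fn(expert_id)
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

# Simulated SSD: 64 experts; each "weights" entry is just a placeholder list.
ssd = {i: [float(i)] * 4 for i in range(64)}
cache = ExpertCache(ssd.__getitem__, capacity=8)

# A toy router picks 4 experts per token; repeated picks hit the RAM cache,
# so only 6 of the 12 lookups below actually touch the simulated SSD.
for token_experts in [[1, 2, 3, 4], [2, 3, 5, 6], [1, 2, 3, 4]]:
    for e in token_experts:
        cache.get(e)

print(cache.loads, "SSD loads for 12 expert lookups")
```

Locality in expert selection is what makes this pay off: the more consecutive tokens reuse the same experts, the fewer SSD reads are needed per token.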

Engineering Levers and Workflow: Hybrid Quantization, Expert-Count Knob, and AI-Assisted Iteration

  • Claude Code was reportedly used in an auto-research-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
  • The implementation reportedly reduced experts used per token from the model’s usual 10 to 4, and it was claimed that the largest quality drop occurred at 3 experts.
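The expert-count knob can be illustrated with a toy top-k router: score every expert, keep the k highest, and renormalize their mixture weights. The function and logits below are hypothetical (the toy router has 8 experts, while the real model reportedly went from 10 experts per token down to 4); lowering k means fewer experts must be streamed and computed per token.

```python
import math

# Hypothetical sketch of the experts-per-token knob: softmax over router
# logits, keep only the top-k experts, renormalize their weights to sum to 1.
def route_top_k(scores, k):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}   # renormalized mixture weights

router_logits = [2.0, 1.5, 0.2, 1.8, -0.5, 0.9, 1.1, 0.0]
mix_all = route_top_k(router_logits, k=8)      # use every expert (toy "full")
mix4 = route_top_k(router_logits, k=4)         # the reduced setting: 4 experts
```

The reported claim that quality drops sharply at 3 experts suggests the knob has a cliff rather than a smooth trade-off, which is worth probing per model.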

Quality and Functionality Risk: Evaluation Thinness and Tool-Calling Regressions from Aggressive Quantization

  • The corpus expresses uncertainty about the output-quality impact, noting that a claim that 2-bit is indistinguishable from 4-bit is supported by only thinly described evaluations.
  • A change to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling while 4-bit handled tool calling well.
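One way to see why aggressive quantization can cause discrete capability failures is to compare reconstruction error at 2-bit versus 4-bit. The symmetric uniform quantizer below is a generic sketch, not the quantization scheme used in the reported work; at 2-bit there are only three representable signed levels, so mid-sized weights collapse to zero.

```python
# Hypothetical sketch: symmetric uniform quantization at different bit widths.
def quantize(weights, bits):
    """Quantize to signed integers in [-levels, levels] with one scale."""
    levels = 2 ** (bits - 1) - 1            # 7 levels for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / levels or 1.0
    q = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

expert_w = [0.9, -0.45, 0.1, -0.02]         # toy expert weights (quantized)
router_w = [0.3, -0.7]                      # router stays at full precision

q4, s4 = quantize(expert_w, bits=4)
q2, s2 = quantize(expert_w, bits=2)

# 4-bit reconstruction tracks the originals far more closely than 2-bit,
# where every weight smaller than half the max rounds away entirely.
err4 = max(abs(a - b) for a, b in zip(expert_w, dequantize(q4, s4)))
err2 = max(abs(a - b) for a, b in zip(expert_w, dequantize(q2, s2)))
```

Errors of this kind are benign for many free-form generations but can break brittle structured outputs, which is one plausible mechanism behind the reported 2-bit tool-calling failures.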

Unknowns

  • What exact benchmark setup produced the reported >5.5 tokens/second (prompt length, generation length, sampling parameters, batch size, and whether timing included I/O warm-up and streaming overhead)?
  • What is the end-to-end latency profile (time-to-first-token, tail latency, and variability) when streaming experts from SSD, especially under concurrent system load?
  • How does output quality change under the hybrid-quantization and expert-count modifications, measured with fully specified evaluation datasets, prompts, and metrics?
  • What are the precise conditions under which 2-bit quantization breaks tool calling, and what tool-calling performance is achieved at 4-bit across a standardized suite of tools and schemas?
  • What parts of the implementation are essential versus incidental (e.g., the specific I/O cost model, weight layout choices, caching strategy, and Metal/MLX kernel details)?
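A benchmark that answers the timing questions above would, at minimum, report time-to-first-token separately from steady-state decode throughput, since streaming warm-up lands almost entirely on the first token. The harness below is a hypothetical sketch with a toy stand-in for the decode step, not the benchmark used in the reported runs.

```python
import time

# Hypothetical sketch: separate cold-start (first token, including any
# streaming/warm-up I/O) from steady-state decode throughput.
def benchmark(generate_token, n_tokens):
    t0 = time.perf_counter()
    generate_token()                         # first token: includes warm-up
    t_first = time.perf_counter() - t0
    t1 = time.perf_counter()
    for _ in range(n_tokens - 1):
        generate_token()                     # steady-state decode steps
    tokens_per_s = (n_tokens - 1) / (time.perf_counter() - t1)
    return t_first, tokens_per_s

# Toy stand-in for a decode step; a real run would call the model here.
t_first, tok_per_s = benchmark(lambda: sum(range(1000)), n_tokens=50)
```

Reporting both numbers, plus repeated runs under load for tail-latency variance, would resolve most of the ambiguity in the headline tokens/second figure.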

Investor overlay

Read-throughs

  • Local on-device inference for very large MoE models could become more feasible on memory-limited consumer hardware if SSD streaming plus I/O-aware layouts achieve usable throughput without unacceptable latency or quality loss.
  • Hybrid quantization and experts-per-token reduction may become practical knobs for deploying MoE models locally, but may create discrete capability failures such as tool calling regressions under aggressive quantization.
  • AI-assisted engineering workflows using iterative experiments and code generation could shorten optimization cycles for hardware-specific inference stacks, potentially accelerating performance tuning for Metal- and MLX-style backends.

What would confirm

  • Fully specified benchmarks reproduce over 5.5 tokens per second with disclosed prompt and generation lengths, sampling, batch size, and whether timings include streaming overhead and warm-up.
  • End-to-end latency results show stable time to first token and tail latency while streaming experts from SSD, including tests under concurrent system load and repeated runs.
  • Standardized evaluations show tool calling works reliably at 4-bit and identify clear thresholds where 2-bit fails, using a defined suite of tools, schemas, and metrics.

What would kill

  • Reproductions show throughput collapses or becomes highly variable once streaming overhead is included, or time to first token and tail latency become impractical for interactive use.
  • Quality or functionality degrades materially under hybrid quantization or reduced experts-per-token, especially on task-critical behaviors like tool calling, and cannot be mitigated without much higher precision.
  • The approach depends on highly specific kernel, layout, or caching details that do not generalize beyond a narrow setup, limiting applicability to broader hardware or model variants.

Sources