Rosa Del Mar

Daily Brief

Issue 77 2026-03-18

Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models

6 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:17

Key takeaways

  • A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Claude Code was reportedly used in an auto-research-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • The corpus expresses uncertainty about the output-quality impact, noting that a claim that 2-bit is indistinguishable from 4-bit is supported by only thinly described evaluations.
  • A change to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling while 4-bit handled tool calling well.
  • Because the model is Mixture-of-Experts, only a small subset of expert weights is needed for each token, so the selected experts can be streamed from SSD into memory on demand rather than kept resident in RAM.

Sections

Limited-Memory Local Inference via Flash/SSD Streaming for MoE Models

  • A custom Qwen3.5-397B-A17B variant was reportedly run at over 5.5 tokens/second on a 48GB MacBook Pro M3 Max, despite the model being about 209GB on disk (about 120GB quantized).
  • Because the model is Mixture-of-Experts, only a small subset of expert weights is needed for each token, so the selected experts can be streamed from SSD into memory on demand rather than kept resident in RAM.
  • The implementation reportedly used techniques described in Apple’s 2023 paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory.”
  • The cited Apple method runs models that exceed DRAM by storing parameters in flash and moving them into DRAM on demand using a cost model to reduce transfer volume and favor larger contiguous reads.
  • In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
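The streaming idea in the bullets above can be sketched in miniature: keep a small, recently-used set of experts resident in RAM and read the rest from disk only when a token's router selects them. The `ExpertCache` class and the simulated SSD below are hypothetical illustrations of the general technique, not the implementation described in the source.

```python
from collections import OrderedDict

# Hypothetical sketch: stream only the experts a token's router selects,
# keeping a small LRU-resident set in RAM instead of all experts.
class ExpertCache:
    def __init__(self, load_fn, capacity):
        self.load_fn = load_fn      # reads one expert's weights from "SSD"
        self.capacity = capacity    # max experts resident in RAM at once
        self.cache = OrderedDict()
        self.loads = 0              # count of SSD reads (cache misses)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            return self.cache[expert_id]
        self.loads += 1
        weights = self.load_fn(expert_id)
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

# Simulated SSD: 64 experts; each "weights" entry is just a placeholder list.
ssd = {i: [float(i)] * 4 for i in range(64)}
cache = ExpertCache(ssd.__getitem__, capacity=8)

# A toy router picks 4 experts per token; repeated picks hit the RAM cache,
# so only 6 of the 12 lookups below actually touch the simulated SSD.
for token_experts in [[1, 2, 3, 4], [2, 3, 5, 6], [1, 2, 3, 4]]:
    for e in token_experts:
        cache.get(e)

print(cache.loads, "SSD loads for 12 expert lookups")
```

Locality in expert selection is what makes this pay off: the more consecutive tokens reuse the same experts, the fewer SSD reads are needed per token.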

Engineering Levers and Workflow: Hybrid Quantization, Expert-Count Knob, and AI-Assisted Iteration

  • Claude Code was reportedly used in an auto-research-style workflow to run about 90 experiments and generate MLX, Objective-C, and Metal code optimized for efficiency.
  • In the described setup, expert weights were quantized to 2-bit while non-expert components (including embeddings and routing matrices) remained at original precision, totaling about 5.5GB resident in memory during inference.
  • The implementation reportedly reduced experts used per token from the model’s usual 10 to 4, and it was claimed that the largest quality drop occurred at 3 experts.
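The expert-count knob can be illustrated with a toy top-k router: score every expert, keep the k highest, and renormalize their mixture weights. The function and logits below are hypothetical (the toy router has 8 experts, while the real model reportedly went from 10 experts per token down to 4); lowering k means fewer experts must be streamed and computed per token.

```python
import math

# Hypothetical sketch of the experts-per-token knob: softmax over router
# logits, keep only the top-k experts, renormalize their weights to sum to 1.
def route_top_k(scores, k):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}   # renormalized mixture weights

router_logits = [2.0, 1.5, 0.2, 1.8, -0.5, 0.9, 1.1, 0.0]
mix_all = route_top_k(router_logits, k=8)      # use every expert (toy "full")
mix4 = route_top_k(router_logits, k=4)         # the reduced setting: 4 experts
```

The reported claim that quality drops sharply at 3 experts suggests the knob has a cliff rather than a smooth trade-off, which is worth probing per model.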

Quality and Functionality Risk: Evaluation Thinness and Tool-Calling Regressions from Aggressive Quantization

  • The corpus expresses uncertainty about the output-quality impact, noting that a claim that 2-bit is indistinguishable from 4-bit is supported by only thinly described evaluations.
  • A change to 4-bit quantization was reportedly motivated by a finding that 2-bit quantization broke tool calling while 4-bit handled tool calling well.
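One way to see why aggressive quantization can cause discrete capability failures is to compare reconstruction error at 2-bit versus 4-bit. The symmetric uniform quantizer below is a generic sketch, not the quantization scheme used in the reported work; at 2-bit there are only three representable signed levels, so mid-sized weights collapse to zero.

```python
# Hypothetical sketch: symmetric uniform quantization at different bit widths.
def quantize(weights, bits):
    """Quantize to signed integers in [-levels, levels] with one scale."""
    levels = 2 ** (bits - 1) - 1            # 7 levels for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / levels or 1.0
    q = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

expert_w = [0.9, -0.45, 0.1, -0.02]         # toy expert weights (quantized)
router_w = [0.3, -0.7]                      # router stays at full precision

q4, s4 = quantize(expert_w, bits=4)
q2, s2 = quantize(expert_w, bits=2)

# 4-bit reconstruction tracks the originals far more closely than 2-bit,
# where every weight smaller than half the max rounds away entirely.
err4 = max(abs(a - b) for a, b in zip(expert_w, dequantize(q4, s4)))
err2 = max(abs(a - b) for a, b in zip(expert_w, dequantize(q2, s2)))
```

Errors of this kind are benign for many free-form generations but can break brittle structured outputs, which is one plausible mechanism behind the reported 2-bit tool-calling failures.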

Unknowns

  • What exact benchmark setup produced the reported >5.5 tokens/second (prompt length, generation length, sampling parameters, batch size, and whether timing included I/O warm-up and streaming overhead)?
  • What is the end-to-end latency profile (time-to-first-token, tail latency, and variability) when streaming experts from SSD, especially under concurrent system load?
  • How does output quality change under the hybrid-quantization and expert-count modifications, measured with fully specified evaluation datasets, prompts, and metrics?
  • What are the precise conditions under which 2-bit quantization breaks tool calling, and what tool-calling performance is achieved at 4-bit across a standardized suite of tools and schemas?
  • What parts of the implementation are essential versus incidental (e.g., the specific I/O cost model, weight layout choices, caching strategy, and Metal/MLX kernel details)?
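A benchmark that answers the timing questions above would, at minimum, report time-to-first-token separately from steady-state decode throughput, since streaming warm-up lands almost entirely on the first token. The harness below is a hypothetical sketch with a toy stand-in for the decode step, not the benchmark used in the reported runs.

```python
import time

# Hypothetical sketch: separate cold-start (first token, including any
# streaming/warm-up I/O) from steady-state decode throughput.
def benchmark(generate_token, n_tokens):
    t0 = time.perf_counter()
    generate_token()                         # first token: includes warm-up
    t_first = time.perf_counter() - t0
    t1 = time.perf_counter()
    for _ in range(n_tokens - 1):
        generate_token()                     # steady-state decode steps
    tokens_per_s = (n_tokens - 1) / (time.perf_counter() - t1)
    return t_first, tokens_per_s

# Toy stand-in for a decode step; a real run would call the model here.
t_first, tok_per_s = benchmark(lambda: sum(range(1000)), n_tokens=50)
```

Reporting both numbers, plus repeated runs under load for tail-latency variance, would resolve most of the ambiguity in the headline tokens/second figure.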

Investor overlay

Read-throughs

  • Local on-device inference for very large MoE models could become more feasible on memory-limited consumer hardware if SSD streaming plus I/O-aware layouts achieve usable throughput without unacceptable latency or quality loss.
  • Hybrid quantization and experts-per-token reduction may become practical knobs for deploying MoE models locally, but may create discrete capability failures such as tool calling regressions under aggressive quantization.
  • AI-assisted engineering workflows using iterative experiments and code generation could shorten optimization cycles for hardware-specific inference stacks, potentially accelerating performance tuning for Metal- and MLX-style backends.

What would confirm

  • Fully specified benchmarks reproduce over 5.5 tokens per second with disclosed prompt and generation lengths, sampling, batch size, and whether timings include streaming overhead and warm-up.
  • End-to-end latency results show stable time to first token and tail latency while streaming experts from SSD, including tests under concurrent system load and repeated runs.
  • Standardized evaluations show tool calling works reliably at 4-bit and identify clear thresholds where 2-bit fails, using a defined suite of tools, schemas, and metrics.

What would kill

  • Reproductions show throughput collapses or becomes highly variable once streaming overhead is included, or time to first token and tail latency become impractical for interactive use.
  • Quality or functionality degrades materially under hybrid quantization or reduced experts-per-token, especially on task-critical behaviors like tool calling, and cannot be mitigated without much higher precision.
  • The approach depends on highly specific kernel, layout, or caching details that do not generalize beyond a narrow setup, limiting applicability to broader hardware or model variants.

Sources