Rosa Del Mar

Daily Brief

Issue 83 2026-03-24

Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices With Low Throughput

Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:53

Key takeaways

  • @seikixtc reported running Kimi K2.5 (1T parameters with 32B active weights at a time) in 96GB of RAM on an M2 Max MacBook Pro.
  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
  • The streaming-experts technique enables running large Mixture-of-Experts (MoE) models on hardware that lacks the RAM to hold the full model, by streaming the required expert weights from SSD for each token instead of loading the entire model into memory.
  • @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Sections

Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices With Low Throughput

  • @seikixtc reported running Kimi K2.5 (1T parameters with 32B active weights at a time) in 96GB of RAM on an M2 Max MacBook Pro.
  • @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
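
These numbers suggest storage bandwidth, rather than compute, is the binding constraint. A rough back-of-envelope sketch (the 32B active-parameter figure comes from the reports above; the quantization level, cache hit rate, and SSD bandwidth are illustrative assumptions, not measured values):

```python
# Rough sanity check of how SSD bandwidth bounds streamed-expert throughput.
# Only the 32B active-parameter count comes from the reports above; every
# other figure below is an illustrative assumption.

active_params = 32e9      # ~32B active weights per token (Kimi K2.5, per the brief)
bytes_per_param = 0.5     # assumes ~4-bit quantization
cache_hit_rate = 0.9      # assumed fraction of needed experts already in RAM
ssd_bandwidth = 6e9       # assumed ~6 GB/s NVMe sequential-read bandwidth

bytes_streamed_per_token = active_params * bytes_per_param * (1 - cache_hit_rate)
max_tokens_per_sec = ssd_bandwidth / bytes_streamed_per_token

print(f"~{bytes_streamed_per_token / 1e9:.1f} GB streamed per token")
print(f"SSD-bound ceiling: ~{max_tokens_per_sec:.1f} tokens/sec")
```

Under these assumptions the ceiling lands in the low single digits of tokens per second, consistent with the throughputs reported above; better caching or faster storage raises the ceiling roughly linearly.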

Trajectory And Watch Items: Expectation Of Broad Usefulness And Active Optimization Loops

  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
  • The author states an expectation that the streaming-experts technique will be broadly useful and continue to advance.

Mechanism: SSD Streaming Of MoE Expert Weights To Reduce RAM Requirements

  • The streaming-experts technique enables running large Mixture-of-Experts (MoE) models on hardware that lacks the RAM to hold the full model, by streaming the required expert weights from SSD for each token instead of loading the entire model into memory.
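
As a concrete illustration of the mechanism, here is a minimal sketch of streamed-expert fetching with an in-RAM LRU cache (`ExpertStreamer` and `load_fn` are hypothetical names, not any real library's API; real implementations additionally handle quantized tensor formats, prefetching, and memory-mapped I/O):

```python
from collections import OrderedDict
from typing import Callable

class ExpertStreamer:
    """Keep an LRU cache of expert weight blobs in RAM; stream misses from SSD.

    Hypothetical sketch: load_fn stands in for whatever deserializes one
    expert's weights from disk.
    """

    def __init__(self, load_fn: Callable[[int], bytes], capacity: int):
        self.load_fn = load_fn    # reads one expert's weights from storage
        self.capacity = capacity  # maximum number of experts resident in RAM
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # hit: mark as recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # miss: stream from SSD
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

# Per token, the MoE router selects a few experts; only those are fetched.
streamer = ExpertStreamer(load_fn=lambda i: b"expert-%d" % i, capacity=4)
for token_experts in [[0, 3], [0, 5], [3, 7], [0, 3]]:  # hypothetical routing
    tensors = [streamer.get(e) for e in token_experts]
```

Because routers tend to reuse a subset of experts within a context, the cache hit rate, and hence the SSD traffic per token, varies with the prompt, which is why I/O locality appears below as an open question.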

Watchlist

  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.

Unknowns

  • What are the exact benchmark methodologies behind the reported throughput numbers (prompt setup, context length, batch size, warm-up, and whether throughput is sustained over time)?
  • What SSD bandwidth, latency, and I/O patterns are required for streamed-expert inference to work well, and how sensitive is performance to SSD speed?
  • How does expert-selection behavior affect I/O locality (cacheability) and performance variance across different prompts and tasks?
  • What are the power draw, thermals, and throttling characteristics for these demonstrations on laptops and especially on phones during sustained inference?
  • What accuracy or output-quality tradeoffs (if any) are introduced by the specific setup used for streaming experts (including any model format choices), compared to non-streamed baselines?

Investor overlay

Read-throughs

  • On-device inference for very large MoE models may become more feasible via SSD-streamed expert weights, shifting the binding constraint from RAM capacity toward storage bandwidth and software optimization.
  • If streamed-expert inference becomes practical, demand may rise for devices with faster SSDs and better thermal headroom to sustain token generation over longer sessions.
  • Active optimization loops for streamed-expert inference suggest near-term performance volatility, with software-level improvements potentially driving usability without new hardware.

What would confirm

  • Reproducible benchmarks that report prompt setup, context length, batch size, warm-up, and sustained throughput for streamed-expert inference on consumer laptops and phones.
  • Measured SSD bandwidth and I/O patterns showing stable performance across prompts, with improved caching or locality and reduced variance in tokens per second.
  • Sustained power and thermal measurements demonstrating limited throttling during extended inference, plus comparisons showing minimal quality loss versus non-streamed baselines.

What would kill

  • Benchmarks showing throughput collapses or becomes highly variable due to SSD bottlenecks, poor expert-selection locality, or I/O latency sensitivity across tasks.
  • Thermal or power constraints on laptops or phones causing rapid throttling, making reported tokens-per-second figures unsustainable for real workloads.
  • Clear evidence of meaningful accuracy or output-quality degradation from the streamed-expert setup, or from required model format changes, compared to non-streamed baselines.

Sources

  1. 2026-03-24 simonwillison.net