Rosa Del Mar

Daily Brief

Issue 83 2026-03-24

Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices With Low Throughput

Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:53

Key takeaways

  • @seikixtc reported running Kimi K2.5 (1T parameters with 32B active weights at a time) in 96GB of RAM on an M2 Max MacBook Pro.
  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
  • The streaming-experts technique enables running large Mixture-of-Experts (MoE) models on hardware that lacks the RAM to hold the full model, by streaming the required expert weights from SSD for each token instead of loading the entire model into memory.
  • @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Sections

Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices With Low Throughput

  • @seikixtc reported running Kimi K2.5 (1T parameters with 32B active weights at a time) in 96GB of RAM on an M2 Max MacBook Pro.
  • @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
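
These numbers suggest storage bandwidth, rather than compute, is the binding constraint. A rough back-of-envelope sketch (the 32B active-parameter figure comes from the reports above; the quantization level, cache hit rate, and SSD bandwidth are illustrative assumptions, not measured values):

```python
# Rough sanity check of how SSD bandwidth bounds streamed-expert throughput.
# Only the 32B active-parameter count comes from the reports above; every
# other figure below is an illustrative assumption.

active_params = 32e9      # ~32B active weights per token (Kimi K2.5, per the brief)
bytes_per_param = 0.5     # assumes ~4-bit quantization
cache_hit_rate = 0.9      # assumed fraction of needed experts already in RAM
ssd_bandwidth = 6e9       # assumed ~6 GB/s NVMe sequential-read bandwidth

bytes_streamed_per_token = active_params * bytes_per_param * (1 - cache_hit_rate)
max_tokens_per_sec = ssd_bandwidth / bytes_streamed_per_token

print(f"~{bytes_streamed_per_token / 1e9:.1f} GB streamed per token")
print(f"SSD-bound ceiling: ~{max_tokens_per_sec:.1f} tokens/sec")
```

Under these assumptions the ceiling lands in the low single digits of tokens per second, consistent with the throughputs reported above; better caching or faster storage raises the ceiling roughly linearly.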

Trajectory And Watch Items: Expectation Of Broad Usefulness And Active Optimization Loops

  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
  • The author states an expectation that the streaming-experts technique will be broadly useful and continue to advance.

Mechanism: SSD Streaming Of MoE Expert Weights To Reduce RAM Requirements

  • The streaming-experts technique enables running large Mixture-of-Experts (MoE) models on hardware that lacks the RAM to hold the full model, by streaming the required expert weights from SSD for each token instead of loading the entire model into memory.
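
As a concrete illustration of the mechanism, here is a minimal sketch of streamed-expert fetching with an in-RAM LRU cache (`ExpertStreamer` and `load_fn` are hypothetical names, not any real library's API; real implementations additionally handle quantized tensor formats, prefetching, and memory-mapped I/O):

```python
from collections import OrderedDict
from typing import Callable

class ExpertStreamer:
    """Keep an LRU cache of expert weight blobs in RAM; stream misses from SSD.

    Hypothetical sketch: load_fn stands in for whatever deserializes one
    expert's weights from disk.
    """

    def __init__(self, load_fn: Callable[[int], bytes], capacity: int):
        self.load_fn = load_fn    # reads one expert's weights from storage
        self.capacity = capacity  # maximum number of experts resident in RAM
        self.cache: "OrderedDict[int, bytes]" = OrderedDict()

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # hit: mark as recently used
            return self.cache[expert_id]
        weights = self.load_fn(expert_id)      # miss: stream from SSD
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return weights

# Per token, the MoE router selects a few experts; only those are fetched.
streamer = ExpertStreamer(load_fn=lambda i: b"expert-%d" % i, capacity=4)
for token_experts in [[0, 3], [0, 5], [3, 7], [0, 3]]:  # hypothetical routing
    tensors = [streamer.get(e) for e in token_experts]
```

Because routers tend to reuse a subset of experts within a context, the cache hit rate, and hence the SSD traffic per token, varies with the prompt, which is why I/O locality appears below as an open question.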

Watchlist

  • Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.

Unknowns

  • What are the exact benchmark methodologies behind the reported throughput numbers (prompt setup, context length, batch size, warm-up, and whether throughput is sustained over time)?
  • What SSD bandwidth, latency, and I/O patterns are required for streamed-expert inference to work well, and how sensitive is performance to SSD speed?
  • How does expert-selection behavior affect I/O locality (cacheability) and performance variance across different prompts and tasks?
  • What are the power draw, thermals, and throttling characteristics for these demonstrations on laptops and especially on phones during sustained inference?
  • What accuracy or output-quality tradeoffs (if any) are introduced by the specific setup used for streaming experts (including any model format choices), compared to non-streamed baselines?

Investor overlay

Read-throughs

  • On-device inference for very large MoE models may become more feasible via SSD-streamed expert weights, shifting the binding constraint from RAM capacity toward storage bandwidth and software optimization.
  • If streamed-expert inference becomes practical, demand may rise for devices with faster SSDs and better thermal headroom to sustain token generation over longer sessions.
  • Active optimization loops for streamed-expert inference suggest near-term performance volatility, with software-level improvements potentially driving usability without new hardware.

What would confirm

  • Reproducible benchmarks that report prompt setup, context length, batch size, warm-up, and sustained throughput for streamed-expert inference on consumer laptops and phones.
  • Measured SSD bandwidth and I/O patterns showing stable performance across prompts, with improved caching or locality and reduced variance in tokens per second.
  • Sustained power and thermal measurements demonstrating limited throttling during extended inference, plus comparisons showing minimal quality loss versus non-streamed baselines.

What would kill

  • Benchmarks showing throughput collapses or becomes highly variable due to SSD bottlenecks, poor expert-selection locality, or I/O latency sensitivity across tasks.
  • Thermal or power constraints on laptops or phones causing rapid throttling, making reported tokens-per-second figures unsustainable for real workloads.
  • Clear evidence of meaningful accuracy or output-quality degradation from the streamed-expert setup, or from required model format changes, compared to non-streamed baselines.

Sources

  1. 2026-03-24 simonwillison.net