Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices At Low Throughput
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:53
Key takeaways
- @seikixtc reported running Kimi K2.5 (1T total parameters, 32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
- Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
- Streaming-experts enables running large Mixture-of-Experts models on hardware with insufficient RAM by streaming only the expert weights each token requires from SSD, instead of loading the entire model into memory.
- @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
- Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
Sections
Demonstrations: Frontier-Scale MoE-Class Models On Consumer Devices At Low Throughput
- @seikixtc reported running Kimi K2.5 (1T total parameters, 32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
- @anemll demonstrated Qwen3.5-397B-A17B running on an iPhone at approximately 0.6 tokens per second.
- Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
Trajectory And Watch Items: Expectation Of Broad Usefulness And Active Optimization Loops
- Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
- The author expects the streaming-experts technique to be broadly useful and to continue advancing.
Mechanism: SSD Streaming Of MoE Expert Weights To Reduce RAM Requirements
- Streaming-experts enables running large Mixture-of-Experts models on hardware with insufficient RAM by streaming only the expert weights each token requires from SSD, instead of loading the entire model into memory.
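The mechanism above can be sketched in miniature: keep the expert bank on disk as a memory-mapped file, and for each token read only the experts the router selects. This is a toy illustration (hypothetical shapes, file layout, and router; real systems use far larger experts and optimized SSD I/O), not the implementation used in the demonstrations.

```python
# Toy sketch of streamed-expert MoE inference. Assumptions: a flat on-disk
# expert bank, a linear router, top-k gating. Only selected experts are
# paged in from disk per token; the full bank never resides in RAM.
import os
import tempfile
import numpy as np

D, N_EXPERTS, TOP_K = 64, 8, 2  # hidden size, expert count, experts per token

# Stand-in for an on-disk checkpoint: one (D, D) weight matrix per expert.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
rng = np.random.default_rng(0)
rng.standard_normal((N_EXPERTS, D, D), dtype=np.float32).tofile(path)

# Memory-map the file: the OS pages in only the slices we actually touch.
bank = np.memmap(path, dtype=np.float32, mode="r", shape=(N_EXPERTS, D, D))

def moe_forward(x: np.ndarray, router_w: np.ndarray) -> np.ndarray:
    """One token through one MoE layer: route, stream top-k experts, mix."""
    logits = router_w @ x
    top = np.argsort(logits)[-TOP_K:]            # selected expert ids
    gates = np.exp(logits[top])
    gates /= gates.sum()                         # softmax over selected experts
    out = np.zeros_like(x)
    for gate, e in zip(gates, top):
        w = np.asarray(bank[e])                  # reads ~one expert from disk
        out += gate * (w @ x)
    return out

router = rng.standard_normal((N_EXPERTS, D), dtype=np.float32)
y = moe_forward(rng.standard_normal(D, dtype=np.float32), router)
print(y.shape)  # (64,)
```

Per token, this touches TOP_K of N_EXPERTS expert matrices, so resident working memory scales with the active parameters rather than the total parameter count; that is the property that lets a 1T-parameter model fit in 96GB of RAM.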
Watchlist
- Dan Woods and collaborators are running autoresearch loops to find further optimizations for streamed-expert inference performance.
Unknowns
- What are the exact benchmark methodologies behind the reported throughput numbers (prompt setup, context length, batch size, warm-up, and whether throughput is sustained over time)?
- What SSD bandwidth, latency, and I/O patterns are required for streamed-expert inference to work well, and how sensitive is performance to SSD speed?
- How does expert-selection behavior affect I/O locality (cacheability) and performance variance across different prompts and tasks?
- What are the power draw, thermals, and throttling characteristics for these demonstrations on laptops and especially on phones during sustained inference?
- What accuracy or output-quality tradeoffs (if any) are introduced by the specific setup used for streaming experts (including any model format choices), compared to non-streamed baselines?
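The SSD-bandwidth unknown can be framed with a back-of-envelope bound. All figures below are assumptions for illustration (4-bit weights, every active parameter streamed from SSD on every token, no expert caching); the reported throughputs are taken from this digest.

```python
# Upper bound on SSD bandwidth needed if every active weight were streamed
# per token. Assumptions: ~32B active parameters (as reported for Kimi K2.5)
# and 4-bit quantization; real setups cache or keep experts resident, so
# actual bandwidth needs are lower.
active_params = 32e9
bits_per_param = 4                 # assumed quantization
bytes_per_token = active_params * bits_per_param / 8   # 16 GB per token

for tok_per_s in (0.6, 1.7):       # throughputs reported above
    gb_per_s = bytes_per_token * tok_per_s / 1e9
    print(f"{tok_per_s} tok/s -> ~{gb_per_s:.0f} GB/s if fully streamed")
```

Under these assumptions the implied bandwidth (tens of GB/s) would exceed what consumer NVMe SSDs sustain, which suggests that expert reuse across tokens, caching, or partially resident weights must carry much of the load; that is precisely what the I/O-locality and SSD-sensitivity unknowns above probe.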