Empirical Demonstrations On Commodity And Mobile Hardware
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:19
Key takeaways
- @seikixtc reported running Kimi K2.5 (1T total parameters, ~32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
- Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.
- The streaming-experts approach runs large Mixture-of-Experts (MoE) models on hardware without enough RAM to hold the full model: instead of loading all weights into memory, it streams only the expert weights the router selects for each token from SSD on demand.
- @anemll reported running Qwen3.5-397B-A17B on an iPhone at approximately 0.6 tokens per second.
- Daniel Isaac reported getting Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
Sections
Empirical Demonstrations On Commodity And Mobile Hardware
- @seikixtc reported running Kimi K2.5 (1T total parameters, ~32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
- @anemll reported running Qwen3.5-397B-A17B on an iPhone at approximately 0.6 tokens per second.
- Daniel Isaac reported getting Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.
Expectations And Active Optimization Efforts
- Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.
- The author expects the streaming-experts technique to be broadly useful and to continue advancing.
Mechanism: Streamed Expert Weights For MoE Inference
- The streaming-experts approach runs large Mixture-of-Experts (MoE) models on hardware without enough RAM to hold the full model: instead of loading all weights into memory, it streams only the expert weights the router selects for each token from SSD on demand.
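The mechanism above can be sketched as a small per-token loop. This is a hedged illustration, not code from any of the reported setups: all names (`ExpertStreamer`, `moe_route`, the simulated disk read) are assumptions, and the "SSD read" is faked by regenerating weights from a seed where a real implementation would mmap or read a slice of the weights file.

```python
# Illustrative sketch of streamed-expert MoE inference: only the experts the
# router picks for the current token are loaded, with a small LRU cache of
# recently used experts kept resident in RAM.
import math
import random
from collections import OrderedDict

class ExpertStreamer:
    def __init__(self, cache_size=4):
        self.cache = OrderedDict()   # expert_id -> weights resident in RAM
        self.cache_size = cache_size
        self.ssd_reads = 0           # count of simulated disk loads

    def _read_from_ssd(self, expert_id):
        # Stand-in for reading this expert's weight shard from disk; a real
        # implementation would mmap/pread a slice of the model file instead.
        self.ssd_reads += 1
        rng = random.Random(expert_id)
        return [rng.uniform(-0.02, 0.02) for _ in range(8)]  # tiny fake weights

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # LRU: mark as recently used
        else:
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[expert_id] = self._read_from_ssd(expert_id)
        return self.cache[expert_id]

def moe_route(router_logits, top_k=2):
    """Pick the top-k experts and softmax-normalise their gate weights."""
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__)[-top_k:]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Decode loop: per token, only the selected experts are streamed from "SSD".
streamer = ExpertStreamer(cache_size=4)
rng = random.Random(0)
for _token in range(10):
    logits = [rng.gauss(0, 1) for _ in range(64)]  # router scores, 64 experts
    for expert_id, gate in moe_route(logits):
        weights = streamer.get(expert_id)          # disk load, or cache hit
print(f"resident experts: {len(streamer.cache)}, SSD reads: {streamer.ssd_reads}")
```

The LRU cache is why expert-selection patterns matter (per the Unknowns below): if consecutive tokens reuse experts, reads fall well below one full expert set per token; adversarially scattered routing forces a cold read every time.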
Watchlist
- Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.
Unknowns
- What are the end-to-end latency distributions (including worst-case per-token stalls) when streaming experts from SSD under different expert-selection patterns?
- What SSD bandwidth/IOPS and caching strategies are required to achieve the reported performance, and how do results change on slower storage?
- Are the reported runs reproducible by independent users, and what exact software/quantization/settings were used?
- What are the power draw, thermals, and sustained performance characteristics for mobile and laptop deployments over long sessions?
- What quality/performance trade-offs (if any) are introduced by the streaming approach compared with keeping weights resident (for example via different quantization levels or expert caching behavior)?
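The SSD-bandwidth unknown admits a back-of-envelope bound. Every number below is an assumption for illustration (4-bit quantization, a 5 GB/s NVMe SSD, zero cache hits), not a measurement from the reported runs; a large gap between this bound and an observed throughput would indicate substantial expert caching or partial residency in RAM.

```python
# Back-of-envelope: SSD bandwidth as an upper bound on streamed-expert decode
# speed. All inputs are illustrative assumptions, not measured values.
active_params = 32e9        # ~32B parameters active per token (Kimi K2.5)
bytes_per_param = 0.5       # assuming ~4-bit quantization
ssd_bandwidth = 5e9         # bytes/s, assuming a fast NVMe SSD
cache_hit_rate = 0.0        # worst case: every expert read comes from SSD

bytes_per_token = active_params * bytes_per_param * (1 - cache_hit_rate)
max_tokens_per_s = ssd_bandwidth / bytes_per_token
print(f"{max_tokens_per_s:.2f} tokens/s upper bound")  # ~0.31 tokens/s
```

Under these assumptions the ceiling is roughly 0.3 tokens/s, in the same ballpark as the iPhone figure; throughputs above it (e.g. the 128GB M4 Max run) would require cache hits, weights already resident in RAM, or faster effective storage reads.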