Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE
Sources: 1 • Confidence: Low • Updated: 2026-03-25 17:55
Key takeaways
- A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
- Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
- The streaming-experts technique runs large Mixture-of-Experts (MoE) models on hardware with too little RAM to hold the full model: instead of loading all weights into memory, it streams each token's required expert weights from SSD on demand.
- A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.
- Daniel Isaac reportedly ran Kimi K2.5 on a 128GB M4 Max at about 1.7 tokens per second.
Sections
Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE
- A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
- Daniel Isaac reportedly ran Kimi K2.5 on a 128GB M4 Max at about 1.7 tokens per second.
Active Optimization Efforts And Expectation Of Near-Term Improvement
- Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
- The author expects the streaming-experts technique to be broadly useful and to continue to advance.
Streamed-Expert Inference As A Memory Workaround For Large MoE
- The streaming-experts technique runs large Mixture-of-Experts (MoE) models on hardware with too little RAM to hold the full model: instead of loading all weights into memory, it streams each token's required expert weights from SSD on demand.
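The sources do not describe the implementation, but the per-token mechanic can be sketched as a router-driven expert fetch with a small RAM-resident cache. Everything below is a hypothetical illustration: `ExpertStreamer`, `load_fn`, and the LRU policy are assumptions, not details from the reports.

```python
from collections import OrderedDict

class ExpertStreamer:
    """Hypothetical sketch of streamed-expert inference: keep only a few
    recently used experts resident in RAM and read the rest from SSD on
    demand, once per token that routes to them."""

    def __init__(self, load_fn, capacity):
        self.load_fn = load_fn      # stand-in for an SSD read of one expert's weights
        self.capacity = capacity    # max experts held in RAM at once
        self.cache = OrderedDict()  # expert_id -> weights, in LRU order
        self.ssd_reads = 0          # counts simulated SSD fetches

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # cache hit: mark as most recent
        else:
            self.ssd_reads += 1                 # cache miss: stream from SSD
            self.cache[expert_id] = self.load_fn(expert_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used expert
        return self.cache[expert_id]

def decode(streamer, routed_experts_per_token):
    """For each token, fetch only the experts the MoE router selected."""
    outputs = []
    for experts in routed_experts_per_token:
        weights = [streamer.get(e) for e in experts]
        outputs.append(sum(weights))  # placeholder for the real expert computation
    return outputs
```

With a dummy loader (`load_fn=lambda e: e`) and capacity 2, decoding three tokens routed to experts `[[0, 1], [1, 2], [0, 1]]` triggers five simulated SSD reads, showing how a too-small cache forces repeated streaming of evicted experts.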
Reported Mobile On-Device Execution Of Very Large Models Via Streaming
- A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.
Watchlist
- Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
Unknowns
- What were the exact benchmark methodologies and workloads (prompt types, context length, batch size, decoding settings) used for the reported tokens/sec figures?
- What storage bandwidth and latency characteristics (SSD class, interface, read amplification) are required for streamed-expert inference to work well?
- What is the sustained (not peak) performance over long sessions, including thermals and power draw, especially on mobile devices?
- How reproducible are the reported runs across different hardware configurations and software environments?
- What is the end-to-end user-perceived latency profile (time-to-first-token, jitter) when experts are streamed per token?