Rosa Del Mar

Daily Brief

Issue 83 2026-03-24

Empirical Demonstrations On Commodity And Mobile Hardware

Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:19

Key takeaways

  • @seikixtc reported running Kimi K2.5 (1T total parameters, 32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
  • Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.
  • The streaming-experts approach runs large Mixture-of-Experts (MoE) models on hardware without enough RAM to hold them by streaming the required expert weights from SSD for each token processed, rather than loading the entire model into memory.
  • @anemll reported running Qwen3.5-397B-A17B on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac reported running Kimi K2.5 on a 128GB M4 Max at about 1.7 tokens per second.

Sections

Empirical Demonstrations On Commodity And Mobile Hardware

  • @seikixtc reported running Kimi K2.5 (1T total parameters, 32B active per token) in 96GB of RAM on an M2 Max MacBook Pro.
  • @anemll reported running Qwen3.5-397B-A17B on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac reported running Kimi K2.5 on a 128GB M4 Max at about 1.7 tokens per second.

Expectations And Active Optimization Efforts

  • Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.
  • The author expects the streaming-experts technique to be broadly useful and to continue advancing.

Mechanism: Streamed Expert Weights For MoE Inference

  • The streaming-experts approach runs large Mixture-of-Experts (MoE) models on hardware without enough RAM to hold them by streaming the required expert weights from SSD for each token processed, rather than loading the entire model into memory.
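The mechanism can be sketched in miniature: a router picks a few experts per token, and their weights are fetched on demand into a bounded RAM cache, with a miss standing in for an SSD read. This is an illustrative sketch only; the class names, expert counts, and random routing are assumptions, not details of any of the implementations reported above.

```python
from collections import OrderedDict
import random

# Illustrative sketch of streamed-expert MoE inference.
# All names and sizes here are hypothetical, not from a real implementation.

NUM_EXPERTS = 64        # experts in one MoE layer (illustrative)
TOP_K = 4               # experts activated per token (illustrative)
CACHE_CAPACITY = 16     # experts that fit in RAM at once (illustrative)

class ExpertCache:
    """LRU cache standing in for RAM-resident expert weights."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                    # would trigger an SSD read
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used
            self.cache[expert_id] = f"weights-for-expert-{expert_id}"
        return self.cache[expert_id]

def run_tokens(num_tokens, seed=0):
    """Route each token to TOP_K experts, fetching weights on demand."""
    rng = random.Random(seed)
    cache = ExpertCache(CACHE_CAPACITY)
    for _ in range(num_tokens):
        # A real router is learned; random selection here is for illustration.
        for expert_id in rng.sample(range(NUM_EXPERTS), TOP_K):
            cache.fetch(expert_id)
    return cache

cache = run_tokens(1000)
print(f"hits={cache.hits} misses={cache.misses}")
```

The cache hit rate is the whole story for throughput under this scheme: every miss costs an SSD read, so real systems depend on how skewed the expert-selection distribution is (uniform-random routing, as simulated here, is the worst case for caching).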

Watchlist

  • Dan Woods and collaborators are running autoresearch loops to find further performance optimizations for streamed-expert inference.

Unknowns

  • What are the end-to-end latency distributions (including worst-case per-token stalls) when streaming experts from SSD under different expert-selection patterns?
  • What SSD bandwidth/IOPS and caching strategies are required to achieve the reported performance, and how do results change on slower storage?
  • Are the reported runs reproducible by independent users, and what exact software/quantization/settings were used?
  • What are the power draw, thermals, and sustained performance characteristics for mobile and laptop deployments over long sessions?
  • What quality/performance trade-offs (if any) are introduced by the streaming approach compared with keeping weights resident (for example via different quantization levels or expert caching behavior)?
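A back-of-envelope calculation shows why the SSD-bandwidth and caching questions above matter. Assuming 4-bit quantization (an assumption; the actual quantization used in the reported runs is not stated) and zero caching, the naive bandwidth needed for the reported throughput is:

```python
# Back-of-envelope: naive per-token streaming bandwidth.
# Assumes 4-bit quantization and NO expert caching; figures are illustrative.

ACTIVE_PARAMS = 32e9      # reported active parameters per token (Kimi K2.5)
BYTES_PER_PARAM = 0.5     # 4-bit quantization (assumption)
TOKENS_PER_SEC = 1.7      # throughput reported on a 128GB M4 Max

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM      # 16 GB per token
required_bandwidth = bytes_per_token * TOKENS_PER_SEC  # with zero caching

print(f"{bytes_per_token / 1e9:.1f} GB per token")
print(f"{required_bandwidth / 1e9:.1f} GB/s if nothing is cached")
```

Under these assumptions the zero-cache requirement (roughly 27 GB/s) far exceeds typical consumer NVMe bandwidth, which suggests the reported throughput depends on most active expert weights being served from RAM, making cache hit rate and expert-selection skew the key variables behind the unknowns listed above.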

Investor overlay

Read-throughs

  • Streaming expert weights from SSD may shift local inference bottlenecks from RAM capacity toward storage bandwidth and latency, potentially enabling very large MoE-class models on consumer laptops and phones if storage and caching are sufficient.
  • If reproducible, streamed-expert inference could increase the value of high-throughput SSD and memory architectures in consumer hardware used for on-device AI, as performance may depend on storage IOPS and sustained bandwidth.
  • Ongoing autoresearch optimization loops suggest near-term performance tuning may be possible, implying a moving benchmark for tokens per second on commodity devices as software and scheduling improve.

What would confirm

  • Independent reproductions with shared settings showing end-to-end per-token latency distributions, including worst-case stalls, across multiple devices using streamed experts.
  • Benchmarks that map tokens per second to SSD bandwidth, IOPS, and caching strategy, demonstrating predictable degradation on slower storage and improvement on faster storage.
  • Sustained-session measurements reporting power draw, thermals, and stable throughput over time on phones and laptops, plus documented quality and quantization trade-offs versus resident weights.

What would kill

  • Repeated failures to reproduce the reported throughput on similar hardware when software, quantization, and settings are disclosed and standardized.
  • Latency and stall behavior that is highly variable or unacceptable in practice, driven by expert selection patterns and storage access, even with fast SSD and caching.
  • Evidence that required storage performance or power and thermal limits prevent sustained use on consumer devices, or that quality degradation from streaming and quantization is material versus resident loading.

Sources

  1. 2026-03-24 simonwillison.net