Rosa Del Mar

Daily Brief

Issue 83 2026-03-24

Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE

Sources: 1 • Confidence: Low • Updated: 2026-03-25 17:55

Key takeaways

  • A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
  • The streaming-experts technique runs large Mixture-of-Experts models on hardware without enough RAM to hold the full model by streaming the required expert weights from SSD for each token, instead of loading the entire model into memory.
  • A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Sections

Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE

  • A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Active Optimization Efforts And Expectation Of Near-Term Improvement

  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
  • The author expects the streaming-experts technique to be broadly useful and to continue to advance.
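The source does not say which optimizations are being explored. One obvious candidate, shown here purely for illustration, is keeping recently used experts resident in RAM so that repeat selections skip the SSD entirely. A minimal LRU-cache sketch, with a hypothetical `load_expert_from_ssd` standing in for the actual disk read:

```python
from collections import OrderedDict

class ExpertCache:
    """Keep the most recently used expert weights in RAM (LRU eviction),
    so tokens that reuse an expert avoid a repeat SSD read."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache: OrderedDict[int, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._cache:
            self.hits += 1
            self._cache.move_to_end(expert_id)   # mark most recently used
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_expert_from_ssd(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict least recently used
        return weights

    @staticmethod
    def load_expert_from_ssd(expert_id: int) -> bytes:
        # Hypothetical stand-in: a real runtime would read the expert's
        # weight shard from disk here.
        return b"weights-%d" % expert_id

cache = ExpertCache(capacity=2)
for eid in [1, 2, 1, 3, 1]:        # expert 1 is reused and stays cached
    cache.get(eid)
print(cache.hits, cache.misses)    # 2 3
```

How much a cache like this helps depends on how correlated expert selection is across consecutive tokens, which the reported runs do not disclose.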

Streamed-Expert Inference As A Memory Workaround For Large MoE

  • The streaming-experts technique runs large Mixture-of-Experts models on hardware without enough RAM to hold the full model by streaming the required expert weights from SSD for each token, instead of loading the entire model into memory.
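The per-token streaming idea can be sketched as follows. Everything here is illustrative: the sizes, the router, and `load_expert` (which stands in for an SSD read) are invented for the example, and real runtimes are far more involved:

```python
import numpy as np

# Hypothetical layout: each expert's weights live in a separate shard on SSD.
# Only the experts the router selects for the current token are read into RAM.
EXPERT_DIM = 1024          # illustrative sizes, not the real model's
NUM_EXPERTS = 64
TOP_K = 4                  # experts activated per token

def load_expert(expert_id: int) -> np.ndarray:
    """Stand-in for an SSD read; a real runtime would mmap or pread
    the expert's weight shard from disk here."""
    rng = np.random.default_rng(expert_id)   # deterministic fake weights
    return rng.standard_normal((EXPERT_DIM, EXPERT_DIM)).astype(np.float32)

def route(hidden: np.ndarray, num_experts: int, top_k: int) -> list[int]:
    """Toy router: pick the top-k experts by a random-projection score."""
    rng = np.random.default_rng(0)
    logits = rng.standard_normal((num_experts, hidden.shape[0])) @ hidden
    return list(np.argsort(logits)[-top_k:])

def moe_layer(hidden: np.ndarray) -> np.ndarray:
    """One MoE layer: stream in only the selected experts, then discard them."""
    out = np.zeros_like(hidden)
    for eid in route(hidden, NUM_EXPERTS, TOP_K):
        w = load_expert(eid)          # fetched per token, never all resident
        out += w @ hidden
    return out / TOP_K

token_state = np.ones(EXPERT_DIM, dtype=np.float32)
print(moe_layer(token_state).shape)   # (1024,)
```

The memory win is that at most `TOP_K` expert matrices are resident at once; the cost is a disk read on every token, which is why storage bandwidth appears in the unknowns below.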

Reported Mobile On-Device Execution Of Very Large Models Via Streaming

  • A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.

Watchlist

  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.

Unknowns

  • What were the exact benchmark methodologies and workloads (prompt types, context length, batch size, decoding settings) used for the reported tokens/sec figures?
  • What storage bandwidth and latency characteristics (SSD class, interface, read amplification) are required for streamed-expert inference to work well?
  • What is the sustained (not peak) performance over long sessions, including thermals and power draw, especially on mobile devices?
  • How reproducible are the reported runs across different hardware configurations and software environments?
  • What is the end-to-end user-perceived latency profile (time-to-first-token, jitter) when experts are streamed per token?
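To make the storage-bandwidth unknown concrete, here is a back-of-envelope worst case under stated assumptions (4-bit quantization, every active weight streamed from SSD with no caching and no overlap with compute); none of these numbers are measurements from the reported runs:

```python
# Worst-case SSD bandwidth needed if every active parameter is read
# from disk for every token. All figures are assumptions for illustration.
active_params = 32e9          # Kimi K2.5: ~32B active weights per token
bytes_per_param = 0.5         # assume 4-bit quantization
tokens_per_sec = 1.7          # the reported M4 Max figure

bytes_per_token = active_params * bytes_per_param          # 16 GB per token
required_gb_per_sec = bytes_per_token * tokens_per_sec / 1e9

print(f"{required_gb_per_sec:.0f} GB/s")   # 27 GB/s
```

That worst-case figure far exceeds consumer SSD bandwidth, which suggests the reported runs must keep most weights resident in the 96-128GB of RAM and stream only a fraction per token; how large that fraction is falls under the methodology unknowns above.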

Investor overlay

Read-throughs

  • If streamed-expert inference is practical, demand could rise for high-bandwidth SSDs and fast storage interfaces, because weights are fetched per token rather than kept in RAM.
  • If large MoE models run acceptably on 96GB to 128GB Apple silicon, local AI experimentation could shift toward higher-end consumer devices, potentially affecting hardware upgrade cycles.
  • Ongoing autoresearch optimization loops suggest near-term improvements in streamed-expert inference throughput and latency, which could shorten benchmark shelf life and accelerate adoption if the results generalize.

What would confirm

  • Reproducible benchmarks across hardware and software showing sustained tokens per second, plus time-to-first-token and jitter, for streamed-expert inference over long sessions.
  • Clear reporting of workload methodology, including prompt types, context length, batch size, and decoding settings, with comparisons against non-streaming baselines on the same devices.
  • Evidence that the required SSD bandwidth and latency are achievable on common consumer devices without severe thermal or power throttling, including sustained performance on mobile.

What would kill

  • Independent replication fails or shows materially lower sustained throughput once thermals, power draw, long context, or real workloads come into play.
  • Storage constraints dominate, such that per-token expert streaming causes unacceptable latency, jitter, or SSD bottlenecks on typical consumer hardware.
  • Performance gains from the optimization efforts do not generalize beyond narrow setups, with results highly sensitive to specific hardware configurations or software environments.

Sources

  1. 2026-03-24 simonwillison.net