Rosa Del Mar

Daily Brief

Issue 83 2026-03-24

Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE

Sources: 1 • Confidence: Low • Updated: 2026-03-25 17:55

Key takeaways

  • A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
  • The streaming-experts technique runs large Mixture-of-Experts models on hardware without enough RAM to hold the full model by streaming the required expert weights from SSD for each token, instead of loading the entire model into memory.
  • A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Sections

Reported Consumer-Laptop Feasibility For Trillion-Parameter-Class MoE

  • A report claims Kimi K2.5 (1T parameters with 32B active weights) was run in 96GB of RAM on an M2 Max MacBook Pro.
  • Daniel Isaac got Kimi K2.5 working on a 128GB M4 Max at about 1.7 tokens per second.

Active Optimization Efforts And Expectation Of Near-Term Improvement

  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.
  • The author expects the streaming-experts technique to be broadly useful and to continue to advance.
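The source does not say which optimizations are being explored. One obvious candidate, shown here purely for illustration, is keeping recently used experts resident in RAM so that repeat selections skip the SSD entirely. A minimal LRU-cache sketch, with a hypothetical `load_expert_from_ssd` standing in for the actual disk read:

```python
from collections import OrderedDict

class ExpertCache:
    """Keep the most recently used expert weights in RAM (LRU eviction),
    so tokens that reuse an expert avoid a repeat SSD read."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache: OrderedDict[int, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._cache:
            self.hits += 1
            self._cache.move_to_end(expert_id)   # mark most recently used
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_expert_from_ssd(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict least recently used
        return weights

    @staticmethod
    def load_expert_from_ssd(expert_id: int) -> bytes:
        # Hypothetical stand-in: a real runtime would read the expert's
        # weight shard from disk here.
        return b"weights-%d" % expert_id

cache = ExpertCache(capacity=2)
for eid in [1, 2, 1, 3, 1]:        # expert 1 is reused and stays cached
    cache.get(eid)
print(cache.hits, cache.misses)    # 2 3
```

How much a cache like this helps depends on how correlated expert selection is across consecutive tokens, which the reported runs do not disclose.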

Streamed-Expert Inference As A Memory Workaround For Large MoE

  • The streaming-experts technique runs large Mixture-of-Experts models on hardware without enough RAM to hold the full model by streaming the required expert weights from SSD for each token, instead of loading the entire model into memory.
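The per-token streaming idea can be sketched as follows. Everything here is illustrative: the sizes, the router, and `load_expert` (which stands in for an SSD read) are invented for the example, and real runtimes are far more involved:

```python
import numpy as np

# Hypothetical layout: each expert's weights live in a separate shard on SSD.
# Only the experts the router selects for the current token are read into RAM.
EXPERT_DIM = 1024          # illustrative sizes, not the real model's
NUM_EXPERTS = 64
TOP_K = 4                  # experts activated per token

def load_expert(expert_id: int) -> np.ndarray:
    """Stand-in for an SSD read; a real runtime would mmap or pread
    the expert's weight shard from disk here."""
    rng = np.random.default_rng(expert_id)   # deterministic fake weights
    return rng.standard_normal((EXPERT_DIM, EXPERT_DIM)).astype(np.float32)

def route(hidden: np.ndarray, num_experts: int, top_k: int) -> list[int]:
    """Toy router: pick the top-k experts by a random-projection score."""
    rng = np.random.default_rng(0)
    logits = rng.standard_normal((num_experts, hidden.shape[0])) @ hidden
    return list(np.argsort(logits)[-top_k:])

def moe_layer(hidden: np.ndarray) -> np.ndarray:
    """One MoE layer: stream in only the selected experts, then discard them."""
    out = np.zeros_like(hidden)
    for eid in route(hidden, NUM_EXPERTS, TOP_K):
        w = load_expert(eid)          # fetched per token, never all resident
        out += w @ hidden
    return out / TOP_K

token_state = np.ones(EXPERT_DIM, dtype=np.float32)
print(moe_layer(token_state).shape)   # (1024,)
```

The memory win is that at most `TOP_K` expert matrices are resident at once; the cost is a disk read on every token, which is why storage bandwidth appears in the unknowns below.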

Reported Mobile On-Device Execution Of Very Large Models Via Streaming

  • A demonstration claims Qwen3.5-397B-A17B ran on an iPhone at approximately 0.6 tokens per second.

Watchlist

  • Dan Woods and collaborators are running autoresearch loops to find optimizations that increase performance for streamed-expert inference.

Unknowns

  • What were the exact benchmark methodologies and workloads (prompt types, context length, batch size, decoding settings) used for the reported tokens/sec figures?
  • What storage bandwidth and latency characteristics (SSD class, interface, read amplification) are required for streamed-expert inference to work well?
  • What is the sustained (not peak) performance over long sessions, including thermals and power draw, especially on mobile devices?
  • How reproducible are the reported runs across different hardware configurations and software environments?
  • What is the end-to-end user-perceived latency profile (time-to-first-token, jitter) when experts are streamed per token?
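To make the storage-bandwidth unknown concrete, here is a back-of-envelope worst case under stated assumptions (4-bit quantization, every active weight streamed from SSD with no caching and no overlap with compute); none of these numbers are measurements from the reported runs:

```python
# Worst-case SSD bandwidth needed if every active parameter is read
# from disk for every token. All figures are assumptions for illustration.
active_params = 32e9          # Kimi K2.5: ~32B active weights per token
bytes_per_param = 0.5         # assume 4-bit quantization
tokens_per_sec = 1.7          # the reported M4 Max figure

bytes_per_token = active_params * bytes_per_param          # 16 GB per token
required_gb_per_sec = bytes_per_token * tokens_per_sec / 1e9

print(f"{required_gb_per_sec:.0f} GB/s")   # 27 GB/s
```

That worst-case figure far exceeds consumer SSD bandwidth, which suggests the reported runs must keep most weights resident in the 96-128GB of RAM and stream only a fraction per token; how large that fraction is falls under the methodology unknowns above.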

Investor overlay

Read-throughs

  • If streamed-expert inference is practical, demand could rise for high-bandwidth SSDs and fast storage interfaces, because weights are fetched per token rather than kept in RAM.
  • If large MoE models run acceptably on 96GB to 128GB Apple silicon, local AI experimentation could shift toward higher-end consumer devices, potentially affecting hardware upgrade cycles.
  • Ongoing autoresearch optimization loops suggest near-term improvements in streamed-expert inference throughput and latency, which could shorten benchmark shelf life and accelerate adoption if the results generalize.

What would confirm

  • Reproducible benchmarks across hardware and software showing sustained tokens per second, plus time-to-first-token and jitter, for streamed-expert inference over long sessions.
  • Clear reporting of workload methodology, including prompt types, context length, batch size, and decoding settings, with comparisons against non-streaming baselines on the same devices.
  • Evidence that the required SSD bandwidth and latency are achievable on common consumer devices without severe thermal or power throttling, including sustained performance on mobile.

What would kill

  • Independent replication fails or shows materially lower sustained throughput once thermals, power draw, long context, or real workloads come into play.
  • Storage constraints dominate, such that per-token expert streaming causes unacceptable latency, jitter, or SSD bottlenecks on typical consumer hardware.
  • Performance gains from the optimization efforts do not generalize beyond narrow setups, with results highly sensitive to specific hardware configurations or software environments.

Sources

  1. 2026-03-24 simonwillison.net