Rosa Del Mar

Daily Brief

Issue 79 • 2026-03-20

Token Economics: Budgeting and Governance as Operational Constraints

7 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-03-25 17:57

Key takeaways

  • As AI usage scales, organizations tend to use a portfolio of models selected for task fit, cost, and performance, managed via a model registry.
  • Azeem Azhar reported his OpenClaw-based agent usage rose from about 100 million tokens/day to a peak day of about 870 million tokens/day, with average days now above 200 million tokens/day.
  • AI workloads are shifting from training-dominant compute toward inference-dominant compute as models are run continuously at scale.
  • A key open question identified is whether compute supply will catch up with demand and where and when bottlenecks will emerge.
  • The episode claims a combined NVIDIA-and-Groq approach could deliver roughly a 35x improvement in throughput per megawatt versus NVIDIA’s current generation.

Sections

Token Economics: Budgeting and Governance as Operational Constraints

  • As AI usage scales, organizations tend to use a portfolio of models selected for task fit, cost, and performance, managed via a model registry.
  • Organizations will need governance to control agent behavior and token consumption as agents communicate and coordinate.
  • Treating tokens as an IT-owned cost-center resource is described as a misallocation because tokens represent manufactured cognition tied to business value creation.
  • As agents trigger token-expensive workloads autonomously, firms will need governance mechanisms such as token budgets and approval controls to manage spend and risk.
  • Token budgets are described as a key constraint that can materially shape the value teams extract from AI tools.
  • At Exponential View, a researcher was encouraged to increase token usage by roughly five times, with escalation only if usage needed to go far beyond that.
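The budget-and-approval controls described above could be sketched as a simple per-team guardrail. This is an illustrative sketch only, not anything from the source: the class and method names (`TokenBudget`, `request_tokens`) and the 80% escalation threshold are assumptions.

```python
# Hypothetical sketch of a token budget with an approval gate, the kind of
# governance control the bullets above describe. All names and thresholds
# are assumptions, not from the episode.

class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, daily_limit: int, approval_threshold: float = 0.8):
        self.daily_limit = daily_limit                # hard cap on tokens/day
        self.approval_threshold = approval_threshold  # fraction that triggers escalation
        self.used = 0

    def request_tokens(self, n: int) -> str:
        """Grant n tokens, flag for approval near the cap, refuse past it."""
        if self.used + n > self.daily_limit:
            raise BudgetExceeded(
                f"{n} tokens would exceed the daily limit of {self.daily_limit}")
        self.used += n
        if self.used >= self.approval_threshold * self.daily_limit:
            return "granted-needs-approval"  # escalate before the next large job
        return "granted"

budget = TokenBudget(daily_limit=100_000_000)  # ~100M tokens/day, the scale cited above
print(budget.request_tokens(50_000_000))   # "granted"
print(budget.request_tokens(40_000_000))   # "granted-needs-approval" (90% of cap)
```

A real deployment would meter usage at the gateway or harness layer; the point is that the budget, not the model, becomes the binding constraint on what a team can extract.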

Agent Harness Layer As Adoption Accelerator

  • Azeem Azhar reported his OpenClaw-based agent usage rose from about 100 million tokens/day to a peak day of about 870 million tokens/day, with average days now above 200 million tokens/day.
  • Exponential View deployed multiple OpenClaw-based internal agents and materially increased owned compute to support them.
  • Agentic harnesses like OpenClaw can increase AI usage by making models easier to apply to real work.
  • OpenClaw went from being relatively unknown to being cited by the CEO of the world’s most valuable company within roughly 45 days.
  • Jensen Huang compared OpenClaw to a web browser in importance and said every company needs an OpenClaw strategy.

Shift to an Inference-Led AI Economy

  • AI workloads are shifting from training-dominant compute toward inference-dominant compute as models are run continuously at scale.
  • As tasks move from simple chat to reasoning and agentic workflows, token usage per task rises sharply, increasing inference compute demand.
  • NVIDIA GTC messaging emphasized AI inference as the key accelerator for downstream AI applications.
  • A strong signal from GTC is that the AI economy is shifting from a training-led phase to an inference-led phase.
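The per-task multiplier behind the second bullet can be made concrete with back-of-envelope arithmetic. The token counts below are assumptions chosen for illustration, not figures from the source; only the qualitative claim (agentic workflows consume orders of magnitude more tokens per task) comes from the episode.

```python
# Illustrative only: why moving from chat to agentic workflows multiplies
# inference demand even at constant task volume. Per-task token counts
# are assumptions, not figures from the source.

chat_tokens_per_task = 2_000       # assumed: a short chat exchange
agent_tokens_per_task = 200_000    # assumed: multi-step reasoning plus tool calls
tasks_per_day = 10_000

chat_demand = chat_tokens_per_task * tasks_per_day
agent_demand = agent_tokens_per_task * tasks_per_day
print(f"chat: {chat_demand:,} tokens/day; "
      f"agentic: {agent_demand:,} tokens/day "
      f"({agent_demand // chat_demand}x)")
```

Under these assumed numbers, the same task volume drives a 100x increase in daily inference compute, which is the mechanism behind the training-to-inference shift flagged at GTC.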

Capacity Expectations, Backlog, and Supply-Bottleneck Watch

  • A key open question identified is whether compute supply will catch up with demand and where and when bottlenecks will emerge.
  • NVIDIA’s committed order backlog was described as increasing from roughly $500B to roughly $1T with products booked out to around 2027.

Hardware Strategy Claims: Heterogeneous Inference Acceleration

  • The episode claims a combined NVIDIA-and-Groq approach could deliver roughly a 35x improvement in throughput per megawatt versus NVIDIA’s current generation.
  • The episode asserts NVIDIA made a major deal involving Groq to address inference needs and plans to combine NVIDIA GPUs with Groq technology for higher inference efficiency.
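To see what the 35x claim would mean if it held, here is a back-of-envelope sketch at fixed facility power. The baseline throughput and facility size are assumptions picked for illustration; only the 35x multiplier comes from the episode, and as noted elsewhere in this brief the underlying deal is uncorroborated.

```python
# Back-of-envelope check of the episode's 35x throughput-per-megawatt claim.
# Baseline throughput and facility power are assumptions; only the
# multiplier (35x) is taken from the episode, and it is unverified.

baseline_tokens_per_sec_per_mw = 1_000_000  # assumed baseline, tokens/s per MW
claimed_multiplier = 35                     # episode's NVIDIA-plus-Groq claim
facility_power_mw = 100                     # assumed facility size

baseline = baseline_tokens_per_sec_per_mw * facility_power_mw
improved = baseline * claimed_multiplier
print(f"baseline: {baseline:.2e} tokens/s; claimed: {improved:.2e} tokens/s")
# Equivalently: the same token throughput at roughly 1/35th the power draw.
```

Framed either way (more tokens at fixed power, or fixed tokens at lower power), a multiplier of this size is why throughput per megawatt shows up below as a candidate competitive differentiator.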

Watchlist

  • A key open question identified is whether compute supply will catch up with demand and where and when bottlenecks will emerge.

Unknowns

  • What is the actual industry-wide mix of training vs inference compute and spend, and how fast is it changing?
  • Are the quoted NVIDIA backlog figures (scale and booked-out timeline) accurate and comparable to standard backlog definitions?
  • Did NVIDIA make a major deal involving Groq, and if so, what is the product/architecture scope and commercialization timeline?
  • What measured perf-per-watt and cost-per-token improvements are achievable for real-world inference workloads (separately for pre-fill and decode)?
  • How representative are the reported token usage levels and growth rates (hundreds of millions of tokens/day) across organizations adopting agent harnesses?

Investor overlay

Read-throughs

  • As inference and agent usage scale, AI spend may shift away from model training alone toward serving infrastructure and operational tooling (model registries, routing, budgeting, and governance controls), which become the gating constraints.
  • If compute supply lags demand, constraints could appear first in inference capacity availability and energy efficiency, making throughput per megawatt and cost per token key competitive differentiators across inference stacks.
  • If heterogeneous inference acceleration claims prove real, a combined NVIDIA and Groq style approach could change competitive dynamics in inference throughput per megawatt, but the summary flags this as uncorroborated.

What would confirm

  • Public metrics showing inference spend and utilization rising faster than training, including disclosed mix shifts, sustained high token volumes, or stronger growth tied to agentic workloads.
  • Evidence that budgeting, governance, and orchestration are deployment bottlenecks, such as widespread adoption of model registries, policy controls, routing systems, and explicit cost governance programs tied to scaling usage.
  • Primary-source confirmation and third-party benchmarks validating the scope of any NVIDIA and Groq partnership, with measured perf-per-watt and cost-per-token gains for real inference workloads, reported separately for pre-fill and decode.

What would kill

  • Data showing training remains the dominant driver of compute spend and capacity, with no sustained shift toward inference-led workloads at scale.
  • Clear signs that compute supply has caught up, with no meaningful inference capacity bottlenecks and diminishing relevance of throughput per megawatt as a differentiator.
  • No verifiable NVIDIA and Groq deal, or benchmarks showing no material perf-per-watt or cost-per-token improvement in real-world inference compared with current-generation alternatives.

Sources

  1. 2026-03-20 exponentialview.co