Rosa Del Mar

Daily Brief

Issue 75 2026-03-16

Agent Architecture As Orchestration Around An LLM

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:25

Key takeaways

  • After an LLM emits a tool call, the harness extracts and executes the call, then feeds the result back to the model in a follow-up prompt.
  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
  • Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
  • Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.

Sections

Agent Architecture As Orchestration Around An LLM

  • After an LLM emits a tool call, the harness extracts and executes the call, then feeds the result back to the model in a follow-up prompt.
  • Coding agents typically prepend a hidden system prompt, often very long, that instructs the model how to behave and how to use tools.
  • A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
  • In coding agents, tools are functions exposed by the harness that the LLM can invoke using a specified calling format embedded in its output.
  • A minimal description of most coding agents is an LLM plus a system prompt plus tools running in a loop.
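
The "LLM plus system prompt plus tools in a loop" description above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's API: the model here is a stub that emits one tool call and then answers from the result, and the JSON calling format and `read_file` tool are invented for the example.

```python
import json

# Hidden system prompt: tells the model how to behave and how to emit tool calls.
SYSTEM_PROMPT = 'You are a coding agent. To call a tool, emit JSON: {"tool": ..., "args": ...}'

# Tools: functions exposed by the harness that the model can invoke.
TOOLS = {
    "read_file": lambda args: f"contents of {args['path']}",
}

def stub_model(messages):
    # Stand-in for a provider API call: asks for a file on the first turn,
    # then produces a final answer once a tool result appears in the transcript.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if tool_results:
        return "The file contains: " + tool_results[-1]["content"]
    return json.dumps({"tool": "read_file", "args": {"path": "main.py"}})

def run_agent(user_prompt, model=stub_model, max_turns=5):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt}]
    output = ""
    for _ in range(max_turns):
        output = model(messages)
        try:
            call = json.loads(output)        # did the model emit a tool call?
        except ValueError:
            return output                    # plain text: treat as final answer
        result = TOOLS[call["tool"]](call["args"])   # harness executes the tool
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": result})  # feed result back
    return output
```

The loop terminates either when the model stops emitting tool calls or when a turn budget is exhausted, which is also where real harnesses add parsing validation and sandboxing.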

Tokenization, Statelessness, And Cost Growth With Context Length

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
  • Providers charge for both input and output tokens, so longer conversations become more expensive as input token count grows.
  • Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
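
The cost-growth point can be made concrete with a toy calculation. This sketch uses a crude words-as-tokens proxy (real tokenizers split subwords, so counts differ) and an invented transcript; the mechanism it shows is that, because the model is stateless, every turn replays the whole prior transcript as input.

```python
def count_tokens(text):
    # Crude proxy: one token per whitespace-separated word.
    return len(text.split())

def turn_input_tokens(transcript):
    # Each new request re-sends every prior message as input.
    return sum(count_tokens(msg) for msg in transcript)

transcript = []
input_tokens_per_turn = []
for turn in ["fix the bug in parse()", "now add a test", "refactor for clarity"]:
    transcript.append(turn)
    input_tokens_per_turn.append(turn_input_tokens(transcript))
# Input tokens grow monotonically because earlier turns are replayed every time.
```

In a real deployment assistant replies are replayed too, so input growth is even steeper than this sketch suggests.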

Reasoning Modes As Adjustable Compute For Harder Tasks

  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
  • Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
  • Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
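
How an agent might expose adjustable effort can be sketched as a request builder. The field name `reasoning_effort` and the low/medium/high scale are assumptions for illustration; actual parameter names and value ranges vary by provider.

```python
def build_request(prompt, effort="medium"):
    # Hypothetical request shape: escalate effort (and thus cost/latency)
    # for harder tasks such as debugging, keep it low for simple edits.
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed provider parameter name
    }

# A debugging task might warrant high effort:
req = build_request("trace why test_login fails intermittently", effort="high")
```

The design point is that effort becomes a per-task policy knob rather than a fixed model property.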

Prompt-Prefix Caching As A Cost/Latency Lever

  • Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
  • Coding agents often avoid modifying earlier conversation content so that the shared prompt prefix stays cacheable across turns.
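
Why immutability of earlier turns matters for caching can be shown with a simplified prefix check. Real providers have their own cache-hit rules (time windows, minimum prefix lengths), which the Unknowns section flags as unconfirmed; this sketch only illustrates that editing any early message invalidates everything after it.

```python
def shared_prefix_len(old_msgs, new_msgs):
    # Count leading messages that are byte-identical between two prompts;
    # a prefix cache can only discount tokens within this shared run.
    n = 0
    for a, b in zip(old_msgs, new_msgs):
        if a != b:
            break
        n += 1
    return n

prev = ["system: you are a coding agent",
        "user: fix the bug",
        "assistant: done"]

appended = prev + ["user: now add a test"]  # append-only: entire prior prompt reused
edited = ["system: you are a helpful agent"] + prev[1:] + ["user: now add a test"]
```

Appending keeps all three prior messages cacheable; rewording just the system prompt drops the shared prefix to zero, which is why agents tend to treat earlier turns as immutable.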

Multimodal Inputs As Token Streams

  • Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.
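
A vision input typically travels in the same message structure as text, which is why the model can attend to it directly rather than routing through OCR. The field names (`type`, `text`, `image`, `data`) below are illustrative assumptions; exact shapes differ per API.

```python
import base64

def image_message(image_bytes, question):
    # Hypothetical multimodal message: image bytes are sent inline and the
    # provider converts them into tokens processed alongside the text tokens.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image", "data": base64.b64encode(image_bytes).decode()},
        ],
    }

msg = image_message(b"\x89PNG...", "What error does this screenshot show?")
```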

Unknowns

  • Which specific providers/models offer cached-token discounts, and what are the precise cache-hit rules (time windows, prefix matching tolerance, system-prompt inclusion)?
  • What are the actual token-based pricing schedules (input vs output), context window limits, and how they vary across providers for representative coding-agent traces?
  • How often do tool-call loops fail in practice (parsing errors, execution errors, incorrect tool selection), and what observability/guardrails are used to mitigate them?
  • How are code-execution tools sandboxed (filesystem, network, credentials), and what is the incident profile for destructive commands or data exfiltration in real deployments?
  • What is the empirical effect of reasoning modes and adjustable effort on success rate, latency, and tokens for debugging and other coding tasks?

Investor overlay

Read-throughs

  • Orchestration layers and tool harnesses may be key value capture in agent products, shifting differentiation from base models to tooling, prompt policy, and observability.
  • Token economics and context replay make cost per turn rise with conversation length, favoring product designs that compress memory, stabilize prompts, or limit context growth.
  • Adjustable reasoning effort and prompt prefix caching create new levers to trade cost and latency for quality, implying spending and UX will become policy-tunable per task.

What would confirm

  • Providers publish or standardize cached-token discount rules, leading developers to adopt stable prompt scaffolds and avoid editing earlier turns to preserve cacheability.
  • Benchmarks or field data show measurable changes in success rate, latency, and token usage when switching reasoning modes or effort levels for debugging and coding tasks.
  • Widespread deployment of tooling guardrails such as parsing validation, execution sandboxing, and tool-call observability to reduce loop failures and incidents.

What would kill

  • Caching discounts are rare, inconsistent, or too constrained to influence agent design, making prompt immutability economically irrelevant.
  • Reasoning modes add substantial cost and latency without consistent quality gains on representative agent tasks, reducing willingness to use adjustable effort.
  • Tool-call loops exhibit high real-world failure or incident rates without effective mitigations, limiting practical adoption of orchestration-heavy agent architectures.

Sources

  1. 2026-03-16 simonwillison.net