Rosa Del Mar

Daily Brief

Issue 75 2026-03-16

Agent Architecture As Orchestration Around An LLM

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:25

Key takeaways

  • After an LLM emits a tool call, the harness extracts and executes the call, then feeds the result back to the model in a follow-up prompt.
  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
  • Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
  • Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.

Sections

Agent Architecture As Orchestration Around An LLM

  • After an LLM emits a tool call, the harness extracts and executes the call, then feeds the result back to the model in a follow-up prompt.
  • Coding agents typically prepend a hidden system prompt, often very long, that instructs the model how to behave and how to use tools.
  • A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
  • In coding agents, tools are functions exposed by the harness that the LLM can invoke using a specified calling format embedded in its output.
  • A minimal description of most coding agents is an LLM plus a system prompt plus tools running in a loop.
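
The "LLM plus system prompt plus tools in a loop" description above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's API: the model here is a stub that emits one tool call and then answers from the result, and the JSON calling format and `read_file` tool are invented for the example.

```python
import json

# Hidden system prompt: tells the model how to behave and how to emit tool calls.
SYSTEM_PROMPT = 'You are a coding agent. To call a tool, emit JSON: {"tool": ..., "args": ...}'

# Tools: functions exposed by the harness that the model can invoke.
TOOLS = {
    "read_file": lambda args: f"contents of {args['path']}",
}

def stub_model(messages):
    # Stand-in for a provider API call: asks for a file on the first turn,
    # then produces a final answer once a tool result appears in the transcript.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if tool_results:
        return "The file contains: " + tool_results[-1]["content"]
    return json.dumps({"tool": "read_file", "args": {"path": "main.py"}})

def run_agent(user_prompt, model=stub_model, max_turns=5):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt}]
    output = ""
    for _ in range(max_turns):
        output = model(messages)
        try:
            call = json.loads(output)        # did the model emit a tool call?
        except ValueError:
            return output                    # plain text: treat as final answer
        result = TOOLS[call["tool"]](call["args"])   # harness executes the tool
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": result})  # feed result back
    return output
```

The loop terminates either when the model stops emitting tool calls or when a turn budget is exhausted, which is also where real harnesses add parsing validation and sandboxing.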

Tokenization, Statelessness, And Cost Growth With Context Length

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
  • Providers charge for both input and output tokens, so longer conversations become more expensive as input token count grows.
  • Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
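
The cost-growth point can be made concrete with a toy calculation. This sketch uses a crude words-as-tokens proxy (real tokenizers split subwords, so counts differ) and an invented transcript; the mechanism it shows is that, because the model is stateless, every turn replays the whole prior transcript as input.

```python
def count_tokens(text):
    # Crude proxy: one token per whitespace-separated word.
    return len(text.split())

def turn_input_tokens(transcript):
    # Each new request re-sends every prior message as input.
    return sum(count_tokens(msg) for msg in transcript)

transcript = []
input_tokens_per_turn = []
for turn in ["fix the bug in parse()", "now add a test", "refactor for clarity"]:
    transcript.append(turn)
    input_tokens_per_turn.append(turn_input_tokens(transcript))
# Input tokens grow monotonically because earlier turns are replayed every time.
```

In a real deployment assistant replies are replayed too, so input growth is even steeper than this sketch suggests.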

Reasoning Modes As Adjustable Compute For Harder Tasks

  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
  • Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
  • Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
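
How an agent might expose adjustable effort can be sketched as a request builder. The field name `reasoning_effort` and the low/medium/high scale are assumptions for illustration; actual parameter names and value ranges vary by provider.

```python
def build_request(prompt, effort="medium"):
    # Hypothetical request shape: escalate effort (and thus cost/latency)
    # for harder tasks such as debugging, keep it low for simple edits.
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed provider parameter name
    }

# A debugging task might warrant high effort:
req = build_request("trace why test_login fails intermittently", effort="high")
```

The design point is that effort becomes a per-task policy knob rather than a fixed model property.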

Prompt-Prefix Caching As A Cost/Latency Lever

  • Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
  • Coding agents often avoid modifying earlier conversation content so that the shared prompt prefix stays cacheable across turns.
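
Why immutability of earlier turns matters for caching can be shown with a simplified prefix check. Real providers have their own cache-hit rules (time windows, minimum prefix lengths), which the Unknowns section flags as unconfirmed; this sketch only illustrates that editing any early message invalidates everything after it.

```python
def shared_prefix_len(old_msgs, new_msgs):
    # Count leading messages that are byte-identical between two prompts;
    # a prefix cache can only discount tokens within this shared run.
    n = 0
    for a, b in zip(old_msgs, new_msgs):
        if a != b:
            break
        n += 1
    return n

prev = ["system: you are a coding agent",
        "user: fix the bug",
        "assistant: done"]

appended = prev + ["user: now add a test"]  # append-only: entire prior prompt reused
edited = ["system: you are a helpful agent"] + prev[1:] + ["user: now add a test"]
```

Appending keeps all three prior messages cacheable; rewording just the system prompt drops the shared prefix to zero, which is why agents tend to treat earlier turns as immutable.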

Multimodal Inputs As Token Streams

  • Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.
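
A vision input typically travels in the same message structure as text, which is why the model can attend to it directly rather than routing through OCR. The field names (`type`, `text`, `image`, `data`) below are illustrative assumptions; exact shapes differ per API.

```python
import base64

def image_message(image_bytes, question):
    # Hypothetical multimodal message: image bytes are sent inline and the
    # provider converts them into tokens processed alongside the text tokens.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image", "data": base64.b64encode(image_bytes).decode()},
        ],
    }

msg = image_message(b"\x89PNG...", "What error does this screenshot show?")
```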

Unknowns

  • Which specific providers/models offer cached-token discounts, and what are the precise cache-hit rules (time windows, prefix matching tolerance, system-prompt inclusion)?
  • What are the actual token-based pricing schedules (input vs output), context window limits, and how they vary across providers for representative coding-agent traces?
  • How often do tool-call loops fail in practice (parsing errors, execution errors, incorrect tool selection), and what observability/guardrails are used to mitigate them?
  • How are code-execution tools sandboxed (filesystem, network, credentials), and what is the incident profile for destructive commands or data exfiltration in real deployments?
  • What is the empirical effect of reasoning modes and adjustable effort on success rate, latency, and tokens for debugging and other coding tasks?

Investor overlay

Read-throughs

  • Orchestration layers and tool harnesses may be key value capture in agent products, shifting differentiation from base models to tooling, prompt policy, and observability.
  • Token economics and context replay make cost per turn rise with conversation length, favoring product designs that compress memory, stabilize prompts, or limit context growth.
  • Adjustable reasoning effort and prompt prefix caching create new levers to trade cost and latency for quality, implying spending and UX will become policy-tunable per task.

What would confirm

  • Providers publish or standardize cached-token discount rules, leading developers to adopt stable prompt scaffolds and avoid editing earlier turns to preserve cacheability.
  • Benchmarks or field data show measurable changes in success rate, latency, and token usage when switching reasoning modes or effort levels for debugging and coding tasks.
  • Widespread deployment of tooling guardrails such as parsing validation, execution sandboxing, and tool-call observability to reduce loop failures and incidents.

What would kill

  • Caching discounts are rare, inconsistent, or too constrained to influence agent design, making prompt immutability economically irrelevant.
  • Reasoning modes add substantial cost and latency without consistent quality gains on representative agent tasks, reducing willingness to use adjustable effort.
  • Tool-call loops exhibit high real-world failure or incident rates without effective mitigations, limiting practical adoption of orchestration-heavy agent architectures.

Sources

  1. 2026-03-16 simonwillison.net