Agent Architecture As Orchestration Around An LLM
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:25
Key takeaways
- After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
- Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
- Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.
Sections
Agent Architecture As Orchestration Around An LLM
- After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- Coding agents typically prepend a hidden, often very long system prompt that instructs the model how to behave and how to use tools.
- A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
- In coding agents, tools are functions exposed by the harness that the LLM can invoke using a specified calling format embedded in its output.
- A minimal description of most coding agents is an LLM plus a system prompt plus tools running in a loop.
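The "LLM plus system prompt plus tools in a loop" description can be sketched as code. This is a minimal illustration, not a specific vendor's SDK: `llm` is an assumed stateless completion function, and the tool-call format (a JSON object with `tool`/`args` or `answer` keys) is hypothetical.

```python
import json

def run_agent(llm, tools, user_message, system_prompt):
    """Minimal agent loop under assumed interfaces: llm(messages) returns
    either {"tool": name, "args": {...}} or {"answer": text}."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    while True:
        reply = llm(messages)
        if "answer" in reply:
            return reply["answer"]
        # The harness, not the model, executes the tool call ...
        result = tools[reply["tool"]](**reply["args"])
        # ... then feeds the result back in a follow-up prompt.
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": str(result)})
```

Note that the model never runs anything itself; it only emits a structured request, and the harness decides whether and how to execute it.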
Tokenization, Statelessness, And Cost Growth With Context Length
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
- Providers charge for both input and output tokens, so longer conversations become more expensive as input token count grows.
- Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
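Because each request replays the full transcript, input-token cost grows faster than linearly over a session. A rough accounting sketch (the flat tokens-per-message figure is an illustrative assumption):

```python
def transcript_tokens(turns, tokens_per_message):
    """Total input tokens billed over a session when every request
    replays the whole prior transcript (stateless model)."""
    total_input = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_message   # new user message joins the transcript
        total_input += history          # full transcript sent as input
        history += tokens_per_message   # model's reply also joins the transcript
    return total_input

# With equal-sized messages this sums the odd numbers, i.e. grows as turns^2:
# transcript_tokens(10, 100) -> 10000 input tokens for only 2000 tokens of text
```

This quadratic growth is why long agent sessions get expensive even when each individual message is short.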
Reasoning Modes As Adjustable Compute For Harder Tasks
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing a final answer.
- Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
- Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
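An adjustable effort level typically maps to how many intermediate reasoning tokens the model may spend before answering. The sketch below is purely illustrative: the `reasoning` field, effort names, and token budgets are assumptions, not any specific provider's API.

```python
def build_request(prompt, effort="medium"):
    """Hypothetical request builder: higher effort allocates a larger
    budget of intermediate reasoning tokens (numbers are illustrative)."""
    budgets = {"low": 1_000, "medium": 8_000, "high": 32_000}
    return {
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"effort": effort, "max_reasoning_tokens": budgets[effort]},
    }
```

The trade-off is the same regardless of the exact parameter names: more reasoning tokens mean higher latency and cost per request in exchange for better performance on harder problems.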
Prompt-Prefix Caching As A Cost/Latency Lever
- Some providers discount cached input tokens when a shared prompt prefix is reused soon after.
- Coding agents often avoid modifying earlier conversation content, keeping the transcript append-only so the shared prefix stays identical across requests and cache hits are maximized.
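Why append-only transcripts matter can be seen by measuring the shared prefix between consecutive requests. This sketch assumes exact prefix matching on token IDs, which is a simplification; real cache-hit rules vary by provider (see Unknowns below).

```python
def cached_prefix_tokens(prev_tokens, new_tokens):
    """Length of the longest shared token prefix between two requests:
    the portion a provider could serve from cache under assumed
    exact-prefix-match rules."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Appending to the transcript keeps the entire previous prompt cacheable;
# editing an early message invalidates everything after the edit point.
```

So a harness that rewrites an earlier message, even trivially, forfeits the discount on every token from that point onward.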
Multimodal Inputs As Token Streams
- Multimodal vision inputs are converted into tokens and processed by the model similarly to text rather than via a separate OCR step.
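Conceptually, text and images both end up as entries in one token sequence. A hedged sketch, where both `text_tokenizer` and `image_encoder` are assumed interfaces standing in for a real tokenizer and vision encoder:

```python
def to_token_stream(parts, text_tokenizer, image_encoder):
    """Map mixed text/image message parts into a single token sequence
    that the model consumes uniformly (no separate OCR pass).
    Both callables are hypothetical stand-ins for real components."""
    tokens = []
    for part in parts:
        if part["type"] == "text":
            tokens.extend(text_tokenizer(part["text"]))
        else:  # "image"
            tokens.extend(image_encoder(part["data"]))
    return tokens
```

The key point is that once encoded, image tokens are not special-cased downstream: the model attends over them exactly as it does over text tokens.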
Unknowns
- Which specific providers/models offer cached-token discounts, and what are the precise cache-hit rules (time windows, prefix matching tolerance, system-prompt inclusion)?
- What are the actual token-based pricing schedules (input vs output), context window limits, and how they vary across providers for representative coding-agent traces?
- How often do tool-call loops fail in practice (parsing errors, execution errors, incorrect tool selection), and what observability/guardrails are used to mitigate them?
- How are code-execution tools sandboxed (filesystem, network, credentials), and what is the incident profile for destructive commands or data exfiltration in real deployments?
- What is the empirical effect of reasoning modes and adjustable effort on success rate, latency, and tokens for debugging and other coding tasks?