Agent Minimal Architecture (Harness + Prompt + Tools Loop)
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:58
Key takeaways
- After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- Many coding agents expose numerous tools; the most powerful enable code execution, such as Bash and Python runners.
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
- Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.
Sections
Agent Minimal Architecture (Harness + Prompt + Tools Loop)
- After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- Coding agents typically prepend a hidden system prompt, often very long, that instructs the model how to behave and how to use its tools.
- A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
- Most of a coding agent can be described as an LLM plus a system prompt plus tools running in a loop.
- Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
Tokenization Drives Cost, Limits, And State Management
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
- Because providers charge for both input and output tokens, longer conversations become more expensive as the input token count grows.
- Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
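The cost implication of statelessness can be made concrete with a toy model. The token counter below is a crude whitespace-split stand-in for a real tokenizer, used only to show that replaying the whole transcript makes input size grow every turn:

```python
# The model is stateless, so every turn resends the whole transcript as input.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def transcript_to_prompt(messages):
    # Chat is just a specially formatted completion prompt.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"

messages = []
input_tokens_per_turn = []
for turn in ["fix the bug", "now add a test", "now refactor"]:
    messages.append({"role": "user", "content": turn})
    prompt = transcript_to_prompt(messages)      # full replay every time
    input_tokens_per_turn.append(count_tokens(prompt))
    messages.append({"role": "assistant", "content": "done with: " + turn})

print(input_tokens_per_turn)  # monotonically increasing
```

Since providers bill per input token, this monotonic growth is exactly why long agent sessions get progressively more expensive.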
Tool Power Expands Capabilities And Enlarges Risk Surface
- After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- Many coding agents expose numerous tools; the most powerful enable code execution, such as Bash and Python runners.
- Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
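A sketch of the extraction step, under an assumed calling format (the `<tool>...</tool>` wrapper and the tool registry here are illustrative conventions, not any provider's actual protocol). It also shows why execution tools enlarge the risk surface: the harness runs whatever code the model wrote.

```python
import contextlib
import io
import json
import re

# Assumed calling format: the model embeds a tool call in its output, e.g.
#   <tool>{"name": "python", "args": {"code": "print(2+2)"}}</tool>
TOOL_CALL_RE = re.compile(r"<tool>(\{.*?\})</tool>", re.DOTALL)

def run_python(code: str) -> str:
    # Executes model-written code: the riskiest class of tool.
    # A real harness would sandbox this (container, VM, restricted env).
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

TOOLS = {"python": lambda args: run_python(args["code"])}

def extract_and_execute(model_output: str):
    """Find a tool call in the model's output and run it; None if absent."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return TOOLS[call["name"]](call["args"])

result = extract_and_execute(
    'Let me check: <tool>{"name": "python", "args": {"code": "print(2+2)"}}</tool>'
)
# result == "4\n"
```

The result would then be fed back to the model in a follow-up prompt, closing the loop.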
Reasoning Modes Trade Higher Compute For Better Outcomes On Complex Tasks
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
- Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
- Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
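How a harness might surface the effort knob can be sketched as below. The `reasoning.effort` field and the `"low"/"medium"/"high"` levels are assumed shapes for illustration; real parameter names and levels vary by provider.

```python
# Hypothetical request payloads showing an adjustable reasoning-effort level.
def build_request(task: str, effort: str) -> dict:
    return {
        "messages": [{"role": "user", "content": task}],
        "reasoning": {"effort": effort},  # assumed parameter name
    }

# A harness might escalate effort for harder problems or after failed attempts,
# spending more tokens on intermediate problem-solving text each retry.
ESCALATION = ["low", "medium", "high"]

def requests_with_escalation(task: str):
    return [build_request(task, effort) for effort in ESCALATION]

reqs = requests_with_escalation("debug the flaky test")
print([r["reasoning"]["effort"] for r in reqs])  # ['low', 'medium', 'high']
```

The trade-off is explicit in the transcript: higher effort buys more intermediate tokens (and latency) in exchange for better odds on complex tasks like debugging.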
Prefix Caching As A Cost/Latency Lever And A UX Constraint
- Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.
- Coding agents often avoid modifying earlier conversation content so that the shared prompt prefix stays cacheable.
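A toy cache illustrates why append-only transcripts matter. Real providers cache blocks of *tokens* server-side under their own keys and validity windows; the character-block hashing below is purely illustrative:

```python
import hashlib

cache = {}

def cached_prefix_chars(prompt: str, block: int = 16) -> int:
    """Return how many leading characters of `prompt` hit the cache,
    caching fixed-size prefix blocks (providers cache token blocks)."""
    hits = 0
    for i in range(block, len(prompt) + 1, block):
        key = hashlib.sha256(prompt[:i].encode()).hexdigest()
        if key in cache:
            hits = i          # this prefix was computed before
        else:
            cache[key] = True  # store it for future requests
    return hits

system = "SYSTEM PROMPT " * 8            # long, stable prefix
first = system + "user: fix the bug"
second = system + "user: add a test"     # same prefix, new suffix

cached_prefix_chars(first)               # cold: populates the cache
reuse = cached_prefix_chars(second)      # warm: the shared prefix hits
print(reuse >= len(system) - 16)         # the system prompt was reused
```

Editing anything near the start of the transcript changes every block after it, which is why agents append rather than rewrite earlier turns.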
Unknowns
- What are the concrete pricing schedules (input/output token rates, context-window limits) for the providers/models relevant to the intended agent workload?
- How large are cached-prefix discounts in practice, what are the cache validity conditions (time window, prefix similarity), and how sensitive is billing to small prefix edits?
- How often do vendors change hidden system prompts, and what regression/observability mechanisms exist to detect behavior changes when they do?
- What is the quantitative impact of reasoning modes (success rate, latency, token consumption) across representative coding tasks, especially debugging?
- What sandboxing and credential-handling controls are used when agents have Bash/Python execution tools, and what failure/abuse cases are most common?