Rosa Del Mar

Daily Brief

Issue 75 2026-03-16

Agent Minimal Architecture (Harness + Prompt + Tools Loop)

7 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:58

Key takeaways

  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • Many coding agents expose numerous tools; the most powerful ones, such as Bash and Python runners, enable code execution.
  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.

Sections

Agent Minimal Architecture (Harness + Prompt + Tools Loop)

  • A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
  • Most of a coding agent can be described as an LLM plus a system prompt plus tools running in a loop.
  • Coding agents typically prepend a hidden system prompt that instructs the model how to behave and how to use tools, and it can be very long.
  • Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
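The loop described above can be sketched in a few lines of Python. This is a toy harness with a stubbed model and a fake `bash` tool (both hypothetical stand-ins); a real agent would call a provider API and parse its provider-specific tool-call format:

```python
import json

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: any chat API would do).
    Emits a tool call on the first turn, then a final answer once a tool
    result appears in the prompt."""
    if "TOOL_RESULT" in prompt:
        return "The repository has 3 Python files."
    return json.dumps({"tool": "bash", "args": {"cmd": "ls *.py | wc -l"}})

def run_bash(cmd: str) -> str:
    """Hypothetical tool implementation; a real harness would sandbox this."""
    return "3"

TOOLS = {"bash": run_bash}

def agent_loop(user_request: str, max_turns: int = 5) -> str:
    prompt = f"USER: {user_request}"
    for _ in range(max_turns):
        output = fake_llm(prompt)
        try:
            call = json.loads(output)        # model asked to use a tool
        except json.JSONDecodeError:
            return output                    # plain text: final answer
        result = TOOLS[call["tool"]](**call["args"])
        # Feed the tool result back in a follow-up prompt, as described above.
        prompt += f"\nTOOL_RESULT: {result}"
    return "max turns exceeded"
```

The loop terminates when the model emits plain text instead of a tool call; real harnesses also cap turns and validate tool arguments.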

Tokenization Drives Cost, Limits, And State Management

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
  • Because providers charge for both input and output tokens, longer conversations become more expensive as the input token count grows.
  • Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
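A minimal sketch of chat-as-completion and the replay cost it implies. The role tags and the characters-per-token ratio here are illustrative assumptions, not any provider's real turn format or tokenizer:

```python
def render_chat(history):
    """Flatten a conversation into one completion prompt (chat-as-prompt).
    Real providers use their own special tokens for turn boundaries."""
    lines = [f"{role.upper()}: {text}" for role, text in history]
    lines.append("ASSISTANT:")  # cue the model to continue as the assistant
    return "\n".join(lines)

def rough_token_count(text):
    """Crude proxy: ~4 characters per token (a common rule of thumb)."""
    return len(text) // 4

history = [("user", "What is a coding agent?")]
history.append(("assistant", "A harness around an LLM with prompts and tools."))
history.append(("user", "How do tools work?"))

# The model is stateless: every turn replays the full transcript,
# so input tokens (and cost) grow with conversation length.
prompt = render_chat(history)
```

Each new turn re-sends everything before it, which is why input-token spend compounds over a long session.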

Tool Power Expands Capabilities And Enlarges Risk Surface

  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • Many coding agents expose numerous tools; the most powerful ones, such as Bash and Python runners, enable code execution.
  • Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
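Because code-execution tools carry the most risk, a harness typically puts a guard in front of them. A toy sketch (the allow-list and `guarded_bash` helper are hypothetical; a real harness would add sandboxing, timeouts, and a restricted environment):

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep", "wc"}  # illustrative allow-list

def guarded_bash(cmd: str) -> str:
    """A thin guard in front of a bash tool: code execution is powerful,
    so the harness constrains what the model can run."""
    first = shlex.split(cmd)[0]
    if first not in ALLOWED_COMMANDS:
        raise PermissionError(f"command {first!r} is not on the allow-list")
    # A real implementation would execute this in a sandbox via subprocess,
    # with timeouts and stripped credentials; here we only echo the intent.
    return f"(would run: {cmd})"
```

An allow-list is the simplest control; the Unknowns below ask what stronger sandboxing and credential-handling controls look like in practice.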

Reasoning Modes Trade Higher Compute For Better Outcomes On Complex Tasks

  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
  • Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
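The compute-for-quality trade can be made concrete with toy arithmetic. All rates and token counts below are assumptions, not any provider's real pricing; reasoning tokens are assumed to bill at the output rate, which is a common convention:

```python
def turn_cost(input_tokens, reasoning_tokens, answer_tokens,
              in_rate=1.0, out_rate=4.0):
    """Cost in arbitrary units. Rates are illustrative assumptions;
    reasoning tokens are counted as output tokens."""
    return input_tokens * in_rate + (reasoning_tokens + answer_tokens) * out_rate

low_effort  = turn_cost(2000, 0,    300)  # no intermediate reasoning text
high_effort = turn_cost(2000, 4000, 300)  # thousands of "thinking" tokens first
```

The answer is the same length in both cases; the extra spend is entirely in intermediate reasoning tokens, which is why effort levels are worth tuning per task.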

Prefix Caching As A Cost/Latency Lever And A UX Constraint

  • Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.
  • Coding agents often avoid modifying earlier conversation content to maximize token-cache reuse efficiency.
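A sketch of why append-only transcripts help. Message lists stand in for token streams here, and `common_prefix_messages` is an illustrative helper; actual cache granularity, validity windows, and discount rules vary by provider:

```python
def common_prefix_messages(old, new):
    """Count how many leading messages two transcripts share; cached-prefix
    discounts (where offered) apply roughly to this shared prefix."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

base = ["SYSTEM: you are a coding agent", "USER: fix the bug", "TOOL: diff applied"]
appended = base + ["USER: now add a test"]  # append-only: full prefix reused
edited = ["SYSTEM: you are a careful agent"] + base[1:] + ["USER: now add a test"]

# Appending preserves the whole prior transcript as a cacheable prefix;
# editing the first message invalidates the cache from the very start.
```

This is the UX constraint in the heading: features that rewrite earlier turns (editing a past message, re-summarizing history in place) break the prefix and forfeit the discount.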

Unknowns

  • What are the concrete pricing schedules (input/output token rates, context-window limits) for the providers/models relevant to the intended agent workload?
  • How large are cached-prefix discounts in practice, what are the cache validity conditions (time window, prefix similarity), and how sensitive is billing to small prefix edits?
  • How often do vendors change hidden system prompts, and what regression/observability mechanisms exist to detect behavior changes when they do?
  • What is the quantitative impact of reasoning modes (success rate, latency, token consumption) across representative coding tasks, especially debugging?
  • What sandboxing and credential-handling controls are used when agents have Bash/Python execution tools, and what failure/abuse cases are most common?

Investor overlay

Read-throughs

  • Token-based pricing and stateless replay make context management a cost driver, creating demand for tooling that compresses, summarizes, and manages long sessions to control marginal token spend.
  • Prefix caching discounts could reward stable prompt prefixes, implying value for agent designs and harnesses that minimize early prompt edits to preserve cacheability and reduce latency and cost.
  • Expanded tool access including Bash and Python runners increases the risk surface, implying demand for sandboxing, credential handling, and observability controls integrated into agent harnesses.

What would confirm

  • Providers publish concrete token pricing, context limits, and cached prefix discount rules, and buyers report meaningful savings when keeping prompt prefixes stable.
  • Benchmarks show reasoning modes materially improve success rate on complex coding and debugging tasks with quantifiable token and latency tradeoffs that users can tune.
  • Incidents or audits highlight failures from tool execution access, and platforms respond by shipping stronger sandboxing, credential isolation, and tool call validation in agent frameworks.

What would kill

  • Cached prefix discounts are negligible, unreliable, or too sensitive to minor edits, so prompt stability does not produce repeatable cost or latency benefits.
  • Reasoning modes do not improve outcomes enough versus added tokens and latency, reducing adoption for coding agents.
  • Tool execution risks are addressed primarily by restricting tools rather than improving controls, limiting the need for dedicated sandboxing and credential management layers.

Sources

  1. 2026-03-16 simonwillison.net