Rosa Del Mar

Daily Brief

Issue 75 2026-03-16

Agent Minimal Architecture (Harness + Prompt + Tools Loop)

7 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:58

Key takeaways

  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • Many coding agents expose numerous tools; the most powerful ones, such as Bash and Python runners, enable code execution.
  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.

Sections

Agent Minimal Architecture (Harness + Prompt + Tools Loop)

  • A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
  • Most of a coding agent can be described as an LLM plus a system prompt plus tools running in a loop.
  • Coding agents typically prepend a hidden system prompt that instructs the model how to behave and how to use tools, and it can be very long.
  • Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
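The loop described above can be sketched in a few lines of Python. This is a toy harness with a stubbed model and a fake `bash` tool (both hypothetical stand-ins); a real agent would call a provider API and parse its provider-specific tool-call format:

```python
import json

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: any chat API would do).
    Emits a tool call on the first turn, then a final answer once a tool
    result appears in the prompt."""
    if "TOOL_RESULT" in prompt:
        return "The repository has 3 Python files."
    return json.dumps({"tool": "bash", "args": {"cmd": "ls *.py | wc -l"}})

def run_bash(cmd: str) -> str:
    """Hypothetical tool implementation; a real harness would sandbox this."""
    return "3"

TOOLS = {"bash": run_bash}

def agent_loop(user_request: str, max_turns: int = 5) -> str:
    prompt = f"USER: {user_request}"
    for _ in range(max_turns):
        output = fake_llm(prompt)
        try:
            call = json.loads(output)        # model asked to use a tool
        except json.JSONDecodeError:
            return output                    # plain text: final answer
        result = TOOLS[call["tool"]](**call["args"])
        # Feed the tool result back in a follow-up prompt, as described above.
        prompt += f"\nTOOL_RESULT: {result}"
    return "max turns exceeded"
```

The loop terminates when the model emits plain text instead of a tool call; real harnesses also cap turns and validate tool arguments.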

Tokenization Drives Cost, Limits, And State Management

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
  • Because providers charge for both input and output tokens, longer conversations become more expensive as the input token count grows.
  • Chat interfaces are implemented as specially formatted completion prompts that simulate a conversation.
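A minimal sketch of chat-as-completion and the replay cost it implies. The role tags and the characters-per-token ratio here are illustrative assumptions, not any provider's real turn format or tokenizer:

```python
def render_chat(history):
    """Flatten a conversation into one completion prompt (chat-as-prompt).
    Real providers use their own special tokens for turn boundaries."""
    lines = [f"{role.upper()}: {text}" for role, text in history]
    lines.append("ASSISTANT:")  # cue the model to continue as the assistant
    return "\n".join(lines)

def rough_token_count(text):
    """Crude proxy: ~4 characters per token (a common rule of thumb)."""
    return len(text) // 4

history = [("user", "What is a coding agent?")]
history.append(("assistant", "A harness around an LLM with prompts and tools."))
history.append(("user", "How do tools work?"))

# The model is stateless: every turn replays the full transcript,
# so input tokens (and cost) grow with conversation length.
prompt = render_chat(history)
```

Each new turn re-sends everything before it, which is why input-token spend compounds over a long session.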

Tool Power Expands Capabilities And Enlarges Risk Surface

  • After the LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • Many coding agents expose numerous tools; the most powerful ones, such as Bash and Python runners, enable code execution.
  • Tools are functions exposed by the agent harness that the LLM can invoke using a specified calling format embedded in its output.
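Because code-execution tools carry the most risk, a harness typically puts a guard in front of them. A toy sketch (the allow-list and `guarded_bash` helper are hypothetical; a real harness would add sandboxing, timeouts, and a restricted environment):

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep", "wc"}  # illustrative allow-list

def guarded_bash(cmd: str) -> str:
    """A thin guard in front of a bash tool: code execution is powerful,
    so the harness constrains what the model can run."""
    first = shlex.split(cmd)[0]
    if first not in ALLOWED_COMMANDS:
        raise PermissionError(f"command {first!r} is not on the allow-list")
    # A real implementation would execute this in a sandbox via subprocess,
    # with timeouts and stripped credentials; here we only echo the intent.
    return f"(would run: {cmd})"
```

An allow-list is the simplest control; the Unknowns below ask what stronger sandboxing and credential-handling controls look like in practice.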

Reasoning Modes Trade Higher Compute For Better Outcomes On Complex Tasks

  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
  • Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
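The compute-for-quality trade can be made concrete with toy arithmetic. All rates and token counts below are assumptions, not any provider's real pricing; reasoning tokens are assumed to bill at the output rate, which is a common convention:

```python
def turn_cost(input_tokens, reasoning_tokens, answer_tokens,
              in_rate=1.0, out_rate=4.0):
    """Cost in arbitrary units. Rates are illustrative assumptions;
    reasoning tokens are counted as output tokens."""
    return input_tokens * in_rate + (reasoning_tokens + answer_tokens) * out_rate

low_effort  = turn_cost(2000, 0,    300)  # no intermediate reasoning text
high_effort = turn_cost(2000, 4000, 300)  # thousands of "thinking" tokens first
```

The answer is the same length in both cases; the extra spend is entirely in intermediate reasoning tokens, which is why effort levels are worth tuning per task.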

Prefix Caching As A Cost/Latency Lever And A UX Constraint

  • Providers may discount cached input tokens when a shared prompt prefix is reused soon after, enabling infrastructure to reuse prior computations.
  • Coding agents often avoid modifying earlier conversation content to maximize token-cache reuse efficiency.
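A sketch of why append-only transcripts help. Message lists stand in for token streams here, and `common_prefix_messages` is an illustrative helper; actual cache granularity, validity windows, and discount rules vary by provider:

```python
def common_prefix_messages(old, new):
    """Count how many leading messages two transcripts share; cached-prefix
    discounts (where offered) apply roughly to this shared prefix."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

base = ["SYSTEM: you are a coding agent", "USER: fix the bug", "TOOL: diff applied"]
appended = base + ["USER: now add a test"]  # append-only: full prefix reused
edited = ["SYSTEM: you are a careful agent"] + base[1:] + ["USER: now add a test"]

# Appending preserves the whole prior transcript as a cacheable prefix;
# editing the first message invalidates the cache from the very start.
```

This is the UX constraint in the heading: features that rewrite earlier turns (editing a past message, re-summarizing history in place) break the prefix and forfeit the discount.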

Unknowns

  • What are the concrete pricing schedules (input/output token rates, context-window limits) for the providers/models relevant to the intended agent workload?
  • How large are cached-prefix discounts in practice, what are the cache validity conditions (time window, prefix similarity), and how sensitive is billing to small prefix edits?
  • How often do vendors change hidden system prompts, and what regression/observability mechanisms exist to detect behavior changes when they do?
  • What is the quantitative impact of reasoning modes (success rate, latency, token consumption) across representative coding tasks, especially debugging?
  • What sandboxing and credential-handling controls are used when agents have Bash/Python execution tools, and what failure/abuse cases are most common?

Investor overlay

Read-throughs

  • Token-based pricing and stateless replay make context management a cost driver, creating demand for tooling that compresses, summarizes, and manages long sessions to control marginal token spend.
  • Prefix caching discounts could reward stable prompt prefixes, implying value for agent designs and harnesses that minimize early prompt edits to preserve cacheability and reduce latency and cost.
  • Expanded tool access including Bash and Python runners increases the risk surface, implying demand for sandboxing, credential handling, and observability controls integrated into agent harnesses.

What would confirm

  • Providers publish concrete token pricing, context limits, and cached prefix discount rules, and buyers report meaningful savings when keeping prompt prefixes stable.
  • Benchmarks show reasoning modes materially improve success rate on complex coding and debugging tasks with quantifiable token and latency tradeoffs that users can tune.
  • Incidents or audits highlight failures from tool execution access, and platforms respond by shipping stronger sandboxing, credential isolation, and tool call validation in agent frameworks.

What would kill

  • Cached prefix discounts are negligible, unreliable, or too sensitive to minor edits, so prompt stability does not produce repeatable cost or latency benefits.
  • Reasoning modes do not improve outcomes enough versus added tokens and latency, reducing adoption for coding agents.
  • Tool execution risks are addressed primarily by restricting tools rather than improving controls, limiting the need for dedicated sandboxing and credential management layers.

Sources

  1. 2026-03-16 simonwillison.net