Tokenization And Statelessness Drive Context Management And Unit Economics
Sources: 1 • Confidence: Medium • Updated: 2026-03-17 15:15
Key takeaways
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
- In multimodal setups, vision inputs are converted into tokens and processed similarly to text rather than via a separate OCR step.
- Many coding agents expose numerous tools, including powerful tools that enable code execution such as Bash and Python runners.
Sections
Tokenization And Statelessness Drive Context Management And Unit Economics
- LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
- LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
- Providers charge for both input and output tokens, so longer conversations become more expensive as the input transcript grows.
- Chat interfaces can be implemented as specially formatted completion prompts that simulate a conversation.
- Providers may discount cached input tokens when a shared prompt prefix is reused soon after.
- Coding agents often append new content rather than modify earlier turns, so cached prompt prefixes stay valid and token-cache reuse stays high.
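The statelessness and chat-as-completion points above can be sketched together. This is a toy illustration, not any provider's API: `count_tokens` is a crude whitespace proxy for a real tokenizer, and the prompt format is invented.

```python
# Sketch: a stateless "chat" rendered as a completion prompt, with the full
# transcript replayed on every turn. All names and formats are illustrative.

def count_tokens(text: str) -> int:
    # Crude proxy: real tokenizers map text to integer token IDs (e.g. via BPE).
    return len(text.split())

def build_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    # A conversation simulated as a specially formatted completion prompt.
    lines = [f"SYSTEM: {system}"]
    for role, text in turns:
        lines.append(f"{role.upper()}: {text}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)

system = "You are a helpful coding assistant."
turns: list[tuple[str, str]] = []
input_costs = []
for user_msg, reply in [("hello", "hi there"), ("fix my bug", "which file?")]:
    turns.append(("user", user_msg))
    prompt = build_prompt(system, turns)   # entire prior transcript replayed
    input_costs.append(count_tokens(prompt))
    turns.append(("assistant", reply))

print(input_costs)  # [10, 17] -- input tokens grow as the transcript grows
```

Because the shared prefix (system prompt plus earlier turns) is identical across turns, a provider that discounts cached input tokens would charge full price mostly for the newly appended suffix.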
Agent Architecture As Orchestration (Harness + System Prompt + Tools + Loop)
- After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
- Coding agents typically prepend a hidden system prompt, often very long, that instructs the model how to behave and how to use its tools.
- A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
- In an agent, tools are functions exposed by the harness that the LLM can invoke using a specified calling format embedded in its output.
- A large portion of a coding agent can be described as an LLM plus a system prompt plus tools running in a loop.
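The "LLM plus system prompt plus tools in a loop" description can be made concrete with a minimal sketch. The model here is a canned stub, and the `TOOL_CALL` calling format is invented for illustration; a real harness would parse whatever format its model was trained to emit.

```python
import json
import re

# Tools are plain functions the harness exposes; the model "invokes" one by
# emitting a line in the agreed calling format. Names are illustrative.
TOOLS = {
    "read_file": lambda args: f"contents of {args['path']}",
}

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: emits one tool call, then a final answer.
    if "TOOL_RESULT" not in prompt:
        return 'TOOL_CALL {"name": "read_file", "args": {"path": "main.py"}}'
    return "FINAL: the bug is on line 3"

def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"SYSTEM: emit TOOL_CALL {{...}} to use a tool.\nUSER: {task}\n"
    for _ in range(max_steps):
        out = fake_llm(prompt)
        match = re.match(r"TOOL_CALL (\{.*\})", out)
        if match:
            call = json.loads(match.group(1))
            result = TOOLS[call["name"]](call["args"])
            # Feed the tool result back to the model in a follow-up prompt.
            prompt += f"{out}\nTOOL_RESULT: {result}\n"
        else:
            return out.removeprefix("FINAL: ")
    return "step limit reached"

print(run_agent("find the bug"))
```

The loop is the whole architecture in miniature: extract a tool call from model output, execute it, append the result, and call the model again until it stops requesting tools.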
Capability Knobs And Tradeoffs (Reasoning Effort, Latency, Cost, Debugging Utility)
- Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
- Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
- Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
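One way to picture the effort knob is as a preset that trades a reasoning-token budget against latency and cost. The field names and numbers below are hypothetical, not any provider's real API schema.

```python
# Hypothetical reasoning-effort presets; budgets and latencies are made up
# to illustrate the tradeoff, not taken from any real provider.
EFFORT_PRESETS = {
    "low":    {"max_reasoning_tokens": 1_000,  "typical_latency_s": 2},
    "medium": {"max_reasoning_tokens": 8_000,  "typical_latency_s": 10},
    "high":   {"max_reasoning_tokens": 32_000, "typical_latency_s": 45},
}

def build_request(prompt: str, effort: str = "medium") -> dict:
    preset = EFFORT_PRESETS[effort]
    return {
        "prompt": prompt,
        # A larger budget lets the model emit more intermediate reasoning
        # text before the final answer, at higher token cost and latency.
        "max_reasoning_tokens": preset["max_reasoning_tokens"],
    }

req = build_request("why does this test flake?", effort="high")
print(req["max_reasoning_tokens"])  # 32000
```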
Multimodal Inputs Treated As Tokens; Implications For Cost/Latency Budgeting
- In multimodal setups, vision inputs are converted into tokens and processed similarly to text rather than via a separate OCR step.
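Since vision inputs become tokens, image cost can be budgeted the same way as text. A common pattern is a per-tile estimate; the tile size and token counts below are illustrative placeholders, and real values vary by model and provider.

```python
import math

# Back-of-envelope image token estimate for cost/latency budgeting.
# TILE_PX, TOKENS_PER_TILE, and BASE_TOKENS are illustrative, not real constants.
TILE_PX = 512
TOKENS_PER_TILE = 170
BASE_TOKENS = 85

def estimate_image_tokens(width_px: int, height_px: int) -> int:
    # Images are typically cut into fixed-size tiles, each costing a fixed
    # number of tokens, plus a small base overhead.
    tiles = math.ceil(width_px / TILE_PX) * math.ceil(height_px / TILE_PX)
    return BASE_TOKENS + tiles * TOKENS_PER_TILE

print(estimate_image_tokens(1024, 768))  # 4 tiles -> 765 tokens
```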
Risk Surface Expands With Code-Execution Tools
- Many coding agents expose numerous tools, including powerful tools that enable code execution such as Bash and Python runners.
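The risk from code-execution tools motivates guardrails at the harness layer. Below is a minimal sketch of a guarded shell tool using an allow-list and a timeout; a production sandbox needs far more (containers, filesystem and network isolation, permission prompts), and all names here are illustrative.

```python
import shlex
import subprocess

# Illustrative allow-list; a real agent's policy would be far more nuanced.
ALLOWED = {"ls", "cat", "echo", "python3"}

def run_command(cmd: str, timeout_s: int = 10) -> str:
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWED:
        # Refuse anything not explicitly allowed (e.g. rm, curl, ssh).
        return f"refused: '{argv[0] if argv else ''}' is not on the allow-list"
    proc = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    return proc.stdout + proc.stderr

print(run_command("echo hello"))   # runs: on the allow-list
print(run_command("rm -rf /"))     # refused: destructive, not allowed
```

An allow-list fails closed, which matters here: the tool's default answer to an unrecognized command is refusal, not execution.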
Unknowns
- What are the concrete token pricing schedules, context window limits, and tokenization behaviors (including for images) for the providers/models relevant to the intended deployments?
- Under what exact conditions do providers apply cached-token discounts, and what is the realized cache hit rate in typical coding-agent sessions?
- How do different context-management strategies (full replay, summarization, retrieval/memory) affect task success, cost, and failure modes in long agent sessions?
- What is the real-world reliability of tool-call parsing/execution loops (parse errors, tool misuse, cascading failures), and which observability metrics best detect these issues early?
- What sandboxing/permission models are in place for code-execution tools, and what is the incident rate for destructive commands, data exfiltration, or credential leakage?