Rosa Del Mar

Daily Brief

Issue 75 2026-03-16

Tokenization And Statelessness Drive Context Management And Unit Economics

7 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-03-17 15:15

Key takeaways

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • In multimodal setups, vision inputs are converted into tokens and processed similarly to text rather than via a separate OCR step.
  • Many coding agents expose numerous tools, including powerful code-execution tools such as Bash and Python runners.

Sections

Tokenization And Statelessness Drive Context Management And Unit Economics

  • LLMs operate on integer tokens rather than words, and providers price and limit usage based on tokens processed per request.
  • LLMs are stateless, so maintaining a conversation requires replaying the entire prior transcript in each new prompt.
  • Providers charge for both input and output tokens, so longer conversations become more expensive as the input transcript grows.
  • Chat interfaces can be implemented as specially formatted completion prompts that simulate a conversation.
  • Providers may discount cached input tokens when a shared prompt prefix is reused soon after.
  • Coding agents often avoid modifying earlier conversation content to maximize token-cache reuse efficiency.
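The replay-and-pricing mechanics above can be sketched in a few lines of Python. The whitespace tokenizer and per-token prices below are placeholders, not any provider's actual tokenizer or schedule; real models use subword tokenization, so counts are illustrative only.

```python
# Minimal sketch of stateless chat replay: every turn resends the full
# transcript, so input-token counts (and cost) grow with history length.

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: real models map text to integer token IDs."""
    return len(text.split())

class StatelessChat:
    PRICE_PER_INPUT_TOKEN = 0.000002   # hypothetical rates, not a real
    PRICE_PER_OUTPUT_TOKEN = 0.000008  # provider's price schedule

    def __init__(self, system_prompt: str):
        self.transcript = [("system", system_prompt)]
        self.total_cost = 0.0

    def send(self, user_message: str, fake_reply: str) -> None:
        # The model is stateless: the whole prior transcript is replayed
        # as input on every request, so input tokens grow each turn.
        self.transcript.append(("user", user_message))
        input_tokens = sum(count_tokens(t) for _, t in self.transcript)
        output_tokens = count_tokens(fake_reply)
        self.total_cost += (input_tokens * self.PRICE_PER_INPUT_TOKEN
                            + output_tokens * self.PRICE_PER_OUTPUT_TOKEN)
        self.transcript.append(("assistant", fake_reply))

chat = StatelessChat("You are a helpful assistant.")
chat.send("Summarise tokenization in one line.",
          "Models read integer tokens, not words.")
chat.send("And why does that matter for cost?",
          "You pay per token, input and output.")
```

Because the transcript is replayed verbatim, a stable prefix is also what makes cached-input discounts possible when a provider offers them.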

Agent Architecture As Orchestration (Harness + System Prompt + Tools + Loop)

  • After an LLM emits a tool call, the harness extracts and executes it and then feeds the tool result back to the model in a follow-up prompt.
  • Coding agents typically prepend a hidden system prompt, often very long, that instructs the model how to behave and how to use tools.
  • A coding agent is a software harness around an LLM that extends it using hidden prompts and callable tools.
  • In an agent, tools are functions exposed by the harness that the LLM can invoke using a specified calling format embedded in its output.
  • A large portion of a coding agent can be described as an LLM plus a system prompt plus tools running in a loop.
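The "LLM plus system prompt plus tools in a loop" pattern above can be sketched as follows. The JSON tool-calling format and the stub model are assumptions for illustration; real harnesses use each provider's own calling convention and a live model.

```python
# Hedged sketch of an agent harness: parse a tool call from model output,
# execute it, feed the result back, and repeat until a final answer.
import json

TOOLS = {
    "add": lambda a, b: a + b,  # tools are plain functions the harness exposes
}

def stub_model(messages):
    """Stand-in for an LLM call. First turn: request a tool; then answer."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return "The answer is 5."

def agent_loop(user_prompt, max_steps=5):
    messages = [
        {"role": "system", "content": "Use tools when helpful."},  # hidden prompt
        {"role": "user", "content": user_prompt},
    ]
    for _ in range(max_steps):
        reply = stub_model(messages)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # no tool call: treat as the final answer
        # The harness extracts the call, executes the tool, and feeds the
        # result back to the model in a follow-up prompt.
        result = TOOLS[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "step limit reached"

print(agent_loop("What is 2 + 3?"))  # → The answer is 5.
```

The `max_steps` bound matters in practice: it is what keeps a misparsing or looping model from running tools indefinitely.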

Capability Knobs And Tradeoffs (Reasoning Effort, Latency, Cost, Debugging Utility)

  • Reasoning modes introduced in 2025 allocate extra time and tokens to generate intermediate problem-solving text before producing the final answer.
  • Many coding agents let users adjust the reasoning effort level to encourage more computation on harder problems.
  • Reasoning is particularly helpful for debugging because it supports navigating complex code paths while interleaving tool calls to trace issues.
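One way the effort knob could be modeled is as a budget of intermediate "thinking" tokens. The effort names, budgets, and latency estimate below are hypothetical; real APIs expose their own parameters and pricing.

```python
# Sketch of a reasoning-effort knob: higher effort buys a larger budget of
# intermediate reasoning tokens before the final answer, trading latency
# and token spend for output quality.

EFFORT_BUDGETS = {"low": 512, "medium": 4096, "high": 16384}  # hypothetical

def request_params(prompt: str, effort: str = "medium") -> dict:
    """Build a request whose reasoning budget scales with effort level."""
    budget = EFFORT_BUDGETS[effort]
    return {
        "prompt": prompt,
        "max_reasoning_tokens": budget,        # spent on intermediate text
        "est_extra_latency_s": budget / 2000,  # rough: more tokens, more wait
    }

params = request_params("Why does this test fail intermittently?", effort="high")
```

The tradeoff in the bullets above falls out directly: a debugging session at high effort costs more tokens and seconds per turn, which only pays off if it cuts failed attempts.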

Multimodal Inputs Treated As Tokens; Implications For Cost/Latency Budgeting

  • In multimodal setups, vision inputs are converted into tokens and processed similarly to text rather than via a separate OCR step.
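A hedged sketch of budgeting image inputs as tokens, assuming a hypothetical tiling scheme; real providers use their own patching schemes and per-tile rates, so these numbers are illustrative only.

```python
# Images are billed as tokens, so they compete for the same context window
# and budget as text. Tile size and tokens-per-tile below are assumptions.
import math

TOKENS_PER_TILE = 170  # hypothetical
TILE_PX = 512          # hypothetical tile edge in pixels

def image_tokens(width_px: int, height_px: int) -> int:
    """Estimate the token cost of one image under the assumed tiling."""
    tiles = math.ceil(width_px / TILE_PX) * math.ceil(height_px / TILE_PX)
    return tiles * TOKENS_PER_TILE

screenshot = image_tokens(1920, 1080)  # 4 x 3 tiles -> 2040 tokens
```

Under these assumptions, a single full-screen screenshot costs as much context as roughly a couple of thousand words of text, which is why screenshot-heavy agent sessions need explicit cost and latency budgeting.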

Risk Surface Expands With Code-Execution Tools

  • Many coding agents expose numerous tools, including powerful code-execution tools such as Bash and Python runners.
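A minimal sketch of one common mitigation: an allowlist plus an approval hook in front of a command-execution tool. The allowlist and hook here are illustrative policy, not a real security boundary; production harnesses rely on sandboxes and permission prompts.

```python
# Guarded command tool: known-safe commands run; anything else requires
# explicit approval via a confirmation callback.
import shlex

SAFE_COMMANDS = {"ls", "cat", "grep", "python"}  # hypothetical allowlist

def run_tool(command: str, confirm=lambda cmd: False) -> str:
    argv = shlex.split(command)
    if not argv:
        return "error: empty command"
    if argv[0] not in SAFE_COMMANDS and not confirm(command):
        return f"blocked: {argv[0]} requires user approval"
    # A real harness would execute this inside a sandbox, e.g.
    # subprocess.run(argv, capture_output=True, timeout=30)
    return f"would run: {command}"

print(run_tool("rm -rf /"))  # blocked without approval
print(run_tool("ls -la"))    # allowed by the allowlist
```

The design choice is the default: commands outside the allowlist are denied unless the user (or a policy engine) affirmatively approves, which bounds the blast radius of a misbehaving model.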

Unknowns

  • What are the concrete token pricing schedules, context window limits, and tokenization behaviors (including for images) for the providers/models relevant to the intended deployments?
  • Under what exact conditions do providers apply cached-token discounts, and what is the realized cache hit rate in typical coding-agent sessions?
  • How do different context-management strategies (full replay, summarization, retrieval/memory) affect task success, cost, and failure modes in long agent sessions?
  • What is the real-world reliability of tool-call parsing/execution loops (parse errors, tool misuse, cascading failures), and which observability metrics best detect these issues early?
  • What sandboxing/permission models are in place for code-execution tools, and what is the incident rate for destructive commands, data exfiltration, or credential leakage?

Investor overlay

Read-throughs

  • Token-based pricing plus stateless replay makes unit economics hinge on context management and caching. Products that enforce stable, immutable prompt prefixes could show better gross margins and lower inference cost per task, if cache discounts are material and hit rates are high.
  • Agent value may shift from model choice to orchestration quality. Vendors with robust harness design, reliable tool-call execution, and strong observability could reduce failure rates and rework, lowering support costs and improving task success in long sessions.
  • Reasoning-effort settings trade latency and token spend for output quality and debugging utility. Vendors that tune effort policies well could improve task success per dollar, but only if incremental tokens materially reduce failures and tool-loop churn.

What would confirm

  • Provider disclosures or customer telemetry showing concrete token pricing, context-window limits, and meaningful cached-token discounts, alongside measured cache hit rates in real coding-agent sessions.
  • Benchmarked comparisons of context strategies showing that summarization or retrieval maintains or improves task success while reducing total tokens and avoiding new failure modes in long-running agent loops.
  • Operational metrics showing low tool-call parse-error rates, low tool misuse, and early-warning observability signals that correlate with reduced cascading failures and lower incident-response burden.

What would kill

  • Pricing or technical constraints that make cached-token discounts rare or immaterial, with low cache hit rates even when prompt prefixes are stabilized, preventing the expected unit-cost improvements.
  • Evidence that summarization- or retrieval-driven context management materially increases failure rates or degrades task outcomes versus full replay, negating cost savings through rework or repeated runs.
  • High-frequency security incidents or destructive commands in code-execution tools, or weak sandboxing and permission models that force disabling powerful tools, reducing agent usefulness and adoption.

Sources

  1. 2026-03-16 simonwillison.net