Verification And Observability As Primary Controls For Agent Output
Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:50
Key takeaways
- Conformance-driven development is described as achievable by using an LLM to derive a shared test suite from multiple existing implementations and then implementing a new system to satisfy that suite.
- AI tool adoption for programmers tends to progress from asking chatbots questions to using coding agents that eventually write more code than the programmer does.
- Despite understanding the risks, Simon is described as often running Claude locally with permission safeguards disabled for convenience, mitigating partly by avoiding repositories that contain untrusted instructions.
- Low-quality agent output is described as partly a controllable choice because iteratively prompting the agent to refactor can yield code quality that exceeds what a time-constrained human would produce.
- AI-assisted programming is described as reducing demand for reusable UI component libraries because custom components can be generated on demand, while open source projects are described as being flooded with low-quality automated pull requests.
Sections
Verification And Observability As Primary Controls For Agent Output
- Conformance-driven development is described as achievable by using an LLM to derive a shared test suite from multiple existing implementations and then implementing a new system to satisfy that suite.
- Because agents can generate and iterate on tests at near-zero human cost, tests are described as effectively no longer optional in agent-assisted coding workflows.
- Starting agent coding sessions by telling the agent how to run the tests and instructing it to follow red-green TDD increases the likelihood of producing working code.
- Having agents perform manual end-to-end checks (such as starting the server and using curl) can catch failures that a passing automated test suite misses, including the server not booting.
- Simon built a tool called Showboat that records an agent’s manual testing steps into a Markdown document including commands run and their outputs.
- Testing against production user data is described as something to avoid in favor of agent-assisted mocking and synthetic data generation, which can create specific edge-case users on demand.
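The conformance-driven idea in the first bullet can be sketched in miniature. This is a hypothetical illustration, not the source's code: a shared suite of input/expected pairs (of the kind an LLM might derive from observing where existing implementations agree) is run against every implementation, including the new one.

```python
# Hypothetical sketch of conformance-driven development: a shared test
# suite (here, slugify behaviour) expressed once as input/expected pairs
# and run against each implementation, including the new one.

def slugify_v1(text: str) -> str:          # existing implementation A
    return "-".join(text.lower().split())

def slugify_v2(text: str) -> str:          # existing implementation B
    return text.strip().lower().replace(" ", "-")

def slugify_new(text: str) -> str:         # new implementation under test
    return "-".join(text.strip().lower().split())

# Conformance cases derived from behaviour A and B agree on:
CASES = [
    ("Hello World", "hello-world"),
    ("  Leading Space", "leading-space"),
]

def conforms(impl) -> bool:
    """True if the implementation matches every shared expected output."""
    return all(impl(given) == expected for given, expected in CASES)

for impl in (slugify_v1, slugify_v2, slugify_new):
    assert conforms(impl), impl.__name__
```

The new system is "done" when it satisfies the same suite the existing implementations satisfy, which is the acceptance criterion the summary describes.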
Shift From Assistance To Delegation In Programming
- AI tool adoption for programmers tends to progress from asking chatbots questions to using coding agents that eventually write more code than the programmer does.
- For familiar task classes, Simon reports high confidence that a strong model can reliably generate correct implementations such as a paginated JSON API against a database.
- Claude Code combined with Sonnet 3.5 is described as a key inflection that made terminal-driving coding agents feel good enough to do useful work.
- Model reliability is described as having reached a point where Simon can often one-shot small engineering changes with short prompts and predict outcomes confidently.
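The "paginated JSON API against a database" task mentioned above can be sketched without a web framework. This is an illustrative stand-in (schema and names are assumptions, not from the source) showing keyset pagination over SQLite, the class of change described as reliably one-shottable.

```python
# Minimal keyset-pagination sketch over SQLite; schema and names are
# illustrative. A real API would wrap page() in an HTTP handler.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item-{i}",) for i in range(1, 8)])

def page(after_id: int = 0, limit: int = 3) -> str:
    """Return one JSON page of items, keyset-paginated by id."""
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # Cursor for the next page, or None when this page is the last one.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return json.dumps({
        "items": [{"id": r[0], "name": r[1]} for r in rows],
        "next": next_cursor,
    })

first = json.loads(page())   # ids 1..3, "next" cursor == 3
```

Keyset pagination (WHERE id > cursor) is used here rather than OFFSET because it stays stable as rows are inserted, a common reason this pattern appears in such tasks.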
Agent Security Centers On Containment Not Prompt Sanitization
- Despite understanding the risks, Simon is described as often running Claude locally with permission safeguards disabled for convenience while attempting to mitigate by avoiding untrusted repository instructions.
- The term prompt injection is argued to be misleading because, unlike SQL parameterization, LLM prompting offers no reliable mechanism for separating untrusted data from trusted instructions.
- A catastrophic prompt-attack risk is described as arising when an LLM combines three things: access to private data, exposure to malicious instructions, and an exfiltration channel for sending stolen information to an attacker.
- Safely running coding agents is described as depending primarily on sandboxing so that a compromised or misled agent has limited ability to cause harm.
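The containment approach in the last bullet can be sketched as wrapping each agent-issued shell command in a locked-down container. This is a hypothetical illustration (image name and mount paths are assumptions, not from the source); the point is that with no network there is no exfiltration channel, and a read-only checkout limits the blast radius.

```python
# Hypothetical containment sketch: build a docker argv that denies the
# agent network access and write access to the checkout. Image name and
# paths are illustrative, not from the source.

def sandboxed_argv(command: str, repo: str = "/work/repo") -> list[str]:
    """Wrap an agent's shell command in a restrictive docker invocation."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no exfiltration channel
        "--read-only",                # immutable container filesystem
        "--tmpfs", "/tmp",            # scratch space only
        "-v", f"{repo}:/repo:ro",     # code mounted read-only
        "--cap-drop", "ALL",          # drop all Linux capabilities
        "agent-sandbox:latest",
        "sh", "-c", command,
    ]

argv = sandboxed_argv("pytest -q")
```

The command list would then be passed to something like `subprocess.run(argv)`; the agent can still read code and run tests, but a misled agent cannot phone home or rewrite the host checkout.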
Quality Is Contextual And Can Be Engineered Via Refactoring Loops And Scaffolding
- Low-quality agent output is described as partly a controllable choice because iteratively prompting the agent to refactor can yield code quality that exceeds what a time-constrained human would produce.
- Coding agents are described as strongly replicating existing codebase patterns and templates, so maintaining a high-quality baseline and a few exemplar tests causes agents to extend the project in that same style.
- Whether code quality matters is described as context-dependent, with short-lived single-page tools tolerating low-quality code while long-term maintained systems require higher code quality.
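The refactoring-loop idea above can be sketched as a simple control loop: get the tests green first, then repeatedly prompt for behaviour-preserving refactors while the suite stays green. `run_agent` and `tests_pass` are illustrative stand-ins, not a real API from the source.

```python
# Hypothetical refactoring loop: one implementation pass, then bounded
# refactor rounds gated on the test suite staying green.
# run_agent and tests_pass are stand-in callables, not a real agent API.

def refactor_loop(run_agent, tests_pass, max_rounds: int = 3) -> int:
    """Return the number of refactor rounds actually performed."""
    run_agent("Implement the feature; run the tests until they pass.")
    rounds = 0
    while rounds < max_rounds and tests_pass():
        run_agent("Refactor for clarity without changing behaviour; "
                  "keep the test suite green.")
        rounds += 1
    return rounds
```

Bounding the rounds matters: each pass costs little human time, which is the mechanism by which the summary describes output quality as a controllable choice rather than a fixed property of the model.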
Ecosystem Pressure On Component Markets And Open Source Maintenance
- AI-assisted programming is described as reducing demand for reusable UI component libraries because custom components can be generated on demand, while open source projects are described as being flooded with low-quality automated pull requests.
Watchlist
- A newly emerging practice is to have agents produce code that humans neither write nor read, and this practice is portrayed as potentially irresponsible even if some teams claim it works.
- Despite understanding the risks, Simon is described as often running Claude locally with permission safeguards disabled for convenience while attempting to mitigate by avoiding untrusted repository instructions.
- AI-assisted programming is described as reducing demand for reusable UI component libraries because custom components can be generated on demand, while open source projects are described as being flooded with low-quality automated pull requests.
Unknowns
- What are the measured one-shot success rates and defect rates for agent-generated changes across a stable suite of tasks, models, and codebases?
- How much does explicit test/TDD prompting change outcome quality compared to agents operating without those instructions?
- How often do automated test suites pass while agent-produced systems still fail to boot or fail basic end-to-end behaviors, and what minimum smoke-test set mitigates this?
- Do tools that document agent actions (such as Showboat-style transcripts) measurably improve auditability, debugging speed, or collaboration outcomes?
- How effective is test-suite derivation from multiple implementations at detecting behavioral drift and preventing regressions over time in real projects?