Rosa Del Mar

Daily Brief

Issue 73 2026-03-14

Tests As The Primary Control Plane For Agent-Written Code

General
Sources: 1 • Confidence: Medium • Updated: 2026-03-15 09:33

Key takeaways

  • In conformance-driven development, an LLM derives a shared test suite from multiple existing implementations, and a new system is then built to satisfy that suite.
  • A newly emerging practice is to have agents produce code that humans neither write nor read.
  • The presenter reports often running Claude locally with permission safeguards disabled for convenience, mitigating the risk by avoiding repositories that might contain untrusted instructions.
  • Low-quality agent output is partly a choice: iterative prompting for refactoring can yield code quality that exceeds what a time-constrained human would produce.
  • AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.

Sections

Tests As The Primary Control Plane For Agent-Written Code

  • In conformance-driven development, an LLM derives a shared test suite from multiple existing implementations, and a new system is then built to satisfy that suite.
  • In agent-assisted coding workflows, tests are effectively no longer optional because agents can generate and iterate on tests at near-zero human cost.
  • Starting an agent coding session by telling it how to run tests and to follow red-green TDD increases the likelihood the agent produces working code.
  • Having agents perform manual end-to-end checks such as starting a server and using curl can catch failures that a passing automated test suite misses (e.g., a server not booting).
  • The presenter built a tool called Showboat that records an agent’s manual testing steps into a Markdown document including commands run and their outputs.
  • Testing against production user data should be avoided in favor of agent-assisted mocking and synthetic data generation for edge cases.
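The conformance-driven idea in the first bullet can be sketched as a shared behavioural spec checked against every implementation. The two slugify functions below are hypothetical stand-ins for the "multiple existing implementations"; in practice an LLM would derive the case table from their observed behaviour.

```python
# Minimal sketch of conformance-driven development: a shared case table
# acts as the behavioural spec, and any implementation (old or new) must
# satisfy it. Both slugify variants here are illustrative placeholders.
import re

def slugify_a(text: str) -> str:
    # existing implementation A: regex-based
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def slugify_b(text: str) -> str:
    # existing implementation B: character-by-character
    chars = [ch if ch.isalnum() else "-" for ch in text.lower()]
    return re.sub(r"-+", "-", "".join(chars)).strip("-")

# Shared behavioural spec derived from the existing implementations;
# a newly written system must satisfy the same cases.
CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  ", "spaces"),
    ("already-slugged", "already-slugged"),
]

def check_conformance(impl) -> bool:
    return all(impl(src) == want for src, want in CASES)
```

A new implementation is accepted only once `check_conformance(new_impl)` passes, which is also a natural red-green starting point for an agent session.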

Agentic Adoption Stages And Workflow Delegation

  • A newly emerging practice is to have agents produce code that humans neither write nor read.
  • AI tool adoption for programmers tends to progress from asking chatbots questions to using coding agents that eventually write more code than the programmer does.
  • A proposed trust model for AI output is to treat it like an internal service: rely on interfaces and documentation, and inspect internals mainly when failures occur.
  • Claude Code combined with Sonnet 3.5 is described as an inflection point that made terminal-driving coding agents feel useful enough to do real work.
  • The presenter reports that model reliability has reached a point where they can often one-shot small engineering changes with short prompts and predict outcomes confidently.

Security Model For Agents: Containment Over Sanitization Analogies

  • The presenter reports often running Claude locally with permission safeguards disabled for convenience, mitigating the risk by avoiding repositories that might contain untrusted instructions.
  • The presenter argues the term 'prompt injection' is misleading because, unlike SQL parameterization, there is no reliable way in LLM prompting to separate untrusted data from trusted instructions.
  • A catastrophic exfiltration risk arises when an LLM has access to private data, is exposed to malicious instructions, and has an exfiltration channel to send information to an attacker.
  • Safely running coding agents depends primarily on sandboxing so that a compromised or misled agent has limited ability to cause harm.
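The containment principle in the last bullet can be illustrated with a small wrapper that runs an agent-issued command with an allowlisted environment and a hard timeout. This is a sketch of the idea only, not real sandboxing: genuine containment needs an OS-level sandbox (container or VM) that also blocks network egress, the exfiltration channel in the threat model above.

```python
# Sketch of containment over sanitization: strip secrets from the
# environment via an allowlist and bound execution time. Illustrative
# only; it does not restrict filesystem or network access.
import os
import subprocess

SAFE_VARS = {"PATH", "HOME", "LANG"}  # allowlist, not a denylist

def run_contained(cmd: list[str], timeout: int = 30) -> str:
    env = {k: v for k, v in os.environ.items() if k in SAFE_VARS}
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout
```

An allowlist is the right default here: a denylist of known secret names fails open when a new credential variable appears.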

Code Quality As An Adjustable Parameter Via Iteration And Scaffolding

  • Low-quality agent output is partly a choice: iterative prompting for refactoring can yield code quality that exceeds what a time-constrained human would produce.
  • Coding agents strongly replicate existing codebase patterns and templates, so maintaining a high-quality baseline and exemplar tests causes agents to extend the project in that same style.
  • Whether code quality matters depends on context: short-lived single-page tools can tolerate low-quality code, while long-term maintained systems require higher code quality.
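The "quality as an adjustable parameter" idea above amounts to a loop: re-prompt the agent until a quality gate passes or an iteration budget runs out. The sketch below uses a toy gate and takes the agent call as a plain function argument, since any real agent API is outside the source material.

```python
# Sketch of iterative quality refinement. `propose_refactor` is a
# hypothetical stand-in for a call to a coding agent; the gate here is a
# toy check, where a real project would run linters, type checks, tests.
def quality_gate(code: str) -> bool:
    return "TODO" not in code and len(code.splitlines()) < 200

def iterate_until_clean(code: str, propose_refactor, budget: int = 5) -> str:
    for _ in range(budget):
        if quality_gate(code):
            return code
        code = propose_refactor(code)  # re-prompt with the gate's feedback
    return code  # budget exhausted; return best effort
```

The budget makes the cost-quality trade-off explicit: short-lived tools get a small budget, long-term systems a larger one.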

Ecosystem Impacts: Components And Open-Source Maintenance Load

  • AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • Open source projects are being flooded with low-quality automated pull requests.

Watchlist

  • A newly emerging practice is to have agents produce code that humans neither write nor read.
  • The presenter reports often running Claude locally with permission safeguards disabled for convenience, mitigating the risk by avoiding repositories that might contain untrusted instructions.
  • AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • Open source projects are being flooded with low-quality automated pull requests.

Unknowns

  • What are measured defect rates, rollback rates, and time-to-fix for agent-generated routine features compared with human-written features under similar constraints?
  • How common are 'no human reads the code' pipelines in practice, and what compensating controls (tests, runtime monitoring, sandboxing, audits) correlate with acceptable incident rates?
  • How effective is sandboxing in real deployments at preventing the harms highlighted by the exfiltration threat model, including egress pathways and secret access?
  • What is the prevalence of developers disabling permission safeguards for convenience, and how strongly does that behavior correlate with security incidents or near-misses?
  • How much do exemplar templates, tests, and codebase conventions measurably influence downstream agent output quality and maintainability across different projects and models?

Investor overlay

Read-throughs

  • Growing use of tests as the primary control plane for agent-written code could increase demand for tooling that generates, runs, and audits conformance test suites and turns agent actions into reviewable runbooks.
  • Shift toward delegated execution and no-human-reads-the-code pipelines could raise demand for runtime monitoring, sandboxing, and containment tooling as compensating controls for exfiltration and automation risk.
  • On-demand generation of custom UI components could reduce reliance on reusable UI component libraries and increase emphasis on internal scaffolding such as templates, exemplars, and tests to shape agent output quality.

What would confirm

  • More teams reporting agent workflows where tests define the behavioral spec, including deriving shared test suites from existing implementations and using those suites to validate new systems.
  • Operational adoption of sandboxing and containment practices tied to agent execution, plus increased use of auditable runbooks for agent actions and expanded runtime monitoring as standard controls.
  • Evidence of declining importance of reusable UI component libraries alongside increased generation of bespoke components, and stronger investment in project conventions and templates that measurably improve downstream agent code quality.

What would kill

  • Measured outcomes show agent-generated features have higher defect or rollback rates or longer time-to-fix than human-written features even with strong test-driven workflows and manual smoke checks.
  • Sandboxing proves ineffective in real deployments at preventing exfiltration due to common egress paths or secret access, or widespread disabling of permission safeguards correlates with more security incidents.
  • No sustained reduction in demand for reusable UI component libraries, or maintainers and teams reject automated generation due to quality issues and increased maintenance burden from low-quality automated pull requests.

Sources