Rosa Del Mar

Daily Brief

Issue 73 2026-03-14

Testing As The Primary Control Surface In Agentic Engineering

9 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:16

Key takeaways

  • In the episode, Simon described conformance-driven development: using an LLM to derive a shared test suite from multiple existing implementations, then implementing a new system to satisfy that suite.
  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon asserted that low-quality agent output is partly a controllable choice because iteratively prompting an agent to refactor can yield code quality exceeding what a time-constrained human would produce.
  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.

Sections

Testing As The Primary Control Surface In Agentic Engineering

  • In the episode, Simon described conformance-driven development: using an LLM to derive a shared test suite from multiple existing implementations, then implementing a new system to satisfy that suite.
  • In the episode, Simon argued that because agents can generate and iterate on tests at near-zero human cost, tests are effectively no longer optional in agent-assisted coding workflows.
  • In the episode, Simon asserted that starting agent coding sessions by telling the agent how to run the tests and instructing it to follow red-green TDD increases the likelihood of producing working code.
  • In the episode, Simon asserted that having agents perform manual end-to-end checks (such as starting a server and using curl) can catch failures that a passing automated test suite misses, including cases where the server does not boot.
  • In the episode, Simon stated that he built a tool called Showboat that records an agent's manual testing steps into a Markdown document including commands run and their outputs.
  • In the episode, Simon argued that testing against production user data should be avoided and that agent-assisted mocking and synthetic data generation can create specific edge-case users on demand instead.
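The conformance-driven approach described above can be sketched as one shared test suite run against every implementation. The functions below (`slugify_v1`, `slugify_v2`) are hypothetical stand-ins for independently written implementations, not anything from the episode:

```python
# Conformance-driven development sketch: a shared suite that multiple
# implementations must satisfy. A new implementation is "done" when it
# passes the same cases the existing ones do.
import re

def slugify_v1(text: str) -> str:
    # Existing implementation A.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def slugify_v2(text: str) -> str:
    # Existing implementation B, written differently but expected to agree.
    out = []
    prev_dash = True
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
            prev_dash = False
        elif not prev_dash:
            out.append("-")
            prev_dash = True
    return "".join(out).rstrip("-")

# The shared suite, which an LLM could derive by probing both
# implementations and recording their agreed-upon behavior.
CONFORMANCE_CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  ", "spaces"),
    ("already-slugged", "already-slugged"),
]

def run_conformance(impl):
    for raw, expected in CONFORMANCE_CASES:
        assert impl(raw) == expected, (impl.__name__, raw)

for impl in (slugify_v1, slugify_v2):
    run_conformance(impl)
```

A new implementation would simply be appended to the final loop; the suite, not a human reviewer, arbitrates whether it conforms.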

Adoption Progression And Trust Model For Agent Output

  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon described a staged progression of AI tool adoption for programmers from chatbot Q&A to using coding agents that eventually write more code than the programmer does.
  • In the episode, Simon proposed treating trust in AI output similarly to trust in other internal teams' services by relying on interfaces and documentation and inspecting internals primarily when failures occur.
  • In the episode, Simon reported high confidence that for familiar task classes a strong model can reliably generate correct implementations such as a paginated JSON API backed by a database.
  • In the episode, Simon stated that model reliability has reached a point where he can often one-shot small engineering changes with short prompts and predict outcomes confidently.
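The "paginated JSON API backed by a database" mentioned above is the kind of familiar task class in question. A minimal illustrative sketch, using an in-memory sqlite3 table and a handler shape chosen here for brevity (neither is from the episode):

```python
# Minimal paginated JSON endpoint backed by a database: the sort of
# well-trodden task a strong model is claimed to one-shot reliably.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 26)],  # 25 sample rows
)

def items_page(page: int = 1, per_page: int = 10) -> str:
    """Return one page of items as a JSON string."""
    offset = (page - 1) * per_page
    rows = conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()
    total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    return json.dumps({
        "page": page,
        "per_page": per_page,
        "total": total,
        "items": [{"id": r[0], "name": r[1]} for r in rows],
    })
```

With 25 rows and `per_page=10`, page 2 starts at id 11 and page 3 holds the final 5 rows, which is exactly the kind of outcome the claim says is predictable from a short prompt.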

Security Framing: Containment Over Sanitization

  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon argued that 'prompt injection' is a misleading term because, unlike SQL injection mitigations via parameterization, there is no reliable way to separate untrusted data from trusted instructions in LLM prompting.
  • In the episode, Simon described a 'lethal trifecta' risk condition for LLM systems: access to private data, exposure to malicious instructions, and an exfiltration channel to send stolen information to an attacker.
  • In the episode, Simon asserted that safely running coding agents depends primarily on sandboxing so that a compromised or misled agent has limited ability to cause harm.
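The trifecta framing above lends itself to a toy audit: flag any agent configuration that combines all three risk conditions. The capability names below are hypothetical labels, not a real inventory scheme:

```python
# Toy audit of the 'lethal trifecta': private data access, exposure to
# malicious instructions, and an exfiltration channel. All three
# together means containment (e.g. sandboxing) is needed.
TRIFECTA = {"private_data_access", "untrusted_input", "network_egress"}

def trifecta_risk(capabilities: set) -> bool:
    """True when all three trifecta conditions co-occur."""
    return TRIFECTA <= capabilities

# Removing any one leg (here, egress) breaks the exfiltration chain,
# which is the containment-over-sanitization point: you cannot
# reliably sanitize instructions, but you can deny a capability.
assert trifecta_risk({"private_data_access", "untrusted_input", "network_egress"})
assert not trifecta_risk({"private_data_access", "untrusted_input"})
```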

Code Quality Tradeoffs And Leverage Via Refactoring Loops And Scaffolding

  • In the episode, Simon asserted that low-quality agent output is partly a controllable choice because iteratively prompting an agent to refactor can yield code quality exceeding what a time-constrained human would produce.
  • In the episode, Simon asserted that coding agents strongly replicate existing codebase patterns and templates, and that maintaining a high-quality baseline plus exemplar tests causes agents to extend the project in the same style.
  • In the episode, Simon asserted that whether code quality matters depends on context: short-lived tools can tolerate lower-quality code, while long-term maintained systems require higher quality.

Ecosystem Effects: Component Commoditization And OSS Maintainer Load

  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • In the episode, Simon suggested that open source projects are being flooded with low-quality automated pull requests.

Watchlist

  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • In the episode, Simon suggested that open source projects are being flooded with low-quality automated pull requests.

Unknowns

  • What are measured defect rates, incident rates, and time-to-fix metrics for agent-built features compared with human-built features under comparable constraints?
  • How much do explicit testing instructions, red-green TDD prompting, and agent-run smoke tests improve success rates, and under what project conditions do they fail?
  • How often do teams actually run 'no-read' pipelines, and what compensating controls (sandboxing, monitoring, formal methods, restricted interfaces) are used when they do?
  • What is the prevalence of running agents with permission safeguards disabled, and is this correlated with security incidents or near-misses?
  • In real deployments, how often do systems satisfy the 'lethal trifecta' conditions, and which mitigations most effectively break the chain (data minimization, input hardening, egress controls, sandboxing)?

Investor overlay

Read-throughs

  • Rising spend on testing and verification tooling as agent workflows rely on explicit test-running, red-green prompting, smoke tests, and conformance suites to make delegated code dependable.
  • Increased demand for sandboxing and containment controls as primary mitigations for agent risks, especially if teams run agents with permission safeguards disabled for convenience.
  • Pressure on reusable UI component library value if teams generate custom components on demand, shifting budgets toward tooling that accelerates safe generation and integration.

What would confirm

  • Product messaging and customer case studies emphasizing test-first agent workflows, recorded manual test transcripts, agent-run smoke tests, and conformance-driven test suite derivation as core adoption drivers.
  • Security guidance and enterprise requirements shifting toward sandboxing, restricted interfaces, monitoring, and egress controls for agents, with explicit acknowledgement that sanitization and instruction data separation are insufficient.
  • Developer surveys or maintainer reports showing reduced reliance on reusable UI component libraries and increased on-demand component generation in day-to-day workflows.

What would kill

  • Credible benchmarks showing agent-built features match or beat human-built features on defect rates and time-to-fix without increased test investment or specialized verification workflows.
  • Operational evidence that running agents with safeguards disabled is rare and not associated with incidents or near-misses, reducing urgency for containment tooling.
  • Adoption data indicating reusable UI component libraries remain stable in usage and budget, with teams preferring standardized components over generated custom UI.

Sources