Tests As The Primary Control Plane For Agent-Written Code
Sources: 1 • Confidence: Medium • Updated: 2026-03-15 09:33
Key takeaways
- Conformance-driven development uses an LLM to derive a shared test suite from multiple existing implementations, then implements a new system that satisfies that suite.
- A newly emerging practice is to have agents produce code that humans neither write nor read.
- The presenter reports often running Claude locally with permission safeguards disabled for convenience, and mitigates the risk by avoiding instructions from untrusted repositories.
- Low-quality agent output is partly a choice: iterative prompting for refactoring can yield code quality that exceeds what a time-constrained human would produce.
- AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
Sections
Tests As The Primary Control Plane For Agent-Written Code
- Conformance-driven development uses an LLM to derive a shared test suite from multiple existing implementations, then implements a new system that satisfies that suite.
- In agent-assisted coding workflows, tests are effectively no longer optional because agents can generate and iterate on tests at near-zero human cost.
- Starting an agent coding session by telling it how to run tests and to follow red-green TDD increases the likelihood the agent produces working code.
- Having agents perform manual end-to-end checks such as starting a server and using curl can catch failures that a passing automated test suite misses (e.g., a server not booting).
- The presenter built a tool called Showboat that records an agent’s manual testing steps into a Markdown document including commands run and their outputs.
- Testing against production user data should be avoided in favor of agent-assisted mocking and synthetic data generation for edge cases.
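The conformance-driven pattern described above can be sketched as a shared test suite run against every implementation; the suite, derived from the behavior the existing implementations agree on, becomes the spec for the new one. All names here are hypothetical, not from the talk:

```python
# Conformance-driven development sketch: two existing implementations of a
# hypothetical slugify() contract, plus a shared suite a new one must pass.

def slugify_v1(text):
    # Existing implementation A.
    return "-".join(text.lower().split())

def slugify_v2(text):
    # Existing implementation B: different internals, same observable contract.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return "-".join(cleaned.split())

# Shared conformance suite: (input, expected) pairs the implementations agree on.
CONFORMANCE_CASES = [
    ("Hello World", "hello-world"),
    ("  spaces   everywhere ", "spaces-everywhere"),
]

def conforms(impl):
    """True when an implementation satisfies every shared case."""
    return all(impl(inp) == want for inp, want in CONFORMANCE_CASES)

# A new (agent-written) implementation is accepted once the suite passes.
def slugify_new(text):
    return "-".join(text.lower().split())

assert conforms(slugify_v1) and conforms(slugify_v2) and conforms(slugify_new)
```

In a real project the same shape would typically live in a pytest suite parametrized over the implementations, so a failing case pinpoints which one diverges from the shared contract.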
Agentic Adoption Stages And Workflow Delegation
- A newly emerging practice is to have agents produce code that humans neither write nor read.
- AI tool adoption for programmers tends to progress from asking chatbots questions to using coding agents that eventually write more code than the programmer does.
- A proposed trust model for AI output is to treat it like an internal service: rely on interfaces and documentation, and inspect internals mainly when failures occur.
- Claude Code combined with Claude 3.7 Sonnet is described as an inflection point that made terminal-driving coding agents feel useful enough to do real work.
- The presenter reports that model reliability has reached a point where they can often one-shot small engineering changes with short prompts and predict outcomes confidently.
Security Model For Agents: Containment Over Sanitization Analogies
- The presenter reports often running Claude locally with permission safeguards disabled for convenience, and mitigates the risk by avoiding instructions from untrusted repositories.
- The presenter argues the term 'prompt injection' is misleading because, unlike SQL parameterization separating data from code, there is no reliable way to separate untrusted data from trusted instructions in an LLM prompt.
- A catastrophic exfiltration risk arises when an LLM has access to private data, is exposed to malicious instructions, and has an exfiltration channel to send information to an attacker.
- Safely running coding agents depends primarily on sandboxing so that a compromised or misled agent has limited ability to cause harm.
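The exfiltration threat model above reduces to a three-way conjunction, which can be expressed as a simple capability check. This is a sketch only; the function and parameter names are hypothetical, not any real agent API:

```python
# Sketch of the exfiltration threat model: a run is catastrophic-risk only
# when all three conditions hold simultaneously.

def exfiltration_risk(private_data: bool,
                      untrusted_input: bool,
                      exfil_channel: bool) -> bool:
    """True when the agent has private data access, is exposed to
    untrusted instructions, AND has a channel to send data out."""
    return private_data and untrusted_input and exfil_channel

# Removing any one leg breaks the attack chain, which is why containment
# (e.g. a sandbox with no network egress) is the primary mitigation.
assert exfiltration_risk(True, True, True) is True
assert exfiltration_risk(True, True, False) is False
```

In practice the "remove one leg" move is containment, for example running the agent in a container started without network access (`docker run --network none ...`), so the exfiltration channel is absent even if the agent is misled.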
Code Quality As An Adjustable Parameter Via Iteration And Scaffolding
- Low-quality agent output is partly a choice: iterative prompting for refactoring can yield code quality that exceeds what a time-constrained human would produce.
- Coding agents strongly replicate existing codebase patterns and templates, so maintaining a high-quality baseline and exemplar tests causes agents to extend the project in that same style.
- Whether code quality matters depends on context: short-lived single-page tools can tolerate low-quality code, while long-term maintained systems require higher code quality.
Ecosystem Impacts: Components And Open-Source Maintenance Load
- AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
- Open source projects are being flooded with low-quality automated pull requests.
Watchlist
- A newly emerging practice is to have agents produce code that humans neither write nor read.
- The presenter reports often running Claude locally with permission safeguards disabled for convenience, and mitigates the risk by avoiding instructions from untrusted repositories.
- AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
- Open source projects are being flooded with low-quality automated pull requests.
Unknowns
- What are measured defect rates, rollback rates, and time-to-fix for agent-generated routine features compared with human-written features under similar constraints?
- How common are 'no human reads the code' pipelines in practice, and what compensating controls (tests, runtime monitoring, sandboxing, audits) correlate with acceptable incident rates?
- How effective is sandboxing in real deployments at preventing the harms highlighted by the exfiltration threat model, including egress pathways and secret access?
- What is the prevalence of developers disabling permission safeguards for convenience, and how strongly does that behavior correlate with security incidents or near-misses?
- How much do exemplar templates, tests, and codebase conventions measurably influence downstream agent output quality and maintainability across different projects and models?