Rosa Del Mar

Daily Brief

Issue 73 2026-03-14

Testing As The Primary Control Surface In Agentic Engineering

9 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:16

Key takeaways

  • In the episode, Simon described conformance-driven development: using an LLM to derive a shared test suite from multiple existing implementations, then implementing a new system to satisfy that suite.
  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon asserted that low-quality agent output is partly a controllable choice because iteratively prompting an agent to refactor can yield code quality exceeding what a time-constrained human would produce.
  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.

Sections

Testing As The Primary Control Surface In Agentic Engineering

  • In the episode, Simon described conformance-driven development: using an LLM to derive a shared test suite from multiple existing implementations, then implementing a new system to satisfy that suite.
  • In the episode, Simon argued that because agents can generate and iterate on tests at near-zero human cost, tests are effectively no longer optional in agent-assisted coding workflows.
  • In the episode, Simon asserted that starting agent coding sessions by telling the agent how to run the tests and instructing it to follow red-green TDD increases the likelihood of producing working code.
  • In the episode, Simon asserted that having agents perform manual end-to-end checks (such as starting a server and using curl) can catch failures that a passing automated test suite misses, including cases where the server does not boot.
  • In the episode, Simon stated that he built a tool called Showboat that records an agent's manual testing steps into a Markdown document including commands run and their outputs.
  • In the episode, Simon argued that testing against production user data should be avoided and that agent-assisted mocking and synthetic data generation can create specific edge-case users on demand instead.
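The conformance-driven approach described above can be sketched as one shared test suite run against every implementation. The functions below (`slugify_v1`, `slugify_v2`) are hypothetical stand-ins for independently written implementations, not anything from the episode:

```python
# Conformance-driven development sketch: a shared suite that multiple
# implementations must satisfy. A new implementation is "done" when it
# passes the same cases the existing ones do.
import re

def slugify_v1(text: str) -> str:
    # Existing implementation A.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def slugify_v2(text: str) -> str:
    # Existing implementation B, written differently but expected to agree.
    out = []
    prev_dash = True
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
            prev_dash = False
        elif not prev_dash:
            out.append("-")
            prev_dash = True
    return "".join(out).rstrip("-")

# The shared suite, which an LLM could derive by probing both
# implementations and recording their agreed-upon behavior.
CONFORMANCE_CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  ", "spaces"),
    ("already-slugged", "already-slugged"),
]

def run_conformance(impl):
    for raw, expected in CONFORMANCE_CASES:
        assert impl(raw) == expected, (impl.__name__, raw)

for impl in (slugify_v1, slugify_v2):
    run_conformance(impl)
```

A new implementation would simply be appended to the final loop; the suite, not a human reviewer, arbitrates whether it conforms.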

Adoption Progression And Trust Model For Agent Output

  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon described a staged progression of AI tool adoption for programmers from chatbot Q&A to using coding agents that eventually write more code than the programmer does.
  • In the episode, Simon proposed treating trust in AI output similarly to trust in other internal teams' services by relying on interfaces and documentation and inspecting internals primarily when failures occur.
  • In the episode, Simon reported high confidence that for familiar task classes a strong model can reliably generate correct implementations such as a paginated JSON API backed by a database.
  • In the episode, Simon stated that model reliability has reached a point where he can often one-shot small engineering changes with short prompts and predict outcomes confidently.
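The "paginated JSON API backed by a database" mentioned above is the kind of familiar task class in question. A minimal illustrative sketch, using an in-memory sqlite3 table and a handler shape chosen here for brevity (neither is from the episode):

```python
# Minimal paginated JSON endpoint backed by a database: the sort of
# well-trodden task a strong model is claimed to one-shot reliably.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 26)],  # 25 sample rows
)

def items_page(page: int = 1, per_page: int = 10) -> str:
    """Return one page of items as a JSON string."""
    offset = (page - 1) * per_page
    rows = conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()
    total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    return json.dumps({
        "page": page,
        "per_page": per_page,
        "total": total,
        "items": [{"id": r[0], "name": r[1]} for r in rows],
    })
```

With 25 rows and `per_page=10`, page 2 starts at id 11 and page 3 holds the final 5 rows, which is exactly the kind of outcome the claim says is predictable from a short prompt.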

Security Framing: Containment Over Sanitization

  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon argued that 'prompt injection' is a misleading term because, unlike SQL injection mitigations via parameterization, there is no reliable way to separate untrusted data from trusted instructions in LLM prompting.
  • In the episode, Simon described a 'lethal trifecta' risk condition for LLM systems: access to private data, exposure to malicious instructions, and an exfiltration channel to send stolen information to an attacker.
  • In the episode, Simon asserted that safely running coding agents depends primarily on sandboxing so that a compromised or misled agent has limited ability to cause harm.
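The trifecta framing above lends itself to a toy audit: flag any agent configuration that combines all three risk conditions. The capability names below are hypothetical labels, not a real inventory scheme:

```python
# Toy audit of the 'lethal trifecta': private data access, exposure to
# malicious instructions, and an exfiltration channel. All three
# together means containment (e.g. sandboxing) is needed.
TRIFECTA = {"private_data_access", "untrusted_input", "network_egress"}

def trifecta_risk(capabilities: set) -> bool:
    """True when all three trifecta conditions co-occur."""
    return TRIFECTA <= capabilities

# Removing any one leg (here, egress) breaks the exfiltration chain,
# which is the containment-over-sanitization point: you cannot
# reliably sanitize instructions, but you can deny a capability.
assert trifecta_risk({"private_data_access", "untrusted_input", "network_egress"})
assert not trifecta_risk({"private_data_access", "untrusted_input"})
```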

Code Quality Tradeoffs And Leverage Via Refactoring Loops And Scaffolding

  • In the episode, Simon asserted that low-quality agent output is partly a controllable choice because iteratively prompting an agent to refactor can yield code quality exceeding what a time-constrained human would produce.
  • In the episode, Simon asserted that coding agents strongly replicate existing codebase patterns and templates, and that maintaining a high-quality baseline plus exemplar tests causes agents to extend the project in the same style.
  • In the episode, Simon asserted that whether code quality matters depends on context: short-lived tools can tolerate lower-quality code, while long-term maintained systems require higher quality.

Ecosystem Effects: Component Commoditization And OSS Maintainer Load

  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • In the episode, Simon suggested that open source projects are being flooded with low-quality automated pull requests.

Watchlist

  • In the episode, Simon characterized an emerging practice where agents produce code that humans neither write nor read, and suggested this may be irresponsible even if some teams claim it works.
  • In the episode, Simon stated that he often runs Claude locally with permission safeguards disabled for convenience and tries to mitigate risk by avoiding untrusted repository instructions.
  • In the episode, Simon suggested that AI-assisted programming is reducing demand for reusable UI component libraries because custom components can be generated on demand.
  • In the episode, Simon suggested that open source projects are being flooded with low-quality automated pull requests.

Unknowns

  • What are measured defect rates, incident rates, and time-to-fix metrics for agent-built features compared with human-built features under comparable constraints?
  • How much do explicit testing instructions, red-green TDD prompting, and agent-run smoke tests improve success rates, and under what project conditions do they fail?
  • How often do teams actually run 'no-read' pipelines, and what compensating controls (sandboxing, monitoring, formal methods, restricted interfaces) are used when they do?
  • What is the prevalence of running agents with permission safeguards disabled, and is this correlated with security incidents or near-misses?
  • In real deployments, how often do systems satisfy the 'lethal trifecta' conditions, and which mitigations most effectively break the chain (data minimization, input hardening, egress controls, sandboxing)?

Investor overlay

Read-throughs

  • Rising spend on testing and verification tooling as agent workflows rely on explicit test-running, red-green prompting, smoke tests, and conformance suites to make delegated code dependable.
  • Increased demand for sandboxing and containment controls as primary mitigations for agent risks, especially if teams run agents with permission safeguards disabled for convenience.
  • Pressure on reusable UI component library value if teams generate custom components on demand, shifting budgets toward tooling that accelerates safe generation and integration.

What would confirm

  • Product messaging and customer case studies emphasizing test-first agent workflows, recorded manual test transcripts, agent-run smoke tests, and conformance-driven test suite derivation as core adoption drivers.
  • Security guidance and enterprise requirements shifting toward sandboxing, restricted interfaces, monitoring, and egress controls for agents, with explicit acknowledgement that sanitization and instruction data separation are insufficient.
  • Developer surveys or maintainer reports showing reduced reliance on reusable UI component libraries and increased on-demand component generation in day-to-day workflows.

What would kill

  • Credible benchmarks showing agent-built features match or beat human-built features on defect rates and time-to-fix without increased test investment or specialized verification workflows.
  • Operational evidence that running agents with safeguards disabled is rare and not associated with incidents or near-misses, reducing urgency for containment tooling.
  • Adoption data indicating reusable UI component libraries remain stable in usage and budget, with teams preferring standardized components over generated custom UI.

Sources