Auto-Mode Permission Automation With Pre-Action Safety Classification

Issue 83 • Edition 2026-03-24 • 6 min read

General

Sources: 1 • Confidence: High • Updated: 2026-04-12 10:20

Key takeaways

The action-review classifier runs on Claude Sonnet 4.6 even when the main Claude Code session uses a different model.
Simon Willison says he is unconvinced by prompt-injection protections that rely on AI, because they are non-deterministic.
Claude Code auto mode ships with extensive default filters and allows users to customize them with their own rules.
The documentation acknowledges that the classifier may allow risky actions when user intent is ambiguous or context is insufficient.
One concern raised is that allowing "pip install -r requirements.txt" by default may not protect against supply-chain attacks when dependencies are unpinned; the concern is linked to a LiteLLM-related incident.

The action-review classifier runs on Claude Sonnet 4.6 even when the main Claude Code session uses a different model.
Claude Code introduced an "auto mode" permissions setting as an alternative to using the --dangerously-skip-permissions option.
In auto mode, Claude makes permission decisions on the user's behalf, and safeguards review each action before it runs.
A separate classifier model reviews the conversation before each action and blocks actions that exceed task scope, target untrusted infrastructure, or appear driven by hostile content encountered in files or web pages.
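
The review-before-run flow described above can be sketched as a small gate. This is an illustrative outline only, not Claude Code's implementation: the names `BlockReason`, `Verdict`, `gated_run`, and `toy_classifier` are hypothetical, and the real reviewer is a model rather than the string match used here as a stand-in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class BlockReason(Enum):
    # The three block conditions named in the text above.
    EXCEEDS_TASK_SCOPE = "exceeds task scope"
    UNTRUSTED_INFRASTRUCTURE = "targets untrusted infrastructure"
    HOSTILE_CONTENT = "appears driven by hostile content"


@dataclass
class Verdict:
    allowed: bool
    reason: Optional[BlockReason] = None


def gated_run(
    conversation: List[str],
    action: str,
    classify: Callable[[List[str], str], Verdict],
    execute: Callable[[str], str],
) -> str:
    """Ask the classifier first; call execute() only when the verdict allows it."""
    verdict = classify(conversation, action)
    if not verdict.allowed:
        reason = verdict.reason.value if verdict.reason else "unspecified"
        return f"blocked: {reason}"
    return execute(action)


def toy_classifier(conversation: List[str], action: str) -> Verdict:
    # Hypothetical stand-in so the sketch runs; a real reviewer would be a model call.
    if "untrusted.example" in action:
        return Verdict(False, BlockReason.UNTRUSTED_INFRASTRUCTURE)
    return Verdict(True)
```

The point of the shape is that `execute` is never reached on a block verdict, so the quality of the gate depends entirely on the quality of `classify`.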

Simon Willison says he is unconvinced by prompt-injection protections that rely on AI, because they are non-deterministic.
The documentation acknowledges that the classifier may allow risky actions when user intent is ambiguous or context is insufficient.
One concern raised is that allowing "pip install -r requirements.txt" by default may not protect against supply-chain attacks when dependencies are unpinned; the concern is linked to a LiteLLM-related incident.
Willison prefers coding agents to run in a robust default sandbox that deterministically restricts file access and network connections, rather than relying on prompt-based protections such as auto mode.
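
As one example of a deterministic complement to the classifier, a check for unpinned entries in a requirements file could look like the sketch below. It assumes that "pinned" means an exact `==` or `===` version with no wildcard; the helper name `unpinned_requirements` is hypothetical, and the check ignores hashes and does not evaluate option lines such as `-r` or `-e`.

```python
import re
from typing import List

# Pinned = package name, optional extras, exact "==" or "===" version, no wildcard.
_PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*===?\s*[^\s;,*]+\s*(;.*)?$"
)


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in text.splitlines():
        # Drop comments (a '#' at line start or after whitespace) and line continuations.
        line = re.sub(r"(^|\s+)#.*$", "", raw).rstrip("\\").strip()
        # Skip blanks and pip option lines (-r, -e, --hash, ...); they are not evaluated here.
        if not line or line.startswith("-"):
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Running this before any install step and failing when the returned list is non-empty gives the same answer on every run, which is the property the sandbox argument above is asking for.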

Claude Code auto mode ships with extensive default filters and allows users to customize them with their own rules.
By default, "project scope" is defined as the repository where the Claude Code session started, and access to locations such as ~/, ~/Library/, /etc, or other repositories is treated as scope escalation.
The default policy soft-denies higher-risk actions, including destructive Git operations (such as force push), pushing directly to the default branch, downloading and executing external code, unsafe deserialization, and mass deletion of cloud storage.
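
To make the scope and soft-deny defaults concrete, here is a minimal sketch of how such checks could be expressed as deterministic rules. It is not the tool's actual rule system: `in_project_scope`, `soft_deny_reason`, and the two regular expressions are hypothetical, and only two of the listed categories are covered.

```python
import re
from pathlib import Path
from typing import Optional


def in_project_scope(target: str, repo_root: str) -> bool:
    """True if target, after resolving symlinks and '..', lies inside repo_root."""
    root = Path(repo_root).expanduser().resolve()
    candidate = Path(target).expanduser()
    if not candidate.is_absolute():
        candidate = root / candidate  # treat relative paths as relative to the repo
    candidate = candidate.resolve()
    return candidate == root or root in candidate.parents


# Two illustrative rules; real policies would need many more and careful shell parsing.
SOFT_DENY_RULES = [
    ("force push", re.compile(r"\bgit\s+push\b.*\s(--force|-f)\b")),
    ("download and execute", re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b")),
]


def soft_deny_reason(command: str) -> Optional[str]:
    """Return the label of the first matching soft-deny rule, or None."""
    for label, pattern in SOFT_DENY_RULES:
        if pattern.search(command):
            return label
    return None
```

Resolving the path before comparing is what catches `../` traversal and symlinks that point outside the repository; a plain string-prefix comparison would miss both.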

What are the observed false-positive and false-negative rates of the action-review classifier in auto mode across common development workflows?
What artifacts are produced for auditability (logs of proposed actions, allow/deny decisions, policy rule matches, and user overrides), and how long are they retained?
How are “untrusted infrastructure” and “hostile content” operationally defined, and can these definitions be tuned or enforced deterministically?
What is the exact policy language / rule system for user customization, and what guardrails prevent overly permissive configurations?
How does the repo-based scope model handle monorepos, generated code directories, submodules, vendored dependencies, and workspace symlinks?

Agentic developer tools may shift from manual permission prompts to automated action gating, increasing adoption if safety friction stays low and outcomes are auditable.
Demand may rise for deterministic controls like sandboxing and dependency pinning as complements to AI-based prompt-injection defenses, especially after supply-chain concerns tied to dependency installs.
A fixed-model safety boundary for action review could become a product differentiator, but it concentrates trust in classifier reliability and invites scrutiny of false negatives under ambiguous intent.

Published or disclosed false-positive and false-negative rates for the action-review classifier in common workflows, plus evidence of low risky-action escape rates under ambiguous context.
Clear audit artifacts and retention policies for proposed actions, allow/deny decisions, rule matches, and user overrides that enable enterprise governance and incident response.
Deterministic enforcement additions such as sandboxing controls and stricter dependency installation policies, including handling of unpinned requirements and download-and-execute patterns.

Documented incidents where auto mode allowed high-impact risky actions due to ambiguity or missing context, indicating classifier unreliability in real development environments.
User customization enables overly permissive configurations without strong guardrails, leading to material safety regressions or policy bypass in typical setups.
Repo-scope enforcement breaks in complex repos such as monorepos, symlinks, submodules, or vendored code, enabling unintended access or action escalation.