Agent Permission Delegation With Pre-Execution Safety Gate
Sources: 1 • Confidence: High • Updated: 2026-03-25 17:55
Key takeaways
- The action-review classifier runs on Claude Sonnet 4.6 even when the main Claude Code session uses a different model.
- Claude Code ships extensive default auto-mode filters and allows users to customize them with their own rules.
- A commentator argues that prompt-injection protections that rely on AI are not reliable because they are non-deterministic.
- Allowing "pip install -r requirements.txt" by default leaves a supply-chain attack surface when dependencies are unpinned, since each install may resolve to a newly published (possibly malicious) version.
- Claude Code documentation acknowledges the classifier may allow risky actions when user intent is ambiguous or context is insufficient.
Sections
Agent Permission Delegation With Pre-Execution Safety Gate
- The action-review classifier runs on Claude Sonnet 4.6 even when the main Claude Code session uses a different model.
- Claude Code introduced an "auto mode" permissions setting as an alternative to using --dangerously-skip-permissions.
- In auto mode, Claude makes permission decisions on the user's behalf while safeguards monitor actions before they run.
- A separate classifier model reviews the conversation before each action and blocks actions that exceed task scope, target untrusted infrastructure, or appear driven by hostile content encountered in files or web pages.
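The pre-execution gate described above — a separate reviewer consulted before every action, which either approves or blocks it — can be sketched as a generic hook. Everything below is illustrative: the names (`Verdict`, `gated_execute`) and the stubbed string-matching classifier are assumptions for the sketch, not Claude Code's actual API or model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allow: bool
    reason: str


def stub_classifier(conversation: str, action: str) -> Verdict:
    # Hypothetical stand-in for the separate review model: flags actions
    # that look like scope escalation or hostile-content-driven commands.
    for marker in ("curl | bash", "rm -rf /", "force push"):
        if marker in action:
            return Verdict(False, f"blocked: matched {marker!r}")
    return Verdict(True, "in scope")


def gated_execute(conversation: str, action: str,
                  run: Callable[[str], None],
                  review: Callable[[str, str], Verdict] = stub_classifier) -> Verdict:
    # The review runs before the action; the action executes only on approval.
    verdict = review(conversation, action)
    if verdict.allow:
        run(action)
    return verdict
```

The key property is ordering: nothing reaches `run` until the reviewer has seen the full conversation plus the proposed action.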
Default Policy Boundaries: Repo-Scoped Access And Soft-Denied Risky Actions
- Claude Code ships extensive default auto-mode filters and allows users to customize them with their own rules.
- Auto-mode defaults define "project scope" as the repository where the session started and treat access to locations like ~/, ~/Library/, /etc, or other repositories as a scope escalation.
- Auto-mode defaults soft-deny higher-risk actions, including destructive Git operations (e.g., force push), pushing directly to the default branch, download-and-execute patterns for external code (e.g., curl | bash or unsafe deserialization), and mass deletion of cloud storage.
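The repo-scoping part of these defaults is checkable without any model: resolve a target path and test whether it stays inside the repository where the session started. A minimal sketch, assuming the function name and semantics (this is not Claude Code's internal check):

```python
from pathlib import Path


def is_scope_escalation(target: str, repo_root: str) -> bool:
    # "Project scope" = the repository where the session started; any
    # resolved path outside it (~/, ~/Library/, /etc, another repo)
    # counts as a scope escalation.
    t = Path(target).resolve()
    root = Path(repo_root).resolve()
    return not t.is_relative_to(root)
```

`Path.resolve()` also normalizes symlinks and `..` segments, so a path like `myrepo/../../etc/passwd` is correctly flagged rather than slipping through a naive prefix comparison.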
Limitations And Contested Reliability Of Model-Based Prompt-Injection Defenses
- A commentator argues that prompt-injection protections that rely on AI are not reliable because they are non-deterministic.
- Claude Code documentation acknowledges the classifier may allow risky actions when user intent is ambiguous or context is insufficient.
- A commentator prefers coding agents to run in a robust default sandbox that deterministically restricts file access and network connections rather than relying on prompt-based protections like auto mode.
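The commentator's distinction can be made concrete: a deterministic gate is a pure function of the action, so identical input always yields an identical verdict, whereas a sampled model's judgment can vary run to run. A hypothetical first-match-wins rule table (the rule syntax here is invented for illustration, not Claude Code's user-defined rule format):

```python
import fnmatch

# Deterministic policy: ordered (pattern, decision) pairs, evaluated
# first-match-wins. No model, no sampling -- the same command string
# always produces the same decision.
RULES = [
    ("git push --force*", "deny"),
    ("curl *|*bash*", "deny"),
    ("pip install *", "ask"),
    ("git *", "allow"),
    ("pytest*", "allow"),
]


def decide(command: str, default: str = "ask") -> str:
    for pattern, decision in RULES:
        if fnmatch.fnmatch(command, pattern):
            return decision
    return default
```

Note the trade-off the commentator accepts: deterministic rules cannot reason about intent or conversation context, so anything not matched falls back to a conservative default rather than a judgment call.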
Supply-Chain Risk Surface In Default-Allowed Dependency Installation
- Allowing "pip install -r requirements.txt" by default leaves a supply-chain attack surface when dependencies are unpinned, since each install may resolve to a newly published (possibly malicious) version.
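This risk is itself detectable deterministically: an unpinned requirement line (no `==`, direct URL, or hash) can float to whatever version the index serves at install time. A small heuristic checker (not a full PEP 508 parser; the function name is illustrative):

```python
import re

# Treat a line as pinned if it has an exact version, a direct URL
# reference, or a --hash option.
PINNED = re.compile(r"==|@\s*\S+|--hash=")


def unpinned_requirements(text: str) -> list[str]:
    # Return requirement lines that may resolve to a different version
    # on each install -- the window a supply-chain attacker needs.
    flagged = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or line.startswith("-"):
            continue  # options such as -r or --index-url
        if not PINNED.search(line):
            flagged.append(line)
    return flagged
```

As a harder mitigation, pip's real `--require-hashes` mode refuses to install anything not pinned with a hash, turning this check into an enforced policy.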
Watchlist
- Allowing "pip install -r requirements.txt" by default leaves a supply-chain attack surface when dependencies are unpinned, since each install may resolve to a newly published (possibly malicious) version.
Unknowns
- How often does the action-review classifier incorrectly allow risky actions (false negatives) in real-world use, especially under ambiguous instructions or partial context?
- What is the false-positive rate (unnecessary blocks/soft-denies) for common developer workflows, and what are the highest-friction categories?
- What are the exact semantics and expressiveness of user-defined rules (what can be constrained, how rules are evaluated, and whether rules are auditable/versioned)?
- How does the system define and manage "trusted" versus "untrusted" infrastructure targets, and can this be configured per organization?
- Does allowing dependency installation by default lead to measurable supply-chain exposure when requirements are unpinned, and are there planned or existing mitigations (e.g., requiring lockfiles)?