# YOLO Classifier — Auto-Mode Permission System
## Why this exists
When users enable "auto mode" (aka YOLO mode), the harness needs to decide which tool calls to auto-approve and which to block — without asking the user every time. This is done by a side-query classifier: a separate, fast API call that evaluates the tool action against customizable rules.
The classifier is one of the most interesting patterns from the Claude Code architecture because it solves a real tension: users want speed (no permission prompts), but the harness must prevent destructive actions (deleting files, force-pushing, dropping tables). The YOLO classifier resolves this by offloading the decision to a cheap, fast model that applies user-defined rules.
## Architecture overview
```text
Tool call requested
        |
Is tool in safe-allowlist? --- YES ---> auto-approve (skip classifier)
        | NO
Is tool a file edit in CWD? --- YES ---> auto-approve (acceptEdits fast path)
        | NO
Side-query to fast model (Haiku) with:
  - Tool name + input summary
  - Conversation transcript (compressed)
  - Customizable allow/deny rules
        |
Classifier returns: { should_block: bool, reason: str }
        |
should_block=true  ---> DENY with reason
should_block=false ---> ALLOW
```
Three layers of checks run in order. Most tool calls resolve at the first two layers and never hit the classifier at all.
## Safe-allowlist (skip classifier entirely)
These tools are so safe they never need classifier checking. They are read-only, metadata-only, or internal coordination tools:
```python
# Tools that bypass the classifier entirely in auto mode (immutable on purpose).
SAFE_ALLOWLISTED_TOOLS = frozenset({
    # Read-only file operations
    "read_file", "grep", "glob", "lsp",
    # Search / discovery
    "tool_search", "list_mcp_resources", "read_mcp_resource",
    # Task management (metadata only)
    "todo_write", "task_create", "task_get",
    "task_update", "task_list", "task_stop",
    # Plan mode / UI interaction
    "ask_user_question", "enter_plan_mode", "exit_plan_mode",
    # Agent coordination (internal mailbox only)
    "team_create", "team_delete", "send_message",
    # Misc safe
    "sleep",
})


def is_auto_mode_allowlisted(tool_name: str) -> bool:
    """Return True if the tool can skip the classifier in auto mode."""
    return tool_name in SAFE_ALLOWLISTED_TOOLS
```
The allowlist is intentionally conservative. Adding a tool here means it can never be blocked, even if the user has a deny rule that matches it.
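One consequence of that precedence is that a `soft_deny` rule aimed at an allowlisted tool is silently dead. A harness could warn about this when it loads the rules. The sketch below is hypothetical (`find_shadowed_deny_rules` is not part of the source design) and only a heuristic: it flags rules that mention an allowlisted tool name as a whole word, so it will miss rules that describe the tool in other words.

```python
import re
from collections.abc import Iterable


def find_shadowed_deny_rules(
    soft_deny: Iterable[str],
    allowlist: Iterable[str],
) -> list[tuple[str, str]]:
    """Return (rule, tool) pairs where a deny rule names an allowlisted tool.

    Hypothetical config-time check: allowlisted tools skip the classifier,
    so a deny rule targeting one of them can never take effect.
    """
    tools = sorted(allowlist)
    shadowed = []
    for rule in soft_deny:
        for tool in tools:
            # Whole-word, case-insensitive match so "glob" doesn't hit "globally"
            if re.search(rf"\b{re.escape(tool)}\b", rule, re.IGNORECASE):
                shadowed.append((rule, tool))
    return shadowed


allowlist = {"read_file", "grep", "glob", "sleep"}
deny = ("rm -rf", "never grep inside secrets/", "don't install packages globally")
for rule, tool in find_shadowed_deny_rules(deny, allowlist):
    print(f"warning: deny rule {rule!r} mentions allowlisted tool {tool!r}")
```

Surfacing the warning at load time is cheaper than discovering later that a rule the user relied on was never consulted.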
## Customizable rules
Users can customize what auto-mode allows and denies through three rule categories:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoModeRules:
    allow: tuple[str, ...] = ()        # e.g. "git status", "npm test", "python -m pytest"
    soft_deny: tuple[str, ...] = ()    # e.g. "rm -rf", "DROP TABLE", "force push"
    environment: tuple[str, ...] = ()  # e.g. "Node.js project", "uses PostgreSQL"
```
- allow: Commands and patterns that should be auto-approved. These are things the user trusts and runs frequently.
- soft_deny: Commands and patterns that should be blocked even in auto mode. These are destructive or irreversible operations.
- environment: Context about the project that helps the classifier make better decisions. Not rules — just hints.
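Because `AutoModeRules` is frozen and stores tuples, rules read from a settings file have to be converted before construction. A minimal sketch, assuming the rules live in a JSON object with optional `allow`, `soft_deny`, and `environment` string lists; that file layout and the `load_auto_mode_rules` name are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoModeRules:
    allow: tuple[str, ...] = ()
    soft_deny: tuple[str, ...] = ()
    environment: tuple[str, ...] = ()


def load_auto_mode_rules(raw_json: str) -> AutoModeRules:
    """Build AutoModeRules from a JSON object of string lists (hypothetical layout)."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("rules config must be a JSON object")
    fields = {}
    for key in ("allow", "soft_deny", "environment"):
        values = data.get(key, [])
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{key!r} must be a list of strings")
        # Strip whitespace, drop blank entries, store as an immutable tuple
        fields[key] = tuple(v.strip() for v in values if v.strip())
    return AutoModeRules(**fields)


rules = load_auto_mode_rules(
    '{"allow": ["git status", "npm test"], "soft_deny": ["rm -rf", "  "],'
    ' "environment": ["Node.js project"]}'
)
print(rules)
```

Validating types up front means a mistake such as `"allow": "git status"` (a string instead of a list) fails loudly at load time instead of being split character by character into the prompt.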
Rules are passed directly into the classifier prompt, so they can be written in natural language. A rule like "never run migrations without asking" can work as well as a literal pattern like "DROP TABLE", because the classifier interprets both.
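Since rules reach the classifier as plain prompt text, the rendering step is ordinary string formatting. A minimal sketch (the `render_rules` helper is hypothetical); one choice it makes is printing `(none)` under an empty category rather than leaving a bare header, so the model is not left guessing whether rules were omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoModeRules:
    allow: tuple[str, ...] = ()
    soft_deny: tuple[str, ...] = ()
    environment: tuple[str, ...] = ()


def render_rules(rules: AutoModeRules) -> str:
    """Render the three rule categories as prompt sections (hypothetical helper)."""
    sections = (
        ("ALLOW rules (auto-approve these)", rules.allow),
        ("DENY rules (block these)", rules.soft_deny),
        ("Environment context", rules.environment),
    )
    parts = []
    for title, items in sections:
        # Rules pass through verbatim, one bullet each; "(none)" marks an empty category
        body = "\n".join(f"- {item}" for item in items) or "(none)"
        parts.append(f"{title}:\n{body}")
    return "\n\n".join(parts)


print(render_rules(AutoModeRules(
    allow=("npm test",),
    soft_deny=("never run migrations without asking", "DROP TABLE"),
)))
```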
## The classifier side-query
The classifier runs as a separate API call using a fast model (Haiku). It receives the tool name and input, a compressed slice of the recent transcript, and the rules, and returns a structured decision via forced tool use.
Using the Anthropic SDK:

```python
import json

import anthropic


def _bullets(items: tuple[str, ...]) -> str:
    """Render rules as '- rule' lines for the prompt."""
    return "\n".join(f"- {item}" for item in items)


async def classify_tool_action(
    client: anthropic.AsyncAnthropic,
    tool_name: str,
    tool_input: dict,
    transcript: list[dict],
    rules: AutoModeRules,
) -> tuple[bool, str]:
    """Side-query classifier. Returns (should_block, reason)."""
    # Build compressed transcript: last 20 entries as JSONL, 200 chars each
    transcript_text = "\n".join(
        json.dumps({
            "role": entry["role"],
            "action": str(entry.get("action", entry.get("content", "")))[:200],
        })
        for entry in transcript[-20:]
    )
    system_prompt = f"""You are a security classifier for an AI coding assistant.
Evaluate whether this tool action should be ALLOWED or BLOCKED.

ALLOW rules (auto-approve these):
{_bullets(rules.allow)}

DENY rules (block these):
{_bullets(rules.soft_deny)}

Environment context:
{_bullets(rules.environment)}

Respond with the classify_result tool."""
    try:
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            system=system_prompt,
            messages=[{
                "role": "user",
                "content": (
                    f"Tool: {tool_name}\n"
                    f"Input: {json.dumps(tool_input, default=str)}\n\n"
                    f"Recent transcript:\n{transcript_text}"
                ),
            }],
            tools=[{
                "name": "classify_result",
                "description": "Return classification decision",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "thinking": {
                            "type": "string",
                            "description": "Brief reasoning about the action",
                        },
                        "should_block": {"type": "boolean"},
                        "reason": {
                            "type": "string",
                            "description": "Why allow or block",
                        },
                    },
                    "required": ["thinking", "should_block", "reason"],
                },
            }],
            # Forced tool use: the model must answer via classify_result
            tool_choice={"type": "tool", "name": "classify_result"},
        )
    except Exception as exc:  # network error, rate limit, bad request...
        return True, f"Classifier error: {exc}"  # fail closed
    for block in response.content:
        if block.type == "tool_use":
            should_block = block.input.get("should_block")
            if isinstance(should_block, bool):
                return should_block, str(block.input.get("reason", ""))
    # Fail closed: no tool call, or a malformed decision
    return True, "Classifier failed to return decision"
```
The same classifier with the OpenAI Responses API:

```python
import json

from openai import AsyncOpenAI


async def classify_tool_action(
    client: AsyncOpenAI,
    tool_name: str,
    tool_input: dict,
    transcript: list[dict],
    rules: AutoModeRules,
) -> tuple[bool, str]:
    """Side-query classifier. Returns (should_block, reason)."""
    # Build compressed transcript: last 20 entries as JSONL, 200 chars each
    transcript_text = "\n".join(
        json.dumps({
            "role": e["role"],
            "action": str(e.get("action", e.get("content", "")))[:200],
        })
        for e in transcript[-20:]
    )
    try:
        response = await client.responses.create(
            model="gpt-4.1-mini",
            instructions=f"""You are a security classifier for an AI coding assistant.
Evaluate whether this tool action should be ALLOWED or BLOCKED.
ALLOW rules: {json.dumps(rules.allow)}
DENY rules: {json.dumps(rules.soft_deny)}
Environment: {json.dumps(rules.environment)}""",
            input=(
                f"Tool: {tool_name}\n"
                f"Input: {json.dumps(tool_input, default=str)}\n\n"
                f"Recent transcript:\n{transcript_text}"
            ),
            tools=[{
                "type": "function",
                "name": "classify_result",
                "description": "Return classification decision",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "thinking": {"type": "string"},
                        "should_block": {"type": "boolean"},
                        "reason": {"type": "string"},
                    },
                    "required": ["thinking", "should_block", "reason"],
                },
            }],
            # Forced function call: the model must answer via classify_result
            tool_choice={"type": "function", "name": "classify_result"},
        )
    except Exception as exc:  # network error, rate limit, bad request...
        return True, f"Classifier error: {exc}"  # fail closed
    for item in response.output:
        if item.type == "function_call":
            try:
                result = json.loads(item.arguments)
            except json.JSONDecodeError:
                break
            if isinstance(result, dict) and isinstance(result.get("should_block"), bool):
                return result["should_block"], str(result.get("reason", ""))
            break
    # Fail closed: no function call, or a malformed decision
    return True, "Classifier failed to return decision"
```
Key design decisions in the classifier:
- Forced tool use (`tool_choice`) makes the model answer through the `classify_result` schema, so the decision arrives as structured arguments rather than free text. No free-text parsing is needed.
- A compressed transcript gives the classifier conversational context without blowing up token usage: only the last 20 entries, each truncated to 200 characters.
- The `thinking` field, placed before `should_block`, asks the model to reason before deciding. This is meant to improve accuracy on ambiguous cases.
- Fail-closed: if the classifier call errors or returns no valid decision, the action is blocked.
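The fail-closed rule can also be enforced around the classifier, so that a hung request or an exception raised anywhere in the side query still turns into a block. A minimal sketch with `asyncio.wait_for`; the `fail_closed` wrapper and its 5-second default timeout are assumptions, not values from the source.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def fail_closed(
    classify: Callable[[], Awaitable[tuple[bool, str]]],
    timeout_s: float = 5.0,
) -> tuple[bool, str]:
    """Await a classifier call; a timeout, error, or malformed result means block."""
    try:
        result = await asyncio.wait_for(classify(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return True, f"Classifier timed out after {timeout_s}s"
    except Exception as exc:  # network errors, rate limits, parsing bugs...
        return True, f"Classifier error: {exc}"
    # Accept only a well-formed (bool, str) decision
    if (
        isinstance(result, tuple)
        and len(result) == 2
        and isinstance(result[0], bool)
        and isinstance(result[1], str)
    ):
        return result
    return True, "Classifier returned a malformed decision"


async def slow_classifier() -> tuple[bool, str]:
    await asyncio.sleep(10)  # simulates a hung API call
    return False, "looks fine"


async def broken_classifier() -> tuple[bool, str]:
    raise RuntimeError("503 from API")


async def main() -> None:
    print(await fail_closed(slow_classifier, timeout_s=0.1))
    print(await fail_closed(broken_classifier))


asyncio.run(main())
```

With a real classifier the call would look like `await fail_closed(lambda: classify_tool_action(client, name, tool_input, transcript, rules))`. `asyncio.CancelledError` is deliberately not caught (it is not an `Exception` subclass), so cancelling the surrounding task still works.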
## Full auto-mode permission flow
```python
from enum import Enum
from pathlib import Path


class PermissionDecision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"


async def auto_mode_check(
    tool_name: str,
    tool_input: dict,
    context,  # any object exposing .working_directory and .transcript
    classifier_client,
    rules: AutoModeRules,
) -> tuple[PermissionDecision, str]:
    """Complete auto-mode permission check."""
    # Fast path 1: safe-allowlisted tools
    if is_auto_mode_allowlisted(tool_name):
        return PermissionDecision.ALLOW, "Safe-allowlisted tool"

    # Fast path 2: file edits in working directory (acceptEdits mode)
    if tool_name in ("file_edit", "file_write"):
        file_path = tool_input.get("file_path", "")
        cwd = Path(context.working_directory).resolve()
        # Resolve ".." and symlinks, then compare path components (Python 3.9+).
        # A plain startswith() would accept "/repo-evil/x" or "/repo/../etc/x" for "/repo".
        if file_path and (cwd / file_path).resolve().is_relative_to(cwd):
            return PermissionDecision.ALLOW, "File edit in working directory"

    # Slow path: side-query classifier
    should_block, reason = await classify_tool_action(
        classifier_client,
        tool_name,
        tool_input,
        context.transcript,
        rules,
    )
    if should_block:
        return PermissionDecision.DENY, reason
    return PermissionDecision.ALLOW, reason
```
The three-layer design means the classifier is only called for actions the fast paths cannot settle. In practice, the safe-allowlist and acceptEdits fast paths handle 80%+ of tool calls, so the classifier only fires for bash commands, network requests, and other side-effecting operations.
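The fast-path share depends heavily on the workload (a session that mostly runs shell commands will reach the classifier far more often), so it is worth measuring on recorded sessions. A sketch that tallies which layer would settle each call; `resolving_layer`, the trimmed allowlist, and the sample calls are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path

# Trimmed allowlist and edit-tool names, for illustration only
ALLOWLIST = frozenset({"read_file", "grep", "glob", "todo_write"})
EDIT_TOOLS = ("file_edit", "file_write")


def resolving_layer(tool_name: str, tool_input: dict, cwd: str) -> str:
    """Return the first layer that would settle this call (hypothetical helper)."""
    if tool_name in ALLOWLIST:
        return "allowlist"
    if tool_name in EDIT_TOOLS:
        base = Path(cwd).resolve()
        path = tool_input.get("file_path", "")
        if path and (base / path).resolve().is_relative_to(base):
            return "accept_edits"
    return "classifier"


# A recorded session, reduced to (tool_name, tool_input) pairs
calls = [
    ("read_file", {"file_path": "src/app.py"}),
    ("grep", {"pattern": "TODO"}),
    ("file_edit", {"file_path": "src/app.py"}),
    ("file_write", {"file_path": "/etc/hosts"}),
    ("bash", {"command": "npm test"}),
]
counts = Counter(resolving_layer(name, args, "/home/dev/project") for name, args in calls)
fast = counts["allowlist"] + counts["accept_edits"]
for layer in ("allowlist", "accept_edits", "classifier"):
    print(f"{layer:>12}: {counts[layer]}")
print(f"fast-path share: {fast / len(calls):.0%}")
```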
## Build it yourself
- Define a safe-allowlist of read-only tools that never need classification
- Add an acceptEdits fast path for file operations in the working directory
- Implement a side-query classifier using a fast model with structured output (forced tool use)
- Make rules customizable with allow/deny patterns plus environment context
- Log classifier decisions for debugging — dump the full request/response when a flag is set
- Fail closed — if the classifier errors, block the action and surface the error to the user
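For the logging item, one option is a wrapper that logs every decision and dumps its inputs and output as JSON when a debug flag is set. A minimal sketch using the standard `logging` module; the `YOLO_CLASSIFIER_DEBUG` environment variable and `logged_classify` are hypothetical names. The wrapper only sees the arguments and the decision, so dumping the literal prompt sent to the model would need a hook inside the classifier itself.

```python
import asyncio
import json
import logging
import os
from collections.abc import Awaitable, Callable

logger = logging.getLogger("auto_mode.classifier")

# A classifier with its client, transcript, and rules already bound
Classifier = Callable[[str, dict], Awaitable[tuple[bool, str]]]


def logged_classify(classify: Classifier) -> Classifier:
    """Wrap a classifier: log every decision, dump inputs/outputs when the flag is set."""
    async def wrapper(tool_name: str, tool_input: dict) -> tuple[bool, str]:
        should_block, reason = await classify(tool_name, tool_input)
        verdict = "BLOCK" if should_block else "ALLOW"
        logger.info("%s -> %s: %s", tool_name, verdict, reason)
        # Checked on every call so the flag can be flipped mid-session
        if os.environ.get("YOLO_CLASSIFIER_DEBUG") == "1":
            logger.debug("classifier dump: %s", json.dumps({
                "tool_name": tool_name,
                "tool_input": tool_input,
                "should_block": should_block,
                "reason": reason,
            }, default=str))
        return should_block, reason
    return wrapper


async def stub_classifier(tool_name: str, tool_input: dict) -> tuple[bool, str]:
    # Stand-in for a real side-query classifier
    return tool_name == "bash", "stub decision"


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    os.environ["YOLO_CLASSIFIER_DEBUG"] = "1"
    classify = logged_classify(stub_classifier)
    print(asyncio.run(classify("bash", {"command": "rm -rf build/"})))
```

The debug dump goes through `logger.debug`, so it only appears when logging is configured at `DEBUG` level in addition to the flag being set.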
**Cost optimization**

The classifier uses a fast, cheap model (Haiku / GPT-4.1-mini). Each classification costs roughly $0.001, with the exact figure depending on prompt length and current model pricing. The safe-allowlist and acceptEdits fast paths avoid the classifier call for the majority of tool uses, keeping the per-session cost negligible.
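The per-call figure is just token counts times per-token prices. A back-of-the-envelope sketch; the token counts and per-million-token prices are placeholder assumptions to replace with measured prompt sizes and the providers' current price lists, and the 0.8 fast-path share is the 80% estimate from the permission-flow section.

```python
def classification_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one classifier call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Assumed sizes: system prompt + rules + 20 transcript lines + tool schema,
# and a short forced-tool-use reply.
INPUT_TOKENS = 1_500
OUTPUT_TOKENS = 100

# Assumed prices in USD per million tokens; check current pricing.
per_call = classification_cost(INPUT_TOKENS, OUTPUT_TOKENS, 1.00, 5.00)
print(f"per call: ${per_call:.4f}")

# Session estimate: only calls that miss both fast paths reach the classifier.
tool_calls, fast_path_share = 300, 0.8
classifier_calls = round(tool_calls * (1 - fast_path_share))
print(f"{classifier_calls} classifier calls -> ${classifier_calls * per_call:.2f}")
```

With these placeholder numbers a call comes to about $0.002, the same order of magnitude as the figure above; the session total scales with how many calls miss both fast paths.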
**Security note**

The soft_deny rules are advisory, not a security boundary. The classifier is itself a model reading text, so a determined model can rephrase or disguise a command until it no longer resembles any deny rule. For hard security boundaries, use the permission system described in the Permissions chapter instead.