The Autonomy Ladder

A cross-cutting framework for deciding how much an LLM solution should do on its own

Every scenario in this catalog implicitly asks the same question: how much does the system do before a human steps in? The autonomy ladder is a shared vocabulary for that decision, so teams can talk about "shipping at Tier 2" or "graduating Tier 3 → Tier 4" instead of arguing about whether something is "AI-powered" or "agentic."

The Four Tiers

Tier 1 — Advisor

The system suggests, the human does. No state is mutated by the model. Outputs are inline tips, recommendations, ranked options.

Examples: code-completion ghost text, "people also asked," personalized recommendation tiles.
Failure mode of a bad suggestion: the human ignores it.
Right tier when: stakes are individual, output is read in seconds, the human is the producer of the artifact.

Tier 2 — Copilot

The system drafts, the human edits and commits. The output is a fully-formed artifact (a reply, a summary, a memo, a JSON record) that a human will inspect and ship.

Examples: drafted email reply, meeting recap with action items, generated product description awaiting publish, AP invoice extracted and pre-filled.
Failure mode of a bad draft: the human catches it before it leaves the building.
Right tier when: throughput matters but the human's name is still on the output, errors have meaningful cost, and review is cheaper than authoring.

Tier 3 — Supervised Actor

The system proposes an action that has real-world side effects; a human approves before it executes. The model can call tools but writes only after a human clicks confirm.

Examples: agent that has drafted a Xero reconciliation entry awaiting one-click approval, refund request prepared for a CSM to confirm, customer reply queued for shift lead sign-off.
Failure mode of a bad action: caught in the approval queue, never reaches the customer / ledger / production.
Right tier when: the action is reversible-but-painful (financial, customer-facing, account-changing) and approval can be done in seconds.
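
A minimal sketch of the Tier 3 pattern, assuming a simple in-memory queue: the model can propose freely, but the side effect runs only inside the approval call. The names `ProposedAction` and `ApprovalQueue` are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProposedAction:
    """A side-effecting action the model has prepared but not executed."""
    description: str
    execute: Callable[[], str]      # hypothetical side effect, e.g. a ledger write
    status: str = "pending"         # pending -> approved | rejected
    reviewer: Optional[str] = None
    result: Optional[str] = None


class ApprovalQueue:
    """Holds proposed actions until a human approves or rejects each one."""

    def __init__(self) -> None:
        self._items: List[ProposedAction] = []

    def propose(self, action: ProposedAction) -> ProposedAction:
        # Safe for the model to call: nothing is written at this point.
        self._items.append(action)
        return action

    def pending(self) -> List[ProposedAction]:
        return [a for a in self._items if a.status == "pending"]

    def approve(self, action: ProposedAction, reviewer: str) -> str:
        # The only code path that triggers the side effect.
        self._require_pending(action)
        action.result = action.execute()
        action.status, action.reviewer = "approved", reviewer
        return action.result

    def reject(self, action: ProposedAction, reviewer: str) -> None:
        # A bad proposal stops here and never reaches the ledger.
        self._require_pending(action)
        action.status, action.reviewer = "rejected", reviewer

    @staticmethod
    def _require_pending(action: ProposedAction) -> None:
        if action.status != "pending":
            raise ValueError(f"action is already {action.status}")
```

The design choice worth noting is that `execute` is stored rather than called at proposal time, so a rejected item leaves no trace outside the queue.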

Tier 4 — Autonomous Operator

The system acts without per-item human approval, within a defined policy. A human still sets the policy, watches dashboards, and handles escalations.

Examples: auto-tagging support tickets, autocategorizing transactions, agent that resolves a defined class of password resets, low-value high-confidence claims auto-paid.
Failure mode of a bad action: it happened, and you're cleaning up. Mitigated by scope limits, kill switches, and rate limits.
Right tier when: high volume, low per-event cost-of-error, clear policy, instrumented enough to detect anomalies fast.
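
The three Tier 4 mitigations (scope limits, a kill switch, rate limits) can be sketched as a single policy check that runs before every autonomous action. This is an illustration under assumed names and thresholds, not a reference implementation: `OperatorPolicy`, its fields, and the 60-second window are all hypothetical.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Set


@dataclass
class OperatorPolicy:
    allowed_classes: Set[str]        # scope limit: which action classes may run
    max_amount: float                # scope limit: largest value handled alone
    max_actions_per_minute: int      # rate limit
    enabled: bool = True             # kill switch: set False to stop acting
    _recent: Deque[float] = field(default_factory=deque, repr=False)

    def decide(self, action_class: str, amount: float,
               now: Optional[float] = None) -> str:
        """Return "act" if the policy allows it, otherwise "escalate"."""
        now = time.monotonic() if now is None else now
        if not self.enabled:
            return "escalate"
        if action_class not in self.allowed_classes or amount > self.max_amount:
            return "escalate"
        # Forget actions that fell out of the 60-second window.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions_per_minute:
            return "escalate"
        self._recent.append(now)
        return "act"
```

Because `enabled` is plain data, it could be read from a feature flag or config store, which is one way to get a kill switch that does not require a deploy.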

Where Each Scenario Tends to Start

Scenario	Typical launch tier	Common graduation
Conversation Intelligence	2	4 for tagging/scoring; 2 for action items
Document-Driven Action	2	3 for low-risk classes; 4 for high-confidence repeats
Knowledge Assistant	1	stays 1; the user always reads
Content Generation	2	4 only with strong eval + brand guardrails
Decision Review	1 or 2	rarely past 2; decisions carry liability
Data Extraction	2	4 with confidence-gated routing
Research & Synthesis	2	usually stays 2; output is human-consumed
Personalization	1 or 4	depends on UX — recommendations are 4, dialog is 1–2
Workflow Automation	3	4 per workflow type as confidence accrues
Multimodal Inspection	2	4 with calibrated confidence thresholds
Coaching & Simulation	1	stays 1; feedback is the product
Creative Assistance	1	stays 1; the human is the author
Voice Automation	3	4 for defined intents; escalate the rest
Trust & Safety	4	4 from day one; humans audit decisions
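
Several rows above graduate via confidence-gated routing: items the model is confident about go straight through at Tier 4, and the rest fall back to human review at Tier 2. A minimal sketch, assuming the confidence score is calibrated and that 0.95 is only a placeholder threshold; `Extraction` and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    doc_id: str
    confidence: float   # assumed to be a calibrated score in [0.0, 1.0]


def route(item: Extraction, auto_threshold: float = 0.95) -> str:
    """Send high-confidence items to auto-commit, everything else to review."""
    if not 0.0 <= item.confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if item.confidence >= auto_threshold:
        return "auto_commit"     # Tier 4 path: no per-item approval
    return "human_review"        # Tier 2 path: a person inspects the draft
```

The threshold is a policy decision, so it would normally be set per decision class from held-out evaluation data rather than hard-coded.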

Graduation Criteria

A workflow earns the next tier when:

Measured agreement with humans on a fresh, held-out sample exceeds the bar for that decision class.
Error cost × error rate is acceptable — not just accuracy, but expected loss.
Recoverability exists — undo, dispute, reversal, audit log.
Anomaly detection is in place — you'll know within hours, not weeks, if quality drifts.
A kill switch can flip the workflow back down a tier without a deploy.

Without all five, do not promote. The most expensive LLM incidents in the wild are Tier 3 workflows that were promoted to Tier 4 informally, usually after a senior person said "looks good" for a week.
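
The five criteria can be written down as an explicit gate so that promotion is a recorded check rather than an informal judgment. The sketch below is one possible encoding; the field names and thresholds are assumptions chosen for illustration. Expected loss per item is computed as error rate times cost per error.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GraduationEvidence:
    agreement_rate: float              # measured on a fresh, held-out sample
    required_agreement: float          # the bar for this decision class
    error_rate: float                  # fraction of items handled wrongly
    cost_per_error: float              # in whatever currency the team tracks
    max_expected_loss_per_item: float  # the loss the business will accept
    has_recovery_path: bool            # undo, dispute, reversal, audit log
    has_anomaly_detection: bool        # drift noticed within hours
    has_kill_switch: bool              # drop a tier without a deploy


def expected_loss_per_item(e: GraduationEvidence) -> float:
    return e.error_rate * e.cost_per_error


def unmet_criteria(e: GraduationEvidence) -> List[str]:
    """Return the names of the criteria that are not yet satisfied."""
    unmet = []
    if e.agreement_rate <= e.required_agreement:
        unmet.append("agreement")
    if expected_loss_per_item(e) > e.max_expected_loss_per_item:
        unmet.append("expected_loss")
    if not e.has_recovery_path:
        unmet.append("recoverability")
    if not e.has_anomaly_detection:
        unmet.append("anomaly_detection")
    if not e.has_kill_switch:
        unmet.append("kill_switch")
    return unmet


def may_promote(e: GraduationEvidence) -> bool:
    # All five criteria must hold; any gap blocks promotion.
    return not unmet_criteria(e)
```

Returning the list of unmet criteria, rather than only a boolean, gives the team a concrete list of what to build before the next review.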

Two Anti-Patterns

Premature Tier 4. Shipping autonomous behavior before evaluation infrastructure exists. The model is good enough on the demo set, the team announces "AI handles this now," then real traffic exposes the long tail. Recovery means rolling back publicly.
Permanent Tier 1. A capable system kept at Tier 1 forever because nobody owns the graduation work. The capability is real, but no business value compounds. Promotion is itself a feature.

How to Use This Page

When scoping a new LLM project, decide the launch tier explicitly:

What tier does the strongest stakeholder want? (usually too high)
What tier does the eval evidence support today? (usually 1 or 2)
What's the cheapest first step that creates measurable value? (often Tier 2)
What would need to be true to graduate? (write it down; this becomes the eval / monitoring roadmap)

The right answer is rarely the highest tier. The right answer is the tier you can defend.
