The Autonomy Ladder
A cross-cutting framework for deciding how much an LLM solution should do on its own
Every scenario in this catalog implicitly asks the same question: how much does the system do before a human steps in? The autonomy ladder is a shared vocabulary for that decision, so teams can talk about "shipping at Tier 2" or "graduating Tier 3 → Tier 4" instead of arguing about whether something is "AI-powered" or "agentic."
The Four Tiers
Tier 1 — Advisor
The system suggests, the human does. No state is mutated by the model. Outputs are inline tips, recommendations, ranked options.
- Examples: code-completion ghost text, "people also asked," personalized recommendation tiles.
- Failure mode of a bad suggestion: the human ignores it.
- Right tier when: stakes are individual, output is read in seconds, the human is the producer of the artifact.
Tier 2 — Copilot
The system drafts, the human edits and commits. The output is a fully-formed artifact (a reply, a summary, a memo, a JSON record) that a human will inspect and ship.
- Examples: drafted email reply, meeting recap with action items, generated product description awaiting publish, AP invoice extracted and pre-filled.
- Failure mode of a bad draft: the human catches it before it leaves the building.
- Right tier when: throughput matters but the human's name is still on the output, errors have meaningful cost, and review is cheaper than authoring.
Tier 3 — Supervised Actor
The system proposes an action that has real-world side effects; a human approves before it executes. The model can call tools but writes only after a human clicks confirm.
- Examples: agent that drafted a Xero reconciliation entry awaiting one-click approval, refund request prepared for a CSM to confirm, customer reply queued for shift lead sign-off.
- Failure mode of a bad action: caught in the approval queue, never reaches the customer / ledger / production.
- Right tier when: the action is reversible-but-painful (financial, customer-facing, account-changing) and approval can be done in seconds.
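The propose-then-approve flow of Tier 3 can be sketched as a small approval queue. This is a minimal illustration, not a prescribed implementation: `ProposedAction`, `ApprovalQueue`, and `post_refund` are hypothetical names, and a real system would persist the queue and record who approved what.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProposedAction:
    """An action the model wants to take; nothing runs until a human approves."""
    action_id: str
    description: str
    execute: Callable[[], str]  # the side effect, deferred until approval
    status: str = "pending"     # pending -> approved | rejected


class ApprovalQueue:
    def __init__(self) -> None:
        self._items: Dict[str, ProposedAction] = {}

    def propose(self, action: ProposedAction) -> None:
        # The model's tool call lands here instead of touching the ledger.
        self._items[action.action_id] = action

    def pending(self) -> List[ProposedAction]:
        return [a for a in self._items.values() if a.status == "pending"]

    def approve(self, action_id: str) -> str:
        action = self._require_pending(action_id)
        action.status = "approved"
        return action.execute()  # the write happens only now

    def reject(self, action_id: str) -> None:
        action = self._require_pending(action_id)
        action.status = "rejected"  # a bad proposal never executes

    def _require_pending(self, action_id: str) -> ProposedAction:
        action = self._items[action_id]
        if action.status != "pending":
            raise ValueError(f"{action_id} is already {action.status}")
        return action


# Hypothetical usage: the "ledger" is just a list standing in for a real system.
ledger: List[str] = []


def post_refund() -> str:
    ledger.append("refund:order-1001")
    return "executed"


queue = ApprovalQueue()
queue.propose(ProposedAction("a1", "Refund order 1001", post_refund))
# At this point the ledger is still empty; queue.approve("a1") performs the write.
```

The design choice worth noting is that the side effect is captured as a deferred callable, so the only code path that mutates state runs inside `approve`.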
Tier 4 — Autonomous Operator
The system acts without per-item human approval, within a defined policy. A human still sets the policy, watches dashboards, and handles escalations.
- Examples: auto-tagging support tickets, autocategorizing transactions, agent that resolves a defined class of password resets, low-value high-confidence claims auto-paid.
- Failure mode of a bad action: it happened, and you're cleaning up. Mitigated by scope limits, kill switches, and rate limits.
- Right tier when: high volume, low per-event cost-of-error, clear policy, instrumented enough to detect anomalies fast.
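The three mitigations named for Tier 4 (scope limits, kill switches, rate limits) can be combined into one guard that every item passes through before the system acts. The sketch below is an assumption-laden illustration: `Tier4Guard`, the item class names, and the one-minute window are all hypothetical, and anything the guard declines is escalated to a human rather than dropped.

```python
import time
from collections import deque
from typing import Deque, Optional, Set


class Tier4Guard:
    """Decides whether an item may be handled without per-item approval."""

    def __init__(self, allowed_classes: Set[str], max_per_minute: int) -> None:
        self.allowed_classes = allowed_classes  # scope limit
        self.max_per_minute = max_per_minute    # rate limit
        self.enabled = True                     # kill switch
        self._recent: Deque[float] = deque()    # timestamps of recent auto actions

    def decide(self, item_class: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        if not self.enabled:
            return "escalate: kill switch off"
        if item_class not in self.allowed_classes:
            return "escalate: out of scope"
        # Forget auto actions that are at least 60 seconds old.
        while self._recent and now - self._recent[0] >= 60.0:
            self._recent.popleft()
        if len(self._recent) >= self.max_per_minute:
            return "escalate: rate limit"
        self._recent.append(now)
        return "auto"


# Hypothetical usage: only password resets are in scope, at most 2 per minute.
guard = Tier4Guard(allowed_classes={"password_reset"}, max_per_minute=2)
decision = guard.decide("password_reset")
```

Setting `guard.enabled = False` at runtime sends every item back to humans, which is the behavior a kill switch needs; in practice that flag would live in a config store rather than in process memory.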
Where Each Scenario Tends to Start
| Scenario | Typical launch tier | Common graduation |
|---|---|---|
| Conversation Intelligence | 2 | 4 for tagging/scoring; 2 for action items |
| Document-Driven Action | 2 | 3 for low-risk classes; 4 for high-confidence repeats |
| Knowledge Assistant | 1 | stays 1; the user always reads |
| Content Generation | 2 | 4 only with strong eval + brand guardrails |
| Decision Review | 1 or 2 | rarely past 2; decisions carry liability |
| Data Extraction | 2 | 4 with confidence-gated routing |
| Research & Synthesis | 2 | usually stays 2; output is human-consumed |
| Personalization | 1 or 4 | depends on UX: auto-applied recommendations are 4, dialog is 1–2 |
| Workflow Automation | 3 | 4 per workflow type as confidence accrues |
| Multimodal Inspection | 2 | 4 with calibrated confidence thresholds |
| Coaching & Simulation | 1 | stays 1; feedback is the product |
| Creative Assistance | 1 | stays 1; the human is the author |
| Voice Automation | 3 | 4 for defined intents; escalate the rest |
| Trust & Safety | 4 | 4 from day one; humans audit decisions |
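Several rows above graduate to Tier 4 only behind confidence-gated routing. A minimal sketch of that gate follows, assuming the model emits a calibrated confidence score in [0, 1]; the function name `route` and both threshold values are hypothetical and would need to be set from evaluation data for each decision class.

```python
def route(confidence: float,
          auto_threshold: float = 0.95,
          review_threshold: float = 0.70) -> str:
    """Pick a handling path for one item based on model confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= auto_threshold:
        return "auto"      # Tier 4 path: act without per-item approval
    if confidence >= review_threshold:
        return "review"    # Tier 2 path: human checks a pre-filled draft
    return "manual"        # too uncertain: human handles it from scratch


# Hypothetical usage on three extracted invoices.
paths = [route(c) for c in (0.99, 0.80, 0.40)]
```

The gate is only as good as the calibration behind it: a score of 0.95 should correspond to roughly 95% observed correctness on held-out data before it is trusted as an automation threshold.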
Graduation Criteria
A workflow earns the next tier when:
- Measured agreement with humans on a fresh, held-out sample exceeds the bar for that decision class.
- Error cost × error rate is acceptable — not just accuracy, but expected loss.
- Recoverability exists — undo, dispute, reversal, audit log.
- Anomaly detection is in place — you'll know within hours, not weeks, if quality drifts.
- A kill switch can flip the workflow back down a tier without a deploy.
Without all five, do not promote. The costliest LLM incidents tend to be Tier 3 workflows that were promoted to Tier 4 informally, usually after a senior person said "looks good" for a week.
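The five criteria can be written down as an explicit check so that promotion is a recorded decision rather than an informal one. The sketch below is one possible encoding, with expected loss computed as error rate × cost per error; the class name, field names, and the per-item loss budget are hypothetical, and each team would supply its own bars.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GraduationEvidence:
    agreement_rate: float     # agreement with humans on a fresh held-out sample
    agreement_bar: float      # required bar for this decision class
    error_rate: float         # observed fraction of wrong outputs
    cost_per_error: float     # average cost of one error, in currency units
    max_expected_loss: float  # acceptable expected loss per item
    recoverable: bool         # undo, dispute, reversal, audit log exist
    anomaly_detection: bool   # drift would be noticed within hours
    kill_switch: bool         # can drop a tier without a deploy


def expected_loss(e: GraduationEvidence) -> float:
    """Expected loss per item: error rate times cost per error."""
    return e.error_rate * e.cost_per_error


def unmet_criteria(e: GraduationEvidence) -> List[str]:
    """Return the names of criteria that are not yet satisfied."""
    unmet = []
    if not e.agreement_rate > e.agreement_bar:
        unmet.append("agreement")
    if not expected_loss(e) <= e.max_expected_loss:
        unmet.append("expected loss")
    if not e.recoverable:
        unmet.append("recoverability")
    if not e.anomaly_detection:
        unmet.append("anomaly detection")
    if not e.kill_switch:
        unmet.append("kill switch")
    return unmet


def may_promote(e: GraduationEvidence) -> bool:
    # All five must hold; a single gap blocks promotion.
    return not unmet_criteria(e)


# Hypothetical evidence: accurate and cheap to get wrong, but no kill switch yet.
evidence = GraduationEvidence(
    agreement_rate=0.97, agreement_bar=0.95,
    error_rate=0.03, cost_per_error=2.0, max_expected_loss=0.10,
    recoverable=True, anomaly_detection=True, kill_switch=False,
)
gaps = unmet_criteria(evidence)
```

Returning the list of unmet criteria, rather than a bare boolean, gives the team the remaining graduation work as a concrete to-do list.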
Two Anti-Patterns
- Premature Tier 4. Shipping autonomous behavior before evaluation infrastructure exists. The model is good enough on the demo set, the team announces "AI handles this now," then real traffic exposes the long tail. Recovery means rolling back publicly.
- Permanent Tier 1. A capable system kept at Tier 1 forever because nobody owns the graduation work. The capability is real, but no business value compounds. Promotion is itself a feature.
How to Use This Page
When scoping a new LLM project, decide the launch tier explicitly:
- What tier does the strongest stakeholder want? (usually too high)
- What tier does the eval evidence support today? (usually 1 or 2)
- What's the cheapest first step that creates measurable value? (often Tier 2)
- What would need to be true to graduate? (write it down; this becomes the eval / monitoring roadmap)
The right answer is rarely the highest tier. The right answer is the tier you can defend.