Agentic Workflows
How to direct autonomous coding agents — plan-then-execute, self-verification loops, tool use, guardrails, reviewing agent diffs, and when to supervise versus let it run
An inline completion tool suggests the next line. An agent does the whole task. You give it a goal — "add rate limiting to the login endpoint, with tests" — and it reads the relevant files, makes a plan, edits multiple files, runs the tests, sees a failure, fixes it, and runs them again until they pass. The difference is autonomy: the agent decides what steps to take and executes them without you approving each keystroke.
This is the most powerful and the most dangerous mode of AI-assisted development. Powerful because a good agent can complete in fifteen minutes a task that would take you an hour of context-switching. Dangerous because an agent confidently doing the wrong thing across ten files is far harder to unwind than a single bad autocomplete. This page is about getting the power without the danger.
The Agentic Loop
Coding agents such as Claude Code and Cursor's agent mode run the same fundamental loop:
1. UNDERSTAND read the goal and the relevant code
2. PLAN decide the steps to take
3. ACT edit files, run commands
4. OBSERVE read the output (test results, errors, logs)
5. CORRECT if something failed, adjust and go back to ACT
6. STOP when the goal is met (or it's stuck)
The quality of an agent run depends on how well each stage works. A weak agent skips the OBSERVE step: it edits code but never runs it, so it cannot tell that it broke something. A strong agent closes the loop: it runs the tests, reads the failure, and fixes it. The single best predictor of agent quality is whether it can verify its own work.
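The six stages above can be sketched as a small TypeScript function. This is an illustration only, not how any particular agent is implemented: AgentSteps, runAgentLoop, and the maxAttempts cap are hypothetical names introduced here, and the real work of planning, editing, and running checks is injected as functions.

```typescript
// Hypothetical shapes for illustration; no real agent framework exposes these names.
type Observation = { ok: boolean; output: string };

interface AgentSteps {
  plan(goal: string): string[]; // PLAN: turn the goal into ordered steps
  act(step: string, feedback: string | null): void; // ACT: edit files, run commands
  observe(): Observation; // OBSERVE: run the checks and read the result
}

type RunResult = { status: "done" | "stuck"; attempts: number };

function runAgentLoop(goal: string, steps: AgentSteps, maxAttempts = 5): RunResult {
  const plan = steps.plan(goal);
  let feedback: string | null = null;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    for (const step of plan) {
      // On retries, the previous failure output is passed back in (CORRECT).
      steps.act(step, feedback);
    }
    const result = steps.observe();
    if (result.ok) {
      return { status: "done", attempts: attempt }; // STOP: goal met
    }
    feedback = result.output;
  }
  return { status: "stuck", attempts: maxAttempts }; // STOP: gave up
}
```

An agent that skips OBSERVE corresponds to an observe function that always reports ok: the loop exits after one attempt whether or not the code works.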
Plan-Then-Execute
The most reliable way to run an agent on anything non-trivial is to separate planning from execution. Ask for the plan first, review it, then let it run.
Before writing any code, give me a plan:
- Which files you'll change and why
- The order of operations
- How you'll test each change
- Anything you're unsure about
Wait for my approval before editing.
Reviewing a plan is cheap; reviewing a 12-file diff is expensive. If the plan is wrong (the agent misunderstood the requirement, or chose the wrong abstraction), you catch it in thirty seconds instead of after it has touched the whole codebase. For anything that spans more than two or three files, plan-then-execute is the default, not the exception.
| Task size | Approach |
|---|---|
| One-line fix, obvious change | Just let it execute |
| Single file, clear requirement | Execute, review the diff |
| 2–3 files | Quick plan, then execute |
| Multi-file feature | Detailed plan, approve, then execute |
| Architecture change, migration | Plan, discuss, break into supervised steps |
Self-Verification Loops
The defining feature of a good agentic workflow is that the agent checks its own work and you do not have to babysit every step. This only happens if you give it the means to verify.
Tell the agent how to verify, ideally in your rule file so you do not repeat it:
After every change, run:
pnpm typecheck && pnpm test && pnpm lint
Do not consider the task done until all three pass.
If a test fails, read the failure, fix the cause, and re-run.
Do not delete or skip a failing test to make it pass.
That last line matters. An agent under pressure to "make the tests pass" will sometimes take the shortcut of deleting the failing assertion or adding .skip. Explicitly forbid it. The verification loop is only valuable if the agent fixes the code, not the test.
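The rule above amounts to running the checks in order and stopping at the first failure. A minimal sketch, assuming a hypothetical CommandRunner that would shell out in a real setup (it is injected here so the logic stays testable); verify and VerifyOutcome are illustrative names, not part of any tool:

```typescript
type CheckResult = { ok: boolean; output: string };

// Hypothetical: a real runner would execute the command and capture its output.
type CommandRunner = (command: string) => CheckResult;

type VerifyOutcome =
  | { passed: true }
  | { passed: false; failedCommand: string; output: string };

// Same commands and order as the rule in the text.
const CHECKS: string[] = ["pnpm typecheck", "pnpm test", "pnpm lint"];

function verify(run: CommandRunner, checks: string[] = CHECKS): VerifyOutcome {
  for (const command of checks) {
    const result = run(command);
    if (!result.ok) {
      // Stop at the first failure, like `&&` in the shell, and keep the
      // output so the agent has a specific message to act on.
      return { passed: false, failedCommand: command, output: result.output };
    }
  }
  return { passed: true };
}
```

Returning the failing command together with its output gives the agent a specific signal to fix rather than a generic "build failed".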
What Makes a Good Verification Signal
- Fast — if the test suite takes ten minutes, the agent's loop is painfully slow. Point it at the relevant subset first.
- Deterministic — flaky tests confuse the agent; it will "fix" code that was never broken.
- Specific — a failing test with a clear message ("expected 429, got 200") guides the agent. A generic "build failed" leaves it guessing.
- Layered — typecheck catches type errors fast, tests catch logic errors, lint catches style. Each layer narrows the problem.
Tool Use
What separates an agent from a chatbot is tools — the actions it can take in your environment. Common tools and the discipline each requires:
| Tool | What it enables | Guardrail |
|---|---|---|
| File read/write | Edit code directly | Scope to the repo, not the whole disk |
| Shell execution | Run tests, builds, git | Restrict destructive commands |
| Git | Commit, branch, diff | Let it commit; you decide when to push |
| Web/docs fetch | Look up current API docs | Verify what it retrieves |
| MCP servers | Query DB, read issues | Read-only by default (see Context Engineering) |
The more tools an agent has, the more it can do unattended — and the more it can do wrong unattended. A read-only agent is safe but limited. A shell-enabled agent is powerful but can run rm -rf if it misreads a situation. Match the tool grant to your trust level for the task.
When to Let It Run Unattended vs Supervise
This is the central judgment call of agentic development. The deciding factors are reversibility and blast radius.
| Let it run unattended | Supervise closely |
|---|---|
| Changes are easy to revert (git) | Touches production or live data |
| Scoped to a feature branch | Modifies CI/CD, deploy, or infra |
| Strong test coverage exists | Touches auth, payments, or security |
| You'll review the full diff after | Deletes data or files irreversibly |
| Well-understood, repetitive task | Novel architecture decision |
| Low cost if it goes wrong | Cannot easily undo the result |
A useful default: let agents run unattended inside a sandbox of reversibility — a clean feature branch, a local environment, a dataset you can restore. The git history is your undo button. The moment a task escapes that sandbox — it would deploy something, charge a card, or drop a table — you supervise every step.
The "Walk Away" Test
Before letting an agent run while you get coffee, ask: if this goes completely wrong, how bad is the worst case, and how long to undo it? If the answer is "I revert a branch, thirty seconds," walk away. If the answer is "I'm restoring a production database from backup," do not walk away — supervise, or do not let the agent near it at all.
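The test can be written down as a small decision function. This is a sketch under stated assumptions: the TaskRisk fields paraphrase the table rows above, and the five-minute undo threshold is an arbitrary placeholder, not a recommendation from the text.

```typescript
// Hypothetical task description; field names are illustrative.
interface TaskRisk {
  onFeatureBranch: boolean;
  touchesProductionOrLiveData: boolean;
  touchesSensitiveArea: boolean; // auth, payments, security, CI/CD, infra
  irreversible: boolean; // deploys, charges a card, drops a table
  minutesToUndo: number; // your own worst-case estimate
}

type Supervision = "walk-away" | "supervise";

// maxUndoMinutes is an assumed threshold; tune it to your own tolerance.
function walkAwayTest(task: TaskRisk, maxUndoMinutes = 5): Supervision {
  const escapesSandbox =
    !task.onFeatureBranch ||
    task.touchesProductionOrLiveData ||
    task.touchesSensitiveArea ||
    task.irreversible;

  if (escapesSandbox || task.minutesToUndo > maxUndoMinutes) {
    return "supervise";
  }
  return "walk-away";
}
```

Any single risk factor is enough to require supervision; nothing in the function lets a strong factor be offset by a weak one.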
Guardrails
Guardrails are the constraints that keep an autonomous agent from causing damage. Layer them:
- Branch isolation. Agents work on feature branches, never directly on main. The branch is disposable.
- No force-push, no direct deploy. The agent can commit; pushing to shared branches and deploying stay manual.
- Command allowlists/denylists. Many tools let you pre-approve safe commands (pnpm test) and block dangerous ones (rm -rf, git push --force, DROP TABLE).
- Read-only external access. Database and API connections are read-only unless a write is the explicit point of the task.
- A human merge gate. No agent diff reaches main without a human reviewing and merging it.
# Example: commands the agent may run without asking
allow: pnpm test, pnpm typecheck, pnpm lint, git add, git commit, git diff
# Commands that always require human confirmation
deny: git push, rm -rf, DROP, DELETE FROM, curl to prod, kubectl
Guardrails are not about distrusting the model. They are about bounding the cost of its mistakes. Even an agent that gets things right 95% of the time gets them wrong one time in twenty, and guardrails ensure that a mistake costs a reverted branch, not a deleted database.
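The allow/deny example above can be modelled as a classifier that defaults to asking a human. A minimal sketch: classifyCommand and the pattern lists are hypothetical and do not correspond to any specific tool's configuration format. Anything not explicitly allowed, including the "curl to prod" case, falls through to confirmation.

```typescript
type Decision = "allow" | "confirm";

// Token prefixes the agent may run without asking (from the example above).
const ALLOWED: string[][] = [
  ["pnpm", "test"], ["pnpm", "typecheck"], ["pnpm", "lint"],
  ["git", "add"], ["git", "commit"], ["git", "diff"],
];

// Patterns that always need a human, wherever they appear in the command.
const DENIED: RegExp[] = [
  /\bgit\s+push\b/, /\brm\s+-rf\b/, /\bDROP\b/i, /\bDELETE\s+FROM\b/i, /\bkubectl\b/,
];

// Shell operators that could smuggle a second command past the allowlist.
const CHAINING = /[;&|`$<>]/;

function classifyCommand(command: string): Decision {
  const trimmed = command.trim();
  if (DENIED.some((pattern) => pattern.test(trimmed))) return "confirm";
  if (CHAINING.test(trimmed)) return "confirm";

  // Allow only when the leading tokens match an allowed prefix exactly.
  const tokens = trimmed.split(/\s+/);
  const allowed = ALLOWED.some((prefix) =>
    prefix.every((token, i) => tokens[i] === token)
  );
  return allowed ? "allow" : "confirm";
}
```

The sketch deliberately errs toward false positives: a commit message containing the word "drop" would trigger a confirmation, which costs a keypress rather than a table.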
Reviewing Agent Diffs
An agent can produce a 400-line diff across a dozen files in two minutes. Reviewing it is now your primary job, and it is harder than reviewing a human PR because the author cannot meaningfully explain its intent.
Review agent output with extra suspicion in these areas:
- Scope creep. Agents often "helpfully" refactor unrelated code, rename things, or reformat files you did not ask them to touch. Check that the diff does only what you asked. Unrelated changes go back out.
- Deleted or weakened tests. Search the diff for removed assertions, .skip, .only, or commented-out tests. This is the most common way an agent fakes success.
- New dependencies. Agents add packages freely. Verify each new dependency is necessary, maintained, and not a hallucinated name.
- Suppressed errors. Look for try/catch blocks that swallow errors, // @ts-ignore, or eslint-disable added to make checks pass.
- The parts that "look done." Well-formatted code with passing tests still needs a logic review. Passing tests prove the code does what the tests check, not that the tests check the right things.
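Some of the checks in the list above are mechanical enough to automate as a first pass. The sketch below scans unified diff text for a few of those patterns; scanDiff and the rule names are hypothetical, the regular expressions are rough heuristics, and the reported line numbers refer to lines of the diff text, not of the source file. It supplements a human review and does not replace one.

```typescript
type Rule = { kind: string; pattern: RegExp };
type Finding = { line: number; kind: string; text: string };

// Suspicious things an agent may have ADDED to make checks pass.
const ADDED_RULES: Rule[] = [
  { kind: "skipped-or-focused-test", pattern: /\.(skip|only)\b/ },
  { kind: "suppressed-type-error", pattern: /@ts-ignore/ },
  { kind: "suppressed-lint", pattern: /eslint-disable/ },
  { kind: "empty-catch", pattern: /catch\s*(\([^)]*\))?\s*\{\s*\}/ },
];

// Suspicious things an agent may have REMOVED.
const REMOVED_RULES: Rule[] = [
  { kind: "removed-assertion", pattern: /\b(expect|assert)\b/ },
];

function scanDiff(diff: string): Finding[] {
  const findings: Finding[] = [];
  diff.split("\n").forEach((text, index) => {
    const header = text.slice(0, 3);
    if (header === "+++" || header === "---") return; // file headers, not content

    const marker = text.charAt(0);
    const rules: Rule[] =
      marker === "+" ? ADDED_RULES : marker === "-" ? REMOVED_RULES : [];

    for (const rule of rules) {
      if (rule.pattern.test(text)) {
        findings.push({ line: index + 1, kind: rule.kind, text });
      }
    }
  });
  return findings;
}
```

A removed assertion is not always wrong, since the test may have been rewritten, so each finding is a prompt to look, not a verdict.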
A practical technique: have the agent summarize its own diff before you read it.
Summarize the change you just made:
- What files changed and why
- Any decisions where you were uncertain
- Anything you changed beyond the original request
- What is NOT covered that a reviewer should check
The summary orients your review, and the "where you were uncertain" prompt surfaces exactly the spots that need human eyes.
Parallel Agents
Because agents work on branches, you can run several at once on independent tasks — three terminals, three agents, three features. This multiplies throughput but introduces new failure modes.
| Parallel agents work well when | Parallel agents cause pain when |
|---|---|
| Tasks are independent | Tasks touch the same files |
| Each has its own branch/worktree | They share a working directory |
| Tasks are clearly scoped | Boundaries are fuzzy and overlap |
| You review each diff separately | You try to review all at once |
The biggest practical issue is merge conflicts and overlapping edits. Two agents editing package.json or the same module will collide. Use separate git worktrees so each agent has its own checkout, and assign tasks that touch disjoint parts of the codebase. Do not run parallel agents on tasks you have not clearly separated — the time spent untangling their conflicting changes erases the throughput gain.
A realistic parallel setup: one agent writing tests for an existing module, one updating documentation, one implementing a small isolated feature. Three non-overlapping tasks, three branches, three diffs you review independently. What does not work is "five agents, all improving the codebase" — with no boundaries, they fight each other.
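One way to check that tasks are disjoint before starting parallel agents is to list the paths each task is expected to touch and look for pairs that share any. A minimal sketch, assuming you can estimate those paths up front; AgentTask, findConflicts, and the example paths are hypothetical.

```typescript
// Hypothetical: each task declares the paths it is expected to edit.
interface AgentTask {
  name: string;
  paths: string[];
}

type Conflict = { first: string; second: string; sharedPaths: string[] };

function findConflicts(tasks: AgentTask[]): Conflict[] {
  const conflicts: Conflict[] = [];
  tasks.forEach((a, i) => {
    // Compare each task only with the ones after it, so a pair is reported once.
    for (const b of tasks.slice(i + 1)) {
      const sharedPaths = a.paths.filter((p) => b.paths.indexOf(p) !== -1);
      if (sharedPaths.length > 0) {
        conflicts.push({ first: a.name, second: b.name, sharedPaths });
      }
    }
  });
  return conflicts;
}

const conflicts = findConflicts([
  { name: "tests", paths: ["src/orders/orders.test.ts"] },
  { name: "docs", paths: ["docs/orders.md", "package.json"] },
  { name: "feature", paths: ["src/export/csv.ts", "package.json"] },
]);
// Here "docs" and "feature" both list package.json, so they should not run in parallel as scoped.
```

An empty result only means the declared paths do not overlap; an agent that strays outside its declared scope can still collide with another.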
A Realistic Agentic Session
Putting it together, here is a well-run agentic task:
- State the goal with verification built in. "Add a soft delete to the Orders model. Migration, repository method, tests. Run pnpm test and pnpm typecheck before you're done."
- Ask for a plan. Review it. The agent proposes editing the wrong repository file; you correct it. Thirty seconds of review prevents a wrong diff.
- Let it execute on a feature branch. It writes the migration, adds the method, writes tests, runs them, hits a type error, fixes it, re-runs, passes.
- Read the agent's self-summary. It flags that it was unsure whether soft-deleted orders should still appear in the admin export. Good catch; you answer.
- Review the full diff. You find it also reformatted an unrelated file (scope creep) and added a console.log. You have it remove both.
- Run the tests yourself. Confirm green. Then you push and open the PR.
The agent did the mechanical work and verified its own changes. You supplied the goal, the judgment, and the merge decision. The agent never had the final say.
The Bottom Line
Agentic workflows are the highest-leverage and highest-risk mode of AI-assisted development. The leverage comes from autonomy and self-verification; the risk comes from the same autonomy operating without judgment. Get the leverage safely by separating planning from execution, giving the agent a way to verify its own work, bounding it with guardrails and reversibility, and reviewing every diff as if a confident junior wrote it.
Let the agent run. But keep the merge button — and the judgment behind it — firmly in human hands.