Steven's Knowledge

Planning & Reasoning

How agents decide what to do next, and why that's still the hardest part

The capability that separates a useful agent from a clever chatbot is planning: looking ahead, decomposing a goal into actions, and adjusting when something doesn't go as expected. Models have gotten dramatically better at this, but it's still where most agent failures happen.

Reasoning Models vs Standard Models

Reasoning models (Claude with extended thinking, OpenAI's o-series, Gemini with thinking enabled) are trained to generate long internal reasoning before producing an answer. They're meaningfully better at:

  • Multi-step math and logic.
  • Code with non-obvious bugs.
  • Planning under constraints.
  • Self-correction.

They're slower and more expensive. The right pattern is usually: route hard, plan-shaped problems to reasoning models; route everything else to fast models.
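
The routing idea can be sketched in a few lines. This is a minimal illustration, not a recommended heuristic: the model identifiers, the `Task` fields, and `choose_model` are all hypothetical names, and a real router would likely derive these flags from a classifier or from task metadata.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider exposes.
REASONING_MODEL = "reasoning-model"
FAST_MODEL = "fast-model"


@dataclass
class Task:
    prompt: str
    # Hypothetical flags marking a task as "plan-shaped".
    needs_multi_step_logic: bool = False
    has_hard_constraints: bool = False
    is_debugging: bool = False


def choose_model(task: Task) -> str:
    """Send plan-shaped work to the reasoning model, everything else to the fast one."""
    plan_shaped = (
        task.needs_multi_step_logic
        or task.has_hard_constraints
        or task.is_debugging
    )
    return REASONING_MODEL if plan_shaped else FAST_MODEL
```

Defaulting to the fast model keeps cost and latency low; a task pays for the reasoning model only when something marks it as hard.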

Planning Patterns

  • Implicit planning — the model figures out the next step turn by turn. Cheap, works for shallow tasks.
  • Explicit planning — the model writes a plan first, then executes it. Better for tasks with structure; the plan itself is debuggable.
  • Hierarchical planning — high-level plan with placeholder steps that get expanded into sub-plans. The right pattern when sub-tasks are themselves complex.

Explicit plans pay off whenever you want to inspect, edit, or resume.
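
One way to make a plan inspectable and resumable is to store it as plain data. The sketch below assumes a flat list of steps with a status field; `Step`, `Plan`, and `run` are illustrative names, and `execute` stands in for whatever actually performs a step.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    status: str = "pending"  # "pending" or "done"


@dataclass
class Plan:
    goal: str
    steps: List[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # A serialized plan can be read, edited by hand, and stored.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "Plan":
        data = json.loads(raw)
        return cls(goal=data["goal"], steps=[Step(**s) for s in data["steps"]])


def run(plan: Plan, execute: Callable[[Step], None],
        max_steps: Optional[int] = None) -> None:
    """Execute pending steps in order, skipping anything already done."""
    executed = 0
    for step in plan.steps:
        if step.status == "done":
            continue
        if max_steps is not None and executed >= max_steps:
            break
        execute(step)
        step.status = "done"
        executed += 1


# Usage: run one step, save the plan, then resume from the saved copy.
log: List[str] = []
plan = Plan("Ship release", [Step("run tests"), Step("tag version"), Step("publish")])
run(plan, lambda s: log.append(s.description), max_steps=1)
resumed = Plan.from_json(plan.to_json())
run(resumed, lambda s: log.append(s.description))
```

Because completed steps are recorded in the plan itself, the resumed run picks up at "tag version" rather than repeating "run tests".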

Replanning

The world doesn't always cooperate. A robust agent has to:

  • Detect when the current plan is failing (errors, no progress, conflicting evidence).
  • Decide whether to retry the step, modify the plan, or ask for help.
  • Update state coherently when the plan changes.

Many agent failures aren't failures of the planning step; they're failures to notice that the plan stopped working.
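
The detect-and-decide loop above can be sketched as a small monitor that watches step outcomes. Everything here is an assumption made for illustration: the `StepOutcome` fields, the thresholds, and the escalation order (retry, then replan, then ask for help).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    RETRY = "retry"
    REPLAN = "replan"
    ASK_FOR_HELP = "ask_for_help"


@dataclass
class StepOutcome:
    error: Optional[str]            # error message, or None on success
    made_progress: bool             # did the step move toward the goal?
    contradicts_plan: bool = False  # evidence that a plan assumption is wrong


@dataclass
class Monitor:
    max_retries: int = 2
    max_stalled_steps: int = 3
    max_replans: int = 2
    retries: int = 0
    stalled: int = 0
    replans: int = 0

    def decide(self, outcome: StepOutcome) -> Action:
        # Conflicting evidence invalidates the plan; retrying will not help.
        if outcome.contradicts_plan:
            return self._replan_or_escalate()
        if outcome.error is not None:
            if self.retries < self.max_retries:
                self.retries += 1
                return Action.RETRY
            return self._replan_or_escalate()
        self.retries = 0
        if outcome.made_progress:
            self.stalled = 0
            return Action.CONTINUE
        # The step succeeded but nothing moved: count it as a stall.
        self.stalled += 1
        if self.stalled >= self.max_stalled_steps:
            return self._replan_or_escalate()
        return Action.CONTINUE

    def _replan_or_escalate(self) -> Action:
        if self.replans >= self.max_replans:
            return Action.ASK_FOR_HELP
        self.replans += 1
        self.retries = 0
        self.stalled = 0
        return Action.REPLAN
```

The stall counter is what catches the quiet failure mode: steps that succeed individually while the task as a whole goes nowhere.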

Self-Reflection

Having the model review its own work after a step or a task can catch mistakes that are obvious in retrospect. The pattern:

  1. Produce a result.
  2. Critique the result against the original goal and any rubrics.
  3. If issues are found, revise.

This costs more tokens and adds latency. Use it where the task is high-stakes or where verification is cheap (e.g., did the tests pass?).
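
The three steps can be sketched as a bounded loop. In a real agent, `produce`, `critique`, and `revise` would each be model calls; here they are plain callables (hypothetical names) so the control flow is easy to see, and the toy rubric is only a stand-in.

```python
from typing import Callable, List


def reflect_and_revise(
    produce: Callable[[], str],
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 2,
) -> str:
    """Produce a result, then critique and revise it until clean or out of rounds."""
    result = produce()
    for _ in range(max_rounds):
        issues = critique(result)
        if not issues:
            break
        result = revise(result, issues)
    return result


# Toy rubric standing in for a model-written critique.
def toy_critique(text: str) -> List[str]:
    issues = []
    if not text[0].isupper():
        issues.append("capitalize first letter")
    if not text.endswith("."):
        issues.append("end with a period")
    return issues


def toy_revise(text: str, issues: List[str]) -> str:
    if "capitalize first letter" in issues:
        text = text[0].upper() + text[1:]
    if "end with a period" in issues:
        text += "."
    return text


final = reflect_and_revise(lambda: "the tests passed", toy_critique, toy_revise)
```

The `max_rounds` cap matters: without it, a critic that always finds something to fix would keep spending tokens indefinitely.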

Verification Beats Generation

The most reliable agents pair generation with verification:

  • Generate code → run tests.
  • Generate SQL → run on a sandboxed snapshot.
  • Generate a plan → simulate or dry-run before executing.

A model checking another model's work is also legitimate verification, especially when the verifier has access to ground-truth signals the generator doesn't.
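
As one concrete case, the "generate SQL, run it on a sandboxed snapshot" pairing can be sketched with Python's built-in `sqlite3` module. The `verify_sql` helper is a hypothetical name, and an in-memory copy stands in for whatever snapshot mechanism a production database would offer.

```python
import sqlite3
from typing import Optional, Tuple


def verify_sql(source: sqlite3.Connection, sql: str) -> Tuple[bool, Optional[str]]:
    """Run `sql` against an in-memory copy of `source`; the original is never modified."""
    snapshot = sqlite3.connect(":memory:")
    try:
        source.backup(snapshot)  # copy schema and data into the sandbox
        snapshot.execute(sql)
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        snapshot.close()  # the sandbox and any changes to it are discarded


# Usage: a destructive statement and a broken one, both checked safely.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('ada')")
db.commit()

ok, err = verify_sql(db, "DELETE FROM users")
bad_ok, bad_err = verify_sql(db, "SELECT nme FROM users")
remaining = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

This only shows that the statement executes without error. Checking that it does the right thing still needs assertions on the snapshot's contents or a human review.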

The Limits Today

Even strong reasoning models still struggle with very long-horizon tasks: keeping a plan coherent across hours of work, recovering from cascading errors, and noticing when the original goal has shifted. That's why bounded, supervised agents remain the dominant production pattern.
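
What "bounded" and "supervised" mean in practice can be sketched as a step budget plus an approval gate. This is a rough illustration with hypothetical names (`ProposedAction`, `run_bounded`, `approve`); real systems would also bound time, cost, and the set of tools available.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ProposedAction:
    name: str
    risky: bool = False  # e.g. deletes data, spends money, sends email


def run_bounded(
    actions: Iterable[ProposedAction],
    approve: Callable[[ProposedAction], bool],
    max_steps: int = 5,
) -> List[str]:
    """Run actions until the step budget is spent or a risky action is refused."""
    log: List[str] = []
    for i, action in enumerate(actions):
        if i >= max_steps:
            log.append("stopped: step budget exhausted")
            break
        # Supervision: risky actions need an explicit yes from the approver.
        if action.risky and not approve(action):
            log.append(f"stopped: {action.name} not approved")
            break
        log.append(f"ran: {action.name}")
    return log
```

Here `approve` is where a human (or a stricter policy check) sits in the loop; the budget ensures a drifting agent stops on its own even if nobody is watching.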
