Steven's Knowledge

Computer Use & Browser Automation

Agents that operate UIs the way humans do — and why the patterns differ from API-based agents

Computer-use agents drive a screen — moving the cursor, typing into fields, clicking buttons — instead of calling APIs. Browser automation is the most common form, with general desktop control catching up. The category opens up huge product surfaces: any system that has a UI but lacks an API becomes accessible to an agent.

Why It Matters

Most enterprise software is still API-poor. Internal admin panels, government portals, legacy ERPs, vendor dashboards — vast amounts of work happen behind UIs that no integration ever covered. A capable computer-use agent removes that wall: if a human can do it in a browser, the agent can do it.

How It Works

Two broad patterns:

Vision-grounded control. A vision-language model sees the screen, identifies elements visually, and produces actions (move mouse to (x, y), click, type). Frontier examples: Anthropic's Claude computer use, OpenAI's CUA (the model behind Operator), browser-use, plus a wave of startups.

DOM-grounded control (browser-only). The agent reads the structured DOM — element IDs, ARIA labels, accessibility tree — and acts via direct selectors. Faster, more reliable when the DOM is clean, useless when content is rendered as images or canvases.

The strongest systems combine both: use the DOM when it's reliable, fall back to vision when it isn't.
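The DOM-first, vision-fallback idea can be sketched as a small dispatcher. This is a minimal illustration, not any framework's API: `Target`, `locate`, `dom_lookup`, and `vision_lookup` are hypothetical names, and the two lookups stand in for whatever DOM query and vision model a real stack would use.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Target:
    """Where to act: a DOM selector, or pixel coordinates from vision."""
    method: str                              # "dom" or "vision"
    selector: Optional[str] = None
    xy: Optional[tuple[int, int]] = None


def locate(
    description: str,
    dom_lookup: Callable[[str], Optional[str]],
    vision_lookup: Callable[[str], Optional[tuple[int, int]]],
) -> Optional[Target]:
    """Prefer a DOM selector; fall back to vision when the DOM has no match."""
    selector = dom_lookup(description)
    if selector is not None:
        return Target(method="dom", selector=selector)
    xy = vision_lookup(description)          # hypothetical vision-model call
    if xy is not None:
        return Target(method="vision", xy=xy)
    return None                              # neither method found the element
```

Returning the method alongside the target lets the caller log how often vision was needed, which is a useful signal for how clean a given site's DOM is.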

The Action Loop

Every turn looks roughly like:

  1. Capture the current state — screenshot, DOM, URL, recent action history.
  2. Decide the next action — click, type, scroll, wait, navigate, finish.
  3. Execute it.
  4. Wait for the page to settle (network idle, animations done).
  5. Capture again. Repeat.
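The five steps above can be sketched as a loop over injected callables. The names `Action`, `RunLog`, and `run_loop` are illustrative, and the `max_turns` cap is an added assumption (the steps themselves do not mention one) so that a confused agent cannot loop forever.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str        # "click", "type", "scroll", "wait", "navigate", "finish"
    arg: str = ""


@dataclass
class RunLog:
    actions: list = field(default_factory=list)
    finished: bool = False


def run_loop(capture, decide, execute, settle, max_turns=20) -> RunLog:
    """Capture, decide, execute, settle, repeat until "finish" or the cap."""
    log = RunLog()
    for _ in range(max_turns):
        state = capture()                      # 1. screenshot, DOM, URL, ...
        action = decide(state, log.actions)    # 2. model picks the next action
        if action.kind == "finish":
            log.finished = True
            break
        execute(action)                        # 3. perform it
        settle()                               # 4. wait for the page to settle
        log.actions.append(action)             # 5. next turn captures again
    return log
```

Passing the action history into `decide` mirrors step 1, where recent actions are part of the captured state.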

The "wait for the page to settle" step is where most reliability bugs live.
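One common settle heuristic is to poll a cheap fingerprint of the page (for example a hash of the DOM or of a downscaled screenshot) and proceed only once it stops changing. The sketch below assumes a hypothetical `fingerprint` callable; the thresholds are illustrative defaults, not recommendations.

```python
import time
from typing import Callable


def wait_until_settled(
    fingerprint: Callable[[], str],
    stable_polls: int = 3,
    interval_s: float = 0.2,
    timeout_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the fingerprint is unchanged for `stable_polls` polls.

    Returns False if the timeout passes first, so the caller can tell
    "settled" apart from "gave up waiting".
    """
    deadline = clock() + timeout_s
    last = fingerprint()
    unchanged = 0
    while clock() < deadline:
        sleep(interval_s)
        current = fingerprint()
        if current == last:
            unchanged += 1
            if unchanged >= stable_polls:
                return True
        else:
            unchanged = 0          # page is still moving; start counting again
            last = current
    return False
```

The explicit False return matters: treating a timeout as success is exactly how "still loading" gets mistaken for "task done". In a Playwright-based stack, a built-in wait such as `page.wait_for_load_state("networkidle")` can complement a check like this.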

What's Hard

  • Visual grounding. Pinpointing the right pixel coordinates from a screenshot is genuinely hard. Even strong frontier models miss small targets, dense UIs, and elements in unusual states.
  • Dynamic content. Spinners, lazy loading, infinite scroll, modals appearing unexpectedly. Easy to mistake "still loading" for "task done."
  • State and recovery. When the agent gets confused mid-task, recovering in place is much harder than restarting, but the abandoned attempt may have already submitted a form, so a restart can repeat a side effect.
  • Speed. Even a fast computer-use agent is dramatically slower than direct API integration. Plan budgets and parallelism accordingly.
  • Authentication. Login flows, MFA, captchas, session expiration. A surprising amount of agent engineering is just session management.

Sandbox and Containment

Letting an agent drive a real machine is a security decision, and it needs an explicit security model. Practical containment:

  • Headless browsers in containers. Each agent run gets a fresh, isolated browser environment.
  • Per-task credentials. Don't give the agent your master account; provision scoped credentials per task.
  • Read-only first. Whenever the task allows, run with permissions that can't mutate.
  • Confirmation gates for irreversible actions — payments, sends, deletions. Have a human confirm before the agent commits.
  • Egress controls. Limit which sites the agent can navigate to.
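Two of the controls above, egress limits and confirmation gates, can be sketched as a single policy check that runs before every action. The host names, the set of irreversible action names, and the `confirm` callback are all hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical policy data; a real deployment would load this per task.
ALLOWED_HOSTS = {"portal.example.com", "admin.example.com"}
IRREVERSIBLE = {"pay", "send", "delete"}


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS URLs whose host exactly matches the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts


def authorize(action_kind: str, url: str, confirm) -> bool:
    """Block off-list sites; require human confirmation for irreversible actions."""
    if not egress_allowed(url):
        return False
    if action_kind in IRREVERSIBLE:
        return bool(confirm(action_kind, url))   # human-in-the-loop gate
    return True
```

Matching the host exactly, rather than by suffix or substring, avoids lookalike domains such as `portal.example.com.evil.net` slipping through.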

The Prompt Injection page applies doubly here: any web page the agent sees can try to instruct it.

Patterns That Work

  • Decompose first. Have the agent plan the steps before executing. Plans are debuggable; raw click traces aren't.
  • Self-verify. After each significant action, have the agent confirm the page state matches what it expected.
  • Hybrid execution. Use API access where it exists; fall back to UI control only for the parts that need it.
  • Replays for tests. Save successful traces and replay them as regression tests; failures often manifest as deviations from a known good run.
  • Human handoff. When the agent gets stuck, hand off cleanly to a human rather than thrashing.
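Planning, self-verification, and handoff fit together naturally: each planned step carries its own expectation, and a step that keeps failing verification escalates instead of thrashing. The sketch below uses hypothetical names (`Step`, `NeedsHuman`, `run_plan`), and the retry count is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    act: Callable[[], None]       # performs the UI action
    expect: Callable[[], bool]    # True when the page looks as planned


class NeedsHuman(Exception):
    """Raised when a step keeps failing verification."""


def run_plan(steps, max_attempts=2):
    """Run steps in order, verifying each one; escalate when one stays stuck."""
    done = []
    for step in steps:
        for _ in range(max_attempts):
            step.act()
            if step.expect():     # self-verify after the action
                done.append(step.name)
                break
        else:
            # No attempt passed verification: hand off with context.
            raise NeedsHuman(f"stuck at step {step.name!r} after {done}")
    return done
```

Retrying `act` is only safe for actions that can be repeated without side effects; a step that submits a form would need a check-before-retry rather than a blind second attempt.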

Browser-Specific Tooling

For browser automation specifically:

  • Playwright and Puppeteer are the dominant low-level drivers. Most agent stacks build on one of them.
  • browser-use, Stagehand, Skyvern — open-source agent frameworks specifically for browser control.
  • Cloud browsers — Browserbase, Anthropic-managed browsers, Hyperbrowser — solve the "I need a thousand browsers running concurrently" problem.

Where It Works Today

Computer-use agents work well when:

  • The task is narrow and well-defined.
  • The target UI is reasonably stable.
  • A failure can be detected and either retried or escalated.
  • You can afford the latency (tasks take seconds to minutes, not milliseconds).
  • The cost of failure is bounded.

They struggle on long, branching, multi-application workflows where small errors compound. That is the current frontier, and also the area improving most rapidly.

What to Watch

  • Browser-native models — models specifically post-trained on UI screenshots and DOMs.
  • Memory across sessions — agents that learn the UIs of common applications and don't have to re-discover them every run.
  • Multi-app workflows — tasks that span browsers, desktop apps, and terminals.
  • Standards — emerging protocols for "agent-friendly" web pages that expose semantic actions directly.
