Using AI to review code and audit security — effective review prompts, what AI catches versus misses, bug-hunting, the limits of security review, false positives, and combining with human review

AI Code Review

AI is a genuinely useful reviewer. It reads a diff faster than you do, never gets tired on the 40th file, and catches the boring-but-real mistakes — the unhandled null, the swapped arguments, the missing await — that a human skims past at 5pm. Used well, it raises the floor on code quality before a human ever looks.

But it is a reviewer with specific blind spots and a habit of confident nonsense. It will flag a non-issue with the same certainty it uses for a real bug, and it will sail past a logic error that breaks your billing because it does not know what "correct" means for your business. This page is about getting the value — a faster, more thorough first pass — without trusting it to be the last pass.

Where AI Review Fits

AI review is a layer, not a replacement. The most effective setup runs it before human review, so humans spend their attention on the things only humans can judge.

Layer	Catches	Speed
Automated checks (lint, types, tests)	Mechanical, rule-based issues	Instant
AI review	Likely bugs, smells, missing cases, security patterns	Seconds
Human review	Intent, design, business correctness, trade-offs	Minutes–hours

AI sits in the middle: broader and faster than a human, but shallower in judgment. It clears the noise so the human reviewer is not distracted by the obvious stuff and can focus on whether the change is the right change.

Review Prompts That Work

A vague "review this code" gets a vague answer — usually a polite summary of what the code does, which is useless. Direct the review at specific failure classes.

Review this diff as a senior engineer. For each issue, give:
- File and line
- Severity (critical / major / minor / nit)
- Why it's a problem
- A concrete fix

Focus on, in priority order:
1. Correctness bugs and unhandled edge cases
2. Security issues (injection, authz, secrets, unsafe input)
3. Race conditions and concurrency
4. Error handling and failure modes
5. Then style and readability

Do not comment on formatting that the linter handles.
If you find nothing critical, say so plainly — do not invent issues.

The structure matters. Asking for severity forces prioritization so you are not buried in nits. The priority order keeps it focused on what breaks production, not on bikeshedding variable names. And "do not invent issues" pushes back on the model's instinct to find something to say.

Give It the Context to Judge Correctly

A reviewer that cannot see the surrounding code is reviewing in a vacuum. Provide what a human reviewer would have:

Here is the diff:
[diff]

Here is the function it calls and the type it returns:
[paste callee + types]

The requirement: discounts only apply to orders over $500.
The project conventions are in CLAUDE.md.

Does the change correctly implement the requirement?
What edge cases around the $500 boundary are not handled?

With the requirement stated, the model can check correctness against intent rather than guessing. Without it, the model only knows whether the code is internally consistent — not whether it does the right thing.
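The boundary questions in that prompt can be made concrete. Below is a minimal TypeScript sketch, assuming amounts are stored as integer cents and a 10% discount rate. The rate and every name here (`discountCents`, `DISCOUNT_THRESHOLD_CENTS`, `boundaryCases`) are hypothetical; only the $500 threshold comes from the stated requirement.

```typescript
// Hypothetical names and rate, for illustration only.
const DISCOUNT_THRESHOLD_CENTS = 50000; // $500.00, from the stated requirement
const DISCOUNT_PERCENT = 10; // assumed rate, not part of the requirement

// Requirement: discounts only apply to orders OVER $500,
// so an order of exactly $500 gets no discount.
function discountCents(orderTotalCents: number): number {
  if (!Number.isInteger(orderTotalCents) || orderTotalCents < 0) {
    throw new RangeError("orderTotalCents must be a non-negative integer");
  }
  if (orderTotalCents <= DISCOUNT_THRESHOLD_CENTS) {
    return 0;
  }
  // Round down so the discount never exceeds the stated percentage.
  return Math.floor((orderTotalCents * DISCOUNT_PERCENT) / 100);
}

// The three boundary inputs a reviewer should ask about: [order total, expected discount].
const boundaryCases: Array<[number, number]> = [
  [49999, 0], // just under the threshold
  [50000, 0], // exactly $500: "over" excludes the boundary
  [50001, 5000], // just over: floor(5000.1) = 5000
];
```

Without the requirement in the prompt, `<` and `<=` are both internally consistent, which is why the model cannot choose between them on its own.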

What AI Catches Well

In practice, AI review is reliably good at a specific set of issues:

Unhandled edge cases — empty arrays, null/undefined, zero, negative numbers, the boundary value. AI is excellent at enumerating inputs you forgot.
Missing error handling — a promise without a .catch, a file read without a failure path, an API call assuming success.
Obvious security patterns — string-concatenated SQL, eval on user input, hardcoded secrets, cors({ origin: '*' }), missing input validation.
Inconsistencies — a function that returns null in one branch and undefined in another, mismatched error shapes.
Off-by-one and boundary errors — <= where < was meant, fencepost mistakes in loops and slices.
Copy-paste bugs — the block that was duplicated and edited but missed one variable rename.

These are exactly the mistakes humans make and then fail to spot in their own code, because they read what they meant to write. AI reads what is actually there.
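As an illustration of this class of mistake, here is a small hypothetical function (not taken from any real codebase) that contains both an off-by-one and an unhandled empty input, followed by one possible fix. It is a sketch of what a reviewer would flag, not a definitive implementation.

```typescript
// Buggy: `<=` reads one element past the requested count (off-by-one),
// and an empty array produces NaN instead of a defined result.
// Example: [1, 2, 3] with count 2 returns 3, not 1.5.
function averageOfFirstBuggy(values: number[], count: number): number {
  let sum = 0;
  for (let i = 0; i <= count; i++) {
    sum += values[i];
  }
  return sum / count;
}

// One possible fix: clamp the count to the array length, use `<`,
// and return null when there is nothing to average.
function averageOfFirst(values: number[], count: number): number | null {
  const n = Math.min(count, values.length);
  if (!(n > 0)) {
    return null; // covers empty input, zero or negative count, and NaN
  }
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += values[i];
  }
  return sum / n;
}
```

The author of the buggy version reads the loop as "the first `count` items" because that was the intent; a reviewer with no stake in the intent reads the `<=`.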

What AI Misses

The misses are where the danger lives, because they are invisible by definition — the AI says nothing and you assume there was nothing to say.

Business-logic errors. The code is clean, the tests pass, and it computes GST at 10% instead of 15%. AI does not know your domain rules unless you tell it, and even then it cannot judge whether the rule itself is right.
Architectural problems. AI reviews the diff in front of it. It will not tell you that this is the third place you have reimplemented the same caching logic, or that this change locks you into a design you will regret.
Missing requirements. AI checks whether the code is correct; it cannot check whether the code does everything the ticket asked for. The feature can be flawless and half-built.
Cross-system effects. A change that is fine in isolation can still break a downstream consumer, violate an API contract, or double the database load. AI cannot see past the diff.
Subtle concurrency. AI catches obvious race conditions but misses the ones that require understanding your actual deployment, locking model, and traffic patterns.
"Why" problems. AI evaluates the code as written. It cannot tell you the entire approach is wrong and the feature should not be built this way.

The pattern: AI reviews the code; humans review the change. AI answers "is this code correct?" Humans answer "is this the right thing to build, in the right way, given everything else?"
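The GST example can be sketched in a few lines. The code below is a hypothetical illustration: the function names are invented, and the 15% figure is taken from the example above as the rule the business actually requires. Nothing in the code itself looks wrong, which is the point.

```typescript
// Clean, typed, tested, and wrong: the rate is 10% where the
// business rule (per the example above) is 15%.
const GST_RATE_PERCENT = 10;

function gstCents(netCents: number): number {
  return Math.round((netCents * GST_RATE_PERCENT) / 100);
}

// This check was derived from the code rather than from the tax rule,
// so it encodes the same wrong assumption and passes.
function gstCheckPasses(): boolean {
  return gstCents(10000) === 1000;
}
```

A reviewer who sees only this code has no basis to question the constant. Stating the rule in the prompt lets the model compare the two, but deciding whether the stated rule is itself right remains a human job.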

Bug-Hunting

Beyond reviewing a diff, AI is useful for actively hunting bugs in existing code. The trick is to make it adversarial rather than affirming.

This function handles payment refunds. Act as an attacker and a
chaos tester. Find every way this can produce a wrong result or
fail unsafely:
- What inputs break it?
- What happens on partial failure mid-operation?
- Can it double-refund? Refund more than was paid?
- What concurrent calls cause problems?

For each, show the exact input or sequence that triggers it.

Framing it as "find the ways this breaks" produces far more than "is this correct?", because the latter invites the model to reassure you. Asking for the triggering input forces concreteness: a bug you can reproduce is real, while a vague "this might fail" is often noise.
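To show what acting on such findings might look like, here is a minimal in-memory sketch of a refund function that guards against the over-refund and double-refund cases named in the prompt. All names (`PaymentRecord`, `applyRefund`, `refundKey`) are hypothetical, and the idempotency-key approach is one common option rather than the only one. The sketch assumes single-threaded access; a real system would need a database transaction or lock to cover concurrent calls.

```typescript
type PaymentRecord = {
  paidCents: number;
  refundedCents: number;
  seenRefundKeys: Set<string>; // idempotency keys of refunds already applied
};

type RefundResult =
  | { ok: true; refundedCents: number }
  | { ok: false; reason: "invalid-amount" | "exceeds-paid" };

function applyRefund(
  payment: PaymentRecord,
  amountCents: number,
  refundKey: string,
): RefundResult {
  if (!Number.isInteger(amountCents) || amountCents <= 0) {
    return { ok: false, reason: "invalid-amount" };
  }
  if (payment.seenRefundKeys.has(refundKey)) {
    // Replay of a request that was already applied: refund nothing more.
    return { ok: true, refundedCents: 0 };
  }
  if (payment.refundedCents + amountCents > payment.paidCents) {
    // Total refunds would exceed what was paid.
    return { ok: false, reason: "exceeds-paid" };
  }
  payment.refundedCents += amountCents;
  payment.seenRefundKeys.add(refundKey);
  return { ok: true, refundedCents: amountCents };
}
```

Each guard corresponds to a triggering sequence the prompt asks for: a repeated request with the same key, two refunds whose sum exceeds the payment, and a zero or negative amount.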

A complementary technique is differential review: paste two versions and ask what behavior changed. AI is good at spotting that a refactor "preserving behavior" actually changed an edge case.
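A small example of the kind of change differential review tends to surface: a "simplifying" refactor that swaps an explicit undefined check for `||`. The names (`ListOptions`, `pageLimitBefore`, and so on) and the default of 20 are hypothetical, chosen only for illustration.

```typescript
type ListOptions = { limit?: number };

// Before: falls back to the default only when no limit was provided.
function pageLimitBefore(opts: ListOptions): number {
  return opts.limit === undefined ? 20 : opts.limit;
}

// After: the "equivalent" one-liner. `||` falls back on every falsy value,
// so an explicit limit of 0 now silently becomes 20.
function pageLimitAfter(opts: ListOptions): number {
  return opts.limit || 20;
}

// Behavior-preserving version: `??` falls back only on null or undefined.
function pageLimitFixed(opts: ListOptions): number {
  return opts.limit ?? 20;
}
```

The two versions agree on every input except `{ limit: 0 }`, which is exactly the kind of difference a side-by-side comparison can be asked to enumerate.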

The Limits of AI Security Review

AI security review is worth running and dangerous to trust. It reliably catches the textbook vulnerabilities, the OWASP Top 10 shapes it has seen ten thousand times. It is far less reliable on the vulnerabilities that matter most, which are usually specific to your system.

AI finds reliably	AI misses
SQL/command injection patterns	Broken authorization logic specific to your roles
Hardcoded secrets and keys	Business-logic flaws (e.g. negative-quantity orders)
Obvious XSS sinks	Multi-step exploit chains
Weak crypto (MD5, `Math.random` for tokens)	Subtle auth bypasses across requests
Missing input validation	Race conditions in security-critical paths
`eval`, unsafe deserialization	Anything requiring knowledge of your data model
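For reference, this is the shape of a left-column finding. The sketch below contrasts string-concatenated SQL with a parameterized query. It uses the PostgreSQL-style `$1` placeholder; the `{ text, values }` object and the function names are illustrative assumptions, and your database driver may spell parameter binding differently.

```typescript
type ParameterizedQuery = { text: string; values: string[] };

// Textbook injection shape: user input is spliced into the SQL text,
// so input containing a quote can change the query's structure.
function findUserUnsafe(email: string): string {
  return "SELECT id FROM users WHERE email = '" + email + "'";
}

// Parameterized shape: the SQL text is constant and the input travels
// separately, so the driver never interprets it as SQL.
function findUser(email: string): ParameterizedQuery {
  return { text: "SELECT id FROM users WHERE email = $1", values: [email] };
}

// A classic hostile input for the unsafe version.
const hostileInput = "x' OR '1'='1";
```

Patterns like this are visible in the code itself, which is why a model can match them without knowing anything about your system.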

The critical limitation: many of the most damaging real-world flaws are in authorization and business logic, not textbook injection. "Can user A read user B's data by changing an ID in the URL?" is the kind of flaw that drains companies, and finding it requires understanding your permission model, which the AI does not have unless you supply it. Treat AI security review as a smoke detector for known-bad patterns, never as a substitute for a security engineer reviewing auth-critical code by hand.

And never let AI security review create false confidence. "The AI found no issues" means "no obvious issues" — it is not a clean bill of health. For anything touching authentication, authorization, payments, or sensitive data, a human with security expertise still reviews it.
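The "change an ID in the URL" flaw is worth seeing in code, because the vulnerable version contains no suspicious pattern at all. The sketch below uses an in-memory map in place of a database; the `Invoice` type, the sample data, and both function names are hypothetical.

```typescript
type Invoice = { id: string; ownerId: string; totalCents: number };

const invoices = new Map<string, Invoice>([
  ["inv-1", { id: "inv-1", ownerId: "user-a", totalCents: 1200 }],
  ["inv-2", { id: "inv-2", ownerId: "user-b", totalCents: 9900 }],
]);

// Vulnerable: any authenticated caller can read any invoice by supplying its ID.
// Nothing here matches a known-bad pattern; the flaw is the check that is absent.
function getInvoiceUnsafe(_callerId: string, invoiceId: string): Invoice | undefined {
  return invoices.get(invoiceId);
}

// Ownership enforced. Returning undefined for both "missing" and "not yours"
// avoids revealing which invoice IDs exist.
function getInvoice(callerId: string, invoiceId: string): Invoice | undefined {
  const invoice = invoices.get(invoiceId);
  if (invoice === undefined || invoice.ownerId !== callerId) {
    return undefined;
  }
  return invoice;
}
```

Whether ownership is the right rule (as opposed to team membership, an admin role, or a shared-link token) depends on a permission model that lives outside this file, so it is a question for a reviewer who knows that model.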

False Positives

The biggest practical cost of AI review is noise. Asked to review, the model is biased toward finding something, so it finds things: non-issues, stylistic preferences dressed up as bugs, and "problems" that your codebase deliberately handles elsewhere.

Common false positives:

Flagging a missing null check on a value that is guaranteed non-null by an upstream type.
Suggesting error handling for an operation that genuinely cannot fail.
Recommending "improvements" that contradict your project's deliberate conventions.
Inventing edge cases that cannot occur given the actual input constraints.
Re-flagging something handled in code it cannot see.
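The first false positive in that list might look like the sketch below. It assumes a TypeScript project with `strictNullChecks` enabled; the `UserAccount` type and both function names are hypothetical.

```typescript
type UserAccount = { id: string; displayName: string };

// `user` cannot be null or undefined here: the signature says so, and with
// strictNullChecks the compiler rejects any caller that passes either.
// A "missing null check" flagged on this function is a false positive.
function greeting(user: UserAccount): string {
  return `Hello, ${user.displayName} (${user.id})`;
}

// The nullable case is handled once, at the boundary where the lookup result arrives.
function greetingForLookup(user: UserAccount | undefined): string {
  return user === undefined ? "Hello, guest" : greeting(user);
}
```

A reviewer shown only `greeting` cannot see the boundary function, so including the caller or the type definition in the prompt is often enough to make the finding disappear.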

Manage the noise:

Make it rank by severity so you triage criticals first and can ignore the nits.
Give it more context — most false positives come from the model not seeing the guarantee that makes its concern moot.
Treat findings as leads, not verdicts. Each finding is "look here," not "this is broken." You verify before acting.
Push back. "You flagged a null check on user.id, but user is non-null per the function signature — is this still an issue?" The model will usually concede a genuine false positive.

A reviewer who cries wolf trains you to ignore it. The fix is not to stop using AI review but to filter it: prioritized output and adequate context remove much of the noise.
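If the model's findings are collected in a structured form, the severity ranking can be applied mechanically. The sketch below is one way to do that; the `Finding` shape mirrors the fields requested in the review prompt earlier on this page, while the function name, the rank table, and the default cutoff are assumptions made for illustration.

```typescript
type Severity = "critical" | "major" | "minor" | "nit";

type Finding = { file: string; line: number; severity: Severity; summary: string };

// Lower rank means more urgent.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, major: 1, minor: 2, nit: 3 };

// Drop everything less severe than the cutoff, then order most severe first.
// `filter` returns a new array, so the caller's list is left untouched.
function triage(findings: Finding[], leastSevereToKeep: Severity = "minor"): Finding[] {
  const cutoff = SEVERITY_RANK[leastSevereToKeep];
  return findings
    .filter((f) => SEVERITY_RANK[f.severity] <= cutoff)
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}
```

The output is still a list of leads. Sorting decides what you look at first, not what is true.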

Combining AI and Human Review

The two are complementary, and the workflow that gets the most from both runs them in sequence, each doing what it is best at.

1. Author self-reviews with AI first. Before opening the PR, the author runs AI review on their own diff and fixes the real findings. This catches the embarrassing stuff privately and shrinks the diff a human has to wade through.
2. AI does the broad first pass on the PR. It enumerates edge cases, flags security patterns, and checks error handling: the thorough, tireless sweep.
3. The human reviewer focuses on judgment. With the mechanical issues already handled, the human spends their limited attention on intent, design, business correctness, and "is this the right change?"
4. The human owns the decision. AI findings inform the review; they never approve it. A human reads the change, applies judgment, and clicks merge.

This division plays to strengths: AI's breadth and stamina for the mechanical pass, human judgment for everything that requires understanding the system and the business. Neither alone is sufficient — AI-only review ships logic bugs and auth flaws; human-only review is slower and skims the boring cases.
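The "AI informs, human decides" rule can be expressed as a merge gate. The sketch below is hypothetical throughout: the `ReviewState` fields and the `canMerge` function are invented for illustration, and treating unresolved AI criticals as blocking is an assumption about one reasonable policy, not something your tooling necessarily supports.

```typescript
type ReviewState = {
  checksPassed: boolean; // lint, types, tests
  openAiCriticals: number; // AI findings marked critical that no human has resolved or dismissed
  humanApprovals: number;
};

// AI output can hold a merge back, but it can never approve one:
// a clean AI pass with zero human approvals still returns false.
function canMerge(state: ReviewState): boolean {
  return state.checksPassed && state.openAiCriticals === 0 && state.humanApprovals >= 1;
}
```

Note the asymmetry: there is no field through which the AI can add an approval, only one through which a human can clear its findings.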

The Bottom Line

AI is an excellent first reviewer and a dangerous last one. It catches the edge cases and obvious security patterns that humans skim past, and it does so in seconds, so run it, ideally before you even open the PR. But it reviews the code, not the change. It does not know your business rules, cannot judge your architecture, and its "no issues" only ever means "no obvious issues."

Use AI to raise the floor: a faster, more thorough first pass that lets human reviewers spend their judgment where it counts. Then let a human — who understands the system, the domain, and the stakes — make the call. The AI finds candidates; the human decides.
