Decision Review & Compliance

A specialist reads a complex artifact and judges it against rules — the LLM does first-pass review, the human carries the accountability

Scenario Abstraction

A skilled professional (lawyer, compliance officer, underwriter, regulator, auditor, reviewer) is paid to read complex material and judge it against rules: a contract against a playbook, a transaction against AML policy, a loan application against credit policy, a clinical chart against guidelines, a code change against security policy.

The work is high-stakes (a wrong judgment is a real loss) and low-throughput (a senior person spends hours per artifact), and it consists largely of finding things to flag rather than generating new content. LLMs are well-suited to first-pass review (locating issues for the specialist to confirm) but typically not to deciding.

Solution Shape

Define the policy / playbook — an explicit, written rule set. (If it doesn't exist, building it is half the project.)
Ingest the artifact — contract, claim, application, code diff, transaction set.
Decompose into checks — each rule becomes a check the LLM can perform with citations.
Run checks — for each rule, find evidence in the artifact, classify as pass / fail / unclear.
Explain & cite — every finding points back to the exact span / line / record that triggered it.
Triage severity — group findings, suggest a priority for the human reviewer.
Human decision — the reviewer accepts, overrides, or escalates. Their decision is the final record.
Learn from overrides — disagreements feed a labeled set for prompt and policy refinement.

The LLM is not deciding; it is preparing a structured docket for the human's decision. That framing keeps liability and trust where they belong.
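The steps above can be sketched as a small data flow. This is a minimal illustration, not a reference implementation: the names `Rule`, `Finding`, `run_checks`, `triage`, `record_decision`, and `check_liability_cap` are all hypothetical, and the checker is a plain keyword stub standing in for whatever LLM call a real system would make.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str       # the written policy rule
    severity: str   # "fail", "warn", or "info"


@dataclass
class Finding:
    rule_id: str
    status: str                                  # "pass", "fail", or "unclear"
    severity: str
    citation: Optional[tuple[int, str]] = None   # (1-based line number, quoted line)
    reviewer_decision: Optional[str] = None      # set only by the human reviewer


# A checker receives the artifact's lines and returns (status, 0-based line index or None).
# In a real system this is where the model call would sit (assumption).
Checker = Callable[[list[str]], tuple[str, Optional[int]]]


def run_checks(artifact: str, rules: list[Rule], checkers: dict[str, Checker]) -> list[Finding]:
    lines = artifact.splitlines()
    docket = []
    for rule in rules:
        status, idx = checkers[rule.rule_id](lines)
        citation = (idx + 1, lines[idx]) if idx is not None else None
        # No flag without a pointer: an uncited "fail" is downgraded to "unclear".
        if status == "fail" and citation is None:
            status = "unclear"
        docket.append(Finding(rule.rule_id, status, rule.severity, citation))
    return docket


SEVERITY_ORDER = {"fail": 0, "warn": 1, "info": 2}


def triage(docket: list[Finding]) -> list[Finding]:
    # Only non-passing findings reach the reviewer, most severe first.
    open_items = [f for f in docket if f.status != "pass"]
    return sorted(open_items, key=lambda f: SEVERITY_ORDER[f.severity])


def record_decision(finding: Finding, decision: str) -> Finding:
    # The human's choice is the final record; the model never sets this field.
    if decision not in {"accept", "override", "escalate"}:
        raise ValueError(f"unknown decision: {decision}")
    finding.reviewer_decision = decision
    return finding


def check_liability_cap(lines: list[str]) -> tuple[str, Optional[int]]:
    # Stub check (hypothetical): fails if any line mentions unlimited liability.
    for i, line in enumerate(lines):
        if "unlimited liability" in line.lower():
            return "fail", i
    return "pass", None
```

Findings whose `reviewer_decision` ends up as "override" are the natural source for the labeled disagreement set mentioned in the last step.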

Key Building Blocks

Versioned policy library — rules in a format that's both human-readable and prompt-injectable.
Per-rule prompt + eval set — each rule has its own labeled examples and accuracy target.
Strong grounding / citation requirement — no flag without a pointer.
Reviewer UI — side-by-side artifact + findings + accept/override controls.
Audit log — for every decision, what the model saw, what it said, who decided, why.
Calibrated severity — fail / warn / info levels, calibrated against the cost of false positives.
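Two of these blocks, the citation requirement and the audit log, lend themselves to a short sketch. The following is one possible shape under stated assumptions: citations are character offsets plus a quoted string, and the audit record is a flat dictionary serialized as one JSON line. The function names and field names are hypothetical, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def citation_is_grounded(artifact: str, start: int, end: int, quote: str) -> bool:
    """True only if `quote` is exactly the text at artifact[start:end]."""
    return 0 <= start < end <= len(artifact) and artifact[start:end] == quote


def audit_entry(artifact, policy_version, rule_id, model_output,
                reviewer, decision, reason, now=None):
    """Build one audit record: what the model saw and said, who decided, and why."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "timestamp": timestamp,
        # Hash of the exact input, so the record can later be matched to the artifact.
        "artifact_sha256": hashlib.sha256(artifact.encode("utf-8")).hexdigest(),
        "policy_version": policy_version,
        "rule_id": rule_id,
        "model_output": model_output,
        "reviewer": reviewer,
        "decision": decision,
        "reason": reason,
    }


def to_log_line(entry: dict) -> str:
    # One JSON object per line; sorted keys keep the lines easy to diff.
    return json.dumps(entry, sort_keys=True)
```

A finding whose citation fails `citation_is_grounded` would be dropped or marked unclear before it reaches the reviewer UI, which is one way to enforce "no flag without a pointer" mechanically.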

Concrete Cases

Contract review against playbook. NDA / MSA / DPA reviewed against the company's clause-by-clause playbook; deviations flagged with suggested redlines.
AML / sanctions transaction review. Flag transactions matching suspicious patterns; the LLM writes the case narrative; an analyst decides whether to file a suspicious activity report (SAR).
Loan underwriting assistance. Given an application package, check it against credit policy; produce a structured underwriting memo with findings.
Insurance claim adjudication. Given claim + policy, check coverage, exclusions, fraud signals; recommend approve / investigate / deny with reasons.
Clinical guideline review. Given a patient chart, surface gaps against the guideline (e.g., diabetic foot exam due, statin indicated but not prescribed).
Internal audit / SOC 2 evidence review. Read evidence packages, check whether they actually satisfy controls; flag insufficient evidence to the auditor.
Privacy / DPA review for new vendors. Vendor docs in → identify data flows and risks → fill the privacy review form.
Code change security review. PR diff + threat model → flag risky patterns (auth changes, new endpoints, secrets) for human security review.
Marketing claims review. Proposed creative checked against legal claim substantiation; the LLM flags claims missing support.
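To make the code change case concrete, here is a rough sketch of a deterministic pre-filter over a unified diff, the kind of check that could run alongside an LLM pass. The pattern set and the name `flag_diff` are illustrative assumptions, and the regexes are deliberately crude. Each flag carries the new-file line number so the human security reviewer gets an exact pointer.

```python
import re

# Illustrative patterns only; a real policy would define and tune its own.
RISKY_PATTERNS = {
    "hardcoded-secret": re.compile(
        r"(api[_-]?key|secret|password|token)\s*=\s*['\"][^'\"]+['\"]", re.I),
    "auth-change": re.compile(r"\b(authenticate|authorize|login|permission)", re.I),
    "new-endpoint": re.compile(r"@app\.(route|get|post|put|delete)\("),
}

# Unified diff hunk header, e.g. "@@ -10,3 +10,5 @@"; group 1 is the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def flag_diff(diff_text: str) -> list[dict]:
    flags = []
    new_line_no = 0
    for raw in diff_text.splitlines():
        header = HUNK_HEADER.match(raw)
        if header:
            new_line_no = int(header.group(1))
            continue
        # Skip file headers and the "\ No newline at end of file" marker.
        # Simplification: an added line that itself starts with "++" is also skipped.
        if raw.startswith(("+++", "---", "\\")):
            continue
        if raw.startswith("+"):
            added = raw[1:]
            for name, pattern in RISKY_PATTERNS.items():
                if pattern.search(added):
                    flags.append({"pattern": name, "line": new_line_no,
                                  "text": added.strip()})
            new_line_no += 1
        elif raw.startswith("-"):
            continue  # removed lines do not exist in the new file
        else:
            new_line_no += 1  # context line
    return flags
```

Only added lines are scanned, on the assumption that the review cares about what the change introduces. A policy that also cares about removed safeguards would need to inspect the "-" lines as well.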

Similar Scenarios

Plan / design review — architectural design docs, engineering RFCs reviewed against guidelines.
Grading / rubric scoring — student work, training-program portfolios, certification reviews.
Court filing pre-check — procedural compliance review before submission.
Manuscript desk review — submitted papers checked for formatting, scope, missing sections.

Pitfalls & Evaluation

Confusing review with deciding. Don't auto-deny / auto-approve based on the LLM's call. Decisions in this scenario carry liability; the human carries it.
Per-rule accuracy, not overall accuracy. Aggregate accuracy hides catastrophic failures on rare-but-critical rules. Track each rule separately.
False positives erode trust faster than false negatives. Reviewers who see a noisy queue stop reading carefully. Calibrate aggressively against false positives.
Policy drift. When the playbook changes, all evals must be refreshed. Version the policy with the prompt.
Adversarial inputs. Counterparties learn what triggers your checks. Pure prompt-only checks are bypassable; pair with deterministic rules where possible.
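One way to make the policy-drift point operational is to stamp every eval run with a fingerprint of the policy text and prompt it was run against, then treat any run with a different fingerprint as stale. The sketch below assumes eval runs are stored as dictionaries with `rule_id` and `fingerprint` keys; that storage shape and both function names are hypothetical.

```python
import hashlib


def fingerprint(policy_text: str, prompt_template: str) -> str:
    """Short stable identifier for a (policy, prompt) pair."""
    h = hashlib.sha256()
    h.update(policy_text.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    h.update(prompt_template.encode("utf-8"))
    return h.hexdigest()[:12]


def stale_evals(eval_runs: list[dict], current_fp: str) -> list[str]:
    """Rule ids whose last eval ran against a different policy or prompt."""
    return [run["rule_id"] for run in eval_runs if run["fingerprint"] != current_fp]
```

A release gate could then refuse to ship a playbook change while `stale_evals` returns anything, which forces the refresh instead of relying on someone remembering it.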

Useful metrics: per-rule precision/recall, reviewer override rate per rule, time-per-artifact reduction, downstream incident rate (did anything dangerous slip through), agreement with senior reviewer on a held-out set.
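A small worked example shows why per-rule tracking matters. It assumes each evaluation record holds a rule id, whether the model flagged the artifact, and whether a senior reviewer's label says it should have been flagged; the record shape and function names are illustrative.

```python
from collections import defaultdict


def per_rule_metrics(records: list[dict]) -> dict:
    """Precision and recall per rule; None where the denominator is zero."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for r in records:
        c = counts[r["rule_id"]]
        if r["predicted_flag"] and r["true_flag"]:
            c["tp"] += 1
        elif r["predicted_flag"]:
            c["fp"] += 1
        elif r["true_flag"]:
            c["fn"] += 1
        else:
            c["tn"] += 1
    out = {}
    for rule_id, c in counts.items():
        flagged = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        out[rule_id] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
            "n": sum(c.values()),
        }
    return out


def overall_accuracy(records: list[dict]) -> float:
    correct = sum(r["predicted_flag"] == r["true_flag"] for r in records)
    return correct / len(records)


# 98 records for a common rule, all judged correctly,
# plus 2 records for a rare rule where the model missed both true violations.
records = (
    [{"rule_id": "common", "predicted_flag": i < 49, "true_flag": i < 49} for i in range(98)]
    + [{"rule_id": "rare-critical", "predicted_flag": False, "true_flag": True}] * 2
)
```

On this data `overall_accuracy(records)` is 0.98, while `per_rule_metrics(records)["rare-critical"]["recall"]` is 0.0: the aggregate looks healthy even though the rare rule catches nothing.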