Decision Review & Compliance
A specialist reads a complex artifact and judges it against rules — the LLM does first-pass review, the human carries the accountability
Scenario Abstraction
A skilled professional (lawyer, compliance officer, underwriter, regulator, auditor, reviewer) is paid to read complex material and judge it against rules: a contract against a playbook, a transaction against AML policy, a loan application against credit policy, a clinical chart against guidelines, a code change against security policy.
The work is high-stakes (a wrong judgment is a real loss), low-throughput (a senior person spends hours per artifact), and largely consists of finding things to flag, not generating new content. LLMs are well-suited to first-pass review — locating issues for the specialist to confirm — but typically not to deciding.
Solution Shape
- Define the policy / playbook — an explicit, written rule set. (If it doesn't exist, building it is half the project.)
- Ingest the artifact — contract, claim, application, code diff, transaction set.
- Decompose into checks — each rule becomes a check the LLM can perform with citations.
- Run checks — for each rule, find evidence in the artifact, classify as pass / fail / unclear.
- Explain & cite — every finding points back to the exact span / line / record that triggered it.
- Triage severity — group findings, suggest a priority for the human reviewer.
- Human decision — the reviewer accepts, overrides, or escalates. Their decision is the final record.
- Learn from overrides — disagreements feed a labeled set for prompt and policy refinement.
The LLM is not deciding; it is preparing a structured docket for the human's decision. That framing keeps liability and trust where they belong.
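The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Rule` and `Finding` types, the `checker` callable (standing in for an LLM call), and the rule that an uncited "fail" is downgraded to "unclear" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str
    severity: str  # "fail", "warn", or "info" (severity levels, not verdicts)


@dataclass(frozen=True)
class Finding:
    rule_id: str
    verdict: str          # "pass", "fail", or "unclear"
    quote: Optional[str]  # exact span of the artifact that triggered the finding
    severity: str


# Hypothetical interface: (rule, artifact) -> (verdict, quote).
# In a real system this would wrap an LLM call with a per-rule prompt.
Checker = Callable[[Rule, str], tuple[str, Optional[str]]]


def run_checks(rules: list[Rule], artifact: str, checker: Checker) -> list[Finding]:
    findings = []
    for rule in rules:
        verdict, quote = checker(rule, artifact)
        # Assumed policy: a "fail" whose quote is missing or not found
        # verbatim in the artifact is downgraded to "unclear".
        if verdict == "fail" and (not quote or quote not in artifact):
            verdict, quote = "unclear", None
        findings.append(Finding(rule.rule_id, verdict, quote, rule.severity))
    return findings


SEVERITY_ORDER = {"fail": 0, "warn": 1, "info": 2}


def build_docket(findings: list[Finding]) -> list[Finding]:
    """Drop passes and order the rest by severity, with cited fails first."""
    open_items = [f for f in findings if f.verdict != "pass"]
    return sorted(
        open_items,
        key=lambda f: (SEVERITY_ORDER[f.severity], f.verdict != "fail"),
    )
```

The docket contains only what the reviewer needs to look at. Accepting, overriding, or escalating each item stays outside this code, with the human.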
Key Building Blocks
- Versioned policy library — rules in a format that's both human-readable and prompt-injectable.
- Per-rule prompt + eval set — each rule has its own labeled examples and accuracy target.
- Strong grounding / citation requirement — no flag without a pointer.
- Reviewer UI — side-by-side artifact + findings + accept/override controls.
- Audit log — for every decision, what the model saw, what it said, who decided, why.
- Calibrated severity — fail / warn / info levels, calibrated against the cost of false positives.
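As one possible shape for the versioned policy library and the audit log, the sketch below derives a policy version from a content hash of the rule set and emits one JSON line per human decision. The function names and record fields are assumptions chosen for illustration; a real deployment would likely add model and prompt identifiers and write to append-only storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def policy_version(rules: dict[str, str]) -> str:
    """Content hash of the rule set, so any wording change yields a new version."""
    canonical = json.dumps(rules, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def audit_record(*, artifact: str, rules: dict[str, str], model_output: dict,
                 reviewer: str, decision: str, reason: str) -> str:
    """Serialize what the model saw, what it said, who decided, and why."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash rather than store the artifact; assumes the artifact itself
        # is retained elsewhere under the same digest.
        "artifact_sha256": hashlib.sha256(artifact.encode("utf-8")).hexdigest(),
        "policy_version": policy_version(rules),
        "model_output": model_output,
        "reviewer": reviewer,
        "decision": decision,  # e.g. "accept", "override", "escalate"
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)
```

Hashing the canonical JSON form means the version does not depend on the order in which rules were entered, only on their IDs and text.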
Concrete Cases
- Contract review against playbook. NDA / MSA / DPA reviewed against the company's clause-by-clause playbook; deviations flagged with suggested redlines.
- AML / sanctions transaction review. Flag transactions matching suspicious patterns; the LLM writes the case narrative; an analyst decides whether to file an SAR.
- Loan underwriting assistance. Given an application package, check it against credit policy; produce a structured underwriting memo with findings.
- Insurance claim adjudication. Given claim + policy, check coverage, exclusions, fraud signals; recommend approve / investigate / deny with reasons.
- Clinical guideline review. Given a patient chart, surface gaps against the guideline (e.g., diabetic foot exam due, statin indicated but not prescribed).
- Internal audit / SOC2 evidence review. Read evidence packages, check whether they actually satisfy controls; flag insufficient evidence to the auditor.
- Privacy / DPA review for new vendors. Vendor docs in → identify data flows and risks → fill the privacy review form.
- Code change security review. PR diff + threat model → flag risky patterns (auth changes, new endpoints, secrets) for human security review.
- Marketing claims review. Proposed creative checked against legal claim substantiation; the LLM flags claims missing support.
Similar Scenarios
- Plan / design review — architectural design docs, engineering RFCs reviewed against guidelines.
- Grading / rubric scoring — student work, training-program portfolios, certification reviews.
- Court filing pre-check — procedural compliance review before submission.
- Manuscript desk review — submitted papers checked for formatting, scope, missing sections.
Pitfalls & Evaluation
- Confusing review with deciding. Don't auto-deny / auto-approve based on the LLM's call. Decisions in this scenario carry liability; the human carries it.
- Per-rule accuracy, not overall accuracy. Aggregate accuracy hides catastrophic failures on rare-but-critical rules. Track each rule separately.
- False positives erode trust faster than false negatives. Reviewers who see a noisy queue stop reading carefully. Calibrate aggressively against false positives (FP).
- Policy drift. When the playbook changes, all evals must be refreshed. Version the policy with the prompt.
- Adversarial inputs. Counterparties learn what triggers your checks. Pure prompt-only checks are bypassable; pair with deterministic rules where possible.
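To illustrate pairing a prompt-based check with a deterministic rule, here is a small sketch for the code-review case. The regular expression is a deliberately simple, hypothetical secret detector (not a production scanner), and `llm_verdict` is assumed to be the output of a separate LLM check.

```python
import re

# Illustrative pattern only: matches added diff lines ("+...") that assign
# a quoted literal of 8+ characters to a secret-looking name.
SECRET_PATTERN = re.compile(
    r"^\+.*(api[_-]?key|secret|password|token)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]",
    re.IGNORECASE | re.MULTILINE,
)


def combined_verdict(diff: str, llm_verdict: str) -> str:
    """A deterministic hit always flags; the LLM can add flags but never clear one."""
    if SECRET_PATTERN.search(diff):
        return "fail"
    return llm_verdict
```

The design choice is that the two checks are combined asymmetrically: wording that talks the model out of a flag cannot suppress the deterministic hit.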
Useful metrics: per-rule precision/recall, reviewer override rate per rule, time-per-artifact reduction, downstream incident rate (did anything dangerous slip through), agreement with senior reviewer on a held-out set.
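A minimal sketch of the per-rule metrics follows. It assumes each record carries a `rule_id`, whether the model flagged the rule, and whether the reviewer ultimately flagged it, and it treats the reviewer's decision as ground truth, which is itself an assumption worth checking against a senior-reviewed held-out set.

```python
from collections import defaultdict


def per_rule_metrics(records):
    """Compute precision, recall, and override rate separately for each rule.

    records: iterable of dicts with keys
      rule_id (str), model_flagged (bool), reviewer_flagged (bool).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for r in records:
        c = counts[r["rule_id"]]
        if r["model_flagged"] and r["reviewer_flagged"]:
            c["tp"] += 1
        elif r["model_flagged"]:
            c["fp"] += 1
        elif r["reviewer_flagged"]:
            c["fn"] += 1
        else:
            c["tn"] += 1

    metrics = {}
    for rule_id, c in counts.items():
        total = sum(c.values())
        flagged = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        metrics[rule_id] = {
            # None when the denominator is zero, rather than a misleading 0 or 1.
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
            # Share of cases where the reviewer disagreed with the model.
            "override_rate": (c["fp"] + c["fn"]) / total,
        }
    return metrics
```

Reporting per rule rather than in aggregate keeps a rare-but-critical rule with poor recall from being averaged away by high-volume, easy rules.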