Steven's Knowledge

Decision Review & Compliance

A specialist reads a complex artifact and judges it against rules — the LLM does first-pass review, the human carries the accountability

Scenario Abstraction

A skilled professional (lawyer, compliance officer, underwriter, regulator, auditor, reviewer) is paid to read complex material and judge it against rules: a contract against a playbook, a transaction against AML policy, a loan application against credit policy, a clinical chart against guidelines, a code change against security policy.

The work is high-stakes (a wrong judgment is a real loss), low-throughput (a senior person spends hours per artifact), and largely consists of finding things to flag, not generating new content. LLMs are well-suited to first-pass review — locating issues for the specialist to confirm — but typically not to deciding.

Solution Shape

  1. Define the policy / playbook — an explicit, written rule set. (If it doesn't exist, building it is half the project.)
  2. Ingest the artifact — contract, claim, application, code diff, transaction set.
  3. Decompose into checks — each rule becomes a check the LLM can perform with citations.
  4. Run checks — for each rule, find evidence in the artifact, classify as pass / fail / unclear.
  5. Explain & cite — every finding points back to the exact span / line / record that triggered it.
  6. Triage severity — group findings, suggest a priority for the human reviewer.
  7. Human decision — the reviewer accepts, overrides, or escalates. Their decision is the final record.
  8. Learn from overrides — disagreements feed a labeled set for prompt and policy refinement.

The LLM is not deciding; it is preparing a structured docket for the human's decision. That framing keeps liability and trust where they belong.
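
The steps above can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: `Rule`, `Finding`, `phrase_checker`, and `build_docket` are hypothetical names, the severity scale is made up, and the phrase matcher is only a placeholder for whatever LLM call performs the actual check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class Rule:
    rule_id: str
    forbidden_phrase: str  # stand-in for real rule text (assumption)
    severity: int          # hypothetical scale: higher means review first


@dataclass(frozen=True)
class Finding:
    rule_id: str
    verdict: Verdict
    span: Optional[Tuple[int, int]]  # (start, end) character offsets in the artifact
    explanation: str


Checker = Callable[[Rule, str], Finding]


def phrase_checker(rule: Rule, artifact: str) -> Finding:
    """Placeholder for an LLM call: fails the rule if the phrase appears."""
    start = artifact.lower().find(rule.forbidden_phrase.lower())
    if start == -1:
        return Finding(rule.rule_id, Verdict.PASS, None, "phrase not present")
    end = start + len(rule.forbidden_phrase)
    return Finding(rule.rule_id, Verdict.FAIL, (start, end), "phrase present")


def build_docket(rules: List[Rule], artifact: str, checker: Checker) -> List[Finding]:
    """Run every check, then return non-passing findings, most severe first."""
    severity = {r.rule_id: r.severity for r in rules}
    findings = []
    for rule in rules:
        f = checker(rule, artifact)
        # No flag without a pointer: a FAIL lacking a span is downgraded to UNCLEAR.
        if f.verdict is Verdict.FAIL and f.span is None:
            f = Finding(f.rule_id, Verdict.UNCLEAR, None, f.explanation)
        findings.append(f)
    flagged = [f for f in findings if f.verdict is not Verdict.PASS]
    return sorted(flagged, key=lambda f: severity[f.rule_id], reverse=True)


rules = [
    Rule("R1", "unlimited liability", severity=3),
    Rule("R2", "auto-renew", severity=1),
    Rule("R3", "exclusive jurisdiction", severity=2),
]
artifact = "This agreement shall auto-renew annually. Supplier accepts unlimited liability."
docket = build_docket(rules, artifact, phrase_checker)
for f in docket:
    quoted = artifact[f.span[0]:f.span[1]] if f.span else None
    print(f.rule_id, f.verdict.value, quoted)
```

The function returns a docket and nothing more. Accepting, overriding, or escalating each finding stays outside the code path, with the reviewer.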

Key Building Blocks

  • Versioned policy library — rules in a format that's both human-readable and prompt-injectable.
  • Per-rule prompt + eval set — each rule has its own labeled examples and accuracy target.
  • Strong grounding / citation requirement — no flag without a pointer.
  • Reviewer UI — side-by-side artifact + findings + accept/override controls.
  • Audit log — for every decision, what the model saw, what it said, who decided, why.
  • Calibrated severity — fail / warn / info levels, calibrated against the cost of false positives.
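
Two of these building blocks, the citation requirement and the audit log, lend themselves to a short sketch. The function names and record fields below are assumptions chosen for illustration. The record stores a hash of the model input, which presumes the full input is kept elsewhere and the hash is used to verify it was not altered.

```python
import hashlib
import json
from datetime import datetime, timezone


def citation_is_grounded(artifact: str, start: int, end: int, quoted: str) -> bool:
    """True only if the quoted text is exactly what the artifact holds at [start, end)."""
    if not (0 <= start < end <= len(artifact)):
        return False
    return artifact[start:end] == quoted


def audit_record(policy_version: str, rule_id: str, model_input: str,
                 model_output: str, reviewer: str, decision: str, reason: str) -> dict:
    """One log entry: what the model saw, what it said, who decided, and why."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "policy_version": policy_version,
        "rule_id": rule_id,
        # Hash of the exact prompt + artifact text sent to the model.
        "input_sha256": hashlib.sha256(model_input.encode("utf-8")).hexdigest(),
        "model_output": model_output,
        "reviewer": reviewer,
        "decision": decision,  # e.g. "accept", "override", "escalate"
        "reason": reason,
    }


clause = "Supplier accepts unlimited liability."
grounded = citation_is_grounded(clause, 17, 36, "unlimited liability")
record = audit_record("playbook-v3", "R1", clause, "fail: liability is uncapped",
                      "reviewer-a", "accept", "matches playbook section on caps")
print(grounded)
print(json.dumps(record, indent=2))
```

A finding whose quoted text does not match the artifact at the cited offsets can be rejected before it ever reaches the reviewer UI.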

Concrete Cases

  • Contract review against playbook. NDA / MSA / DPA reviewed against the company's clause-by-clause playbook; deviations flagged with suggested redlines.
  • AML / sanctions transaction review. Flag transactions matching suspicious patterns; the LLM writes the case narrative; an analyst decides whether to file an SAR.
  • Loan underwriting assistance. Given an application package, check it against credit policy; produce a structured underwriting memo with findings.
  • Insurance claim adjudication. Given claim + policy, check coverage, exclusions, fraud signals; recommend approve / investigate / deny with reasons.
  • Clinical guideline review. Given a patient chart, surface gaps against the guideline (e.g., diabetic foot exam due, statin indicated but not prescribed).
  • Internal audit / SOC2 evidence review. Read evidence packages, check whether they actually satisfy controls; flag insufficient evidence to the auditor.
  • Privacy / DPA review for new vendors. Vendor docs in → identify data flows and risks → fill the privacy review form.
  • Code change security review. PR diff + threat model → flag risky patterns (auth changes, new endpoints, secrets) for human security review.
  • Marketing claims review. Proposed creative checked against legal claim substantiation; the LLM flags claims missing support.

Similar Scenarios

  • Plan / design review — architectural design docs, engineering RFCs reviewed against guidelines.
  • Grading / rubric scoring — student work, training-program portfolios, certification reviews.
  • Court filing pre-check — procedural compliance review before submission.
  • Manuscript desk review — submitted papers checked for formatting, scope, missing sections.

Pitfalls & Evaluation

  • Confusing review with deciding. Don't auto-deny / auto-approve based on the LLM's call. Decisions in this scenario carry liability; the human carries it.
  • Per-rule accuracy, not overall accuracy. Aggregate accuracy hides catastrophic failures on rare-but-critical rules. Track each rule separately.
  • False positives erode trust faster than false negatives. Reviewers who see a noisy queue stop reading carefully. Calibrate aggressively against false positives.
  • Policy drift. When the playbook changes, all evals must be refreshed. Version the policy with the prompt.
  • Adversarial inputs. Counterparties learn what triggers your checks. Pure prompt-only checks are bypassable; pair with deterministic rules where possible.
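
As one way to pair prompt-based checks with deterministic rules, the sketch below merges regex hits into whatever the model flagged, so a deterministic hit survives even when the model reports nothing. The two patterns and the `(rule_id, start, end)` flag shape are illustrative assumptions, not a vetted secret-detection rule set.

```python
import re
from typing import List, Tuple

Flag = Tuple[str, int, int]  # (rule_id, start, end) offsets into the artifact

# Hypothetical deterministic rules for a code-diff review.
DETERMINISTIC_PATTERNS = {
    "hardcoded-secret": re.compile(
        r"(?i)\b(api[_-]?key|secret|password)\b\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def deterministic_flags(artifact: str) -> List[Flag]:
    """Every regex hit becomes a flag with an exact span."""
    hits = []
    for rule_id, pattern in DETERMINISTIC_PATTERNS.items():
        for m in pattern.finditer(artifact):
            hits.append((rule_id, m.start(), m.end()))
    return hits


def combine(llm_flags: List[Flag], artifact: str) -> List[Flag]:
    """Union of model flags and deterministic hits; deterministic hits are never dropped."""
    seen = set(llm_flags)
    extra = [h for h in deterministic_flags(artifact) if h not in seen]
    return list(llm_flags) + extra


diff = 'password = "hunter2"\nprint("ok")'
print(combine([], diff))                      # model flagged nothing
print(combine([("auth-change", 0, 8)], diff)) # model flagged something else
```

The deterministic layer is the part a counterparty cannot talk their way past, which is why its hits bypass the model entirely.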

Useful metrics: per-rule precision/recall, reviewer override rate per rule, time-per-artifact reduction, downstream incident rate (did anything dangerous slip through?), agreement with a senior reviewer on a held-out set.
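
The per-rule metrics can be computed from reviewed findings along these lines. This is a sketch under stated assumptions: each record is a hypothetical `(rule_id, model_flagged, reviewer_flagged)` triple, the reviewer's call is treated as ground truth, and override rate is taken to mean the share of records where the reviewer's final call differs from the model's.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, bool, bool]  # (rule_id, model_flagged, reviewer_flagged)


def per_rule_metrics(records: Iterable[Record]) -> Dict[str, Dict[str, Optional[float]]]:
    """Precision, recall, and override rate, computed separately for each rule."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for rule_id, model, human in records:
        # "t"/"f": did the model agree with the reviewer; "p"/"n": did the model flag.
        key = ("t" if model == human else "f") + ("p" if model else "n")
        counts[rule_id][key] += 1

    metrics = {}
    for rule_id, c in counts.items():
        flagged = c["tp"] + c["fp"]   # everything the model flagged
        actual = c["tp"] + c["fn"]    # everything the reviewer confirmed as an issue
        total = sum(c.values())
        metrics[rule_id] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
            "override_rate": (c["fp"] + c["fn"]) / total,
        }
    return metrics


records = [
    ("R1", True, True), ("R1", True, False), ("R1", False, True), ("R1", False, False),
    ("R2", True, True), ("R2", True, True),
]
for rule_id, m in per_rule_metrics(records).items():
    print(rule_id, m)
```

Keeping the counts keyed by rule is what exposes a weak rare rule that an aggregate figure would average away.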
