Trust & Safety

High-volume, adversarial moderation, abuse, and fraud detection — where errors compound and the other side is trying to win

Scenario Abstraction

A platform takes in a high volume of content, accounts, or transactions, and a non-trivial fraction of it is bad — spam, abuse, hate, fraud, scams, prohibited goods, IP infringement, CSAM, payment fraud, account takeover, fake reviews, regulatory violations. The job is to detect, classify, decide, and act at platform scale, against an adversarial counterparty that actively learns what your detectors look for.

This scenario looks like Decision Review at first — both apply rules to content — but the dynamics are completely different:

	Decision Review	Trust & Safety
Throughput	Tens to hundreds per day	Thousands to millions per day
Per-decision cost-of-error	High (financial / legal)	Low individually, catastrophic in aggregate
Counterparty	Cooperative (vendors, employees)	Adversarial (constantly probing)
Drift	Slow (policy changes)	Fast (new abuse patterns weekly)
Human review	Every item	Sample / appeals only
Typical autonomy	Tier 1–2	Tier 4 from day one

LLMs change what's tractable here: nuanced policy ("is this implicit incitement?") that used to require human moderation can now be checked at platform scale, and policy can be updated in days instead of weeks of retraining a classifier.

Solution Shape

Policy as text — written, versioned, exemplified. Edge cases collected from prior incidents.
Ingest — content / events at platform rate.
Cheap pre-filter — heuristics, regex, classical classifiers, embedding-based nearest-neighbor to known-bad. Filters out the easy yes/no cases.
LLM judgment for the middle — the nuanced cases go to an LLM check with the relevant policy slice + content + (sometimes) account context.
Decision + severity + reason — allow / restrict / remove / ban, with a short structured reason for audit and appeal.
Act — apply the action via platform APIs; notify the user where applicable.
Appeals queue — every user-impacting action is appealable to a human; appeal outcomes feed the eval set.
Adversarial monitoring — track novel patterns (sudden cluster of similar accounts, new evasion phrasings), elevate to human review and policy update.
Red-team — internal team produces adversarial test cases to keep the policy + prompts honest.

Key Building Blocks

Versioned policy library — same structure as Decision Review, but updated weekly.
Tiered detector stack — never call the LLM when a $0.0001 classifier suffices.
Account / behavior context — fraud signals usually require account history, not just the content.
Structured decision output — action + severity + policy section cited.
Appeal system — both UX and data plumbing.
Red-team / golden set — adversarial examples, including ones the system used to fail on.
Anomaly detection on decisions — distribution shifts (sudden spike in a category) are the leading indicator of a new abuse wave.
Two-eyes for the rarest, most damaging classes — CSAM, terrorism — human always in the loop, never autonomous.

Concrete Cases

UGC moderation on a social platform. Posts, comments, DMs checked against hate, harassment, sexual content, self-harm policies; nuanced policy judgments (irony, reclaimed language, context) routed to LLM checks.
Marketplace listing review. Listings checked for prohibited goods, IP infringement, counterfeits, miscategorization, scam patterns.
Review fraud detection. Detect fake / incentivized / coordinated reviews on commerce or app stores using text patterns + account graph.
Ads creative review. Auto-approve / hold / reject ad creatives against advertising policy (claims substantiation, prohibited categories, regulated industries).
Payment fraud + chargeback prevention. Transaction context + free-text fields (shipping, name, email) checked for fraud signals; classical model decides, LLM writes the case narrative for analysts.
Account takeover triage. Suspicious login / signup events surfaced; LLM summarizes the evidence chain for a human analyst.
Anti–money laundering case narratives. LLM drafts SAR narrative from a structured case package; analyst approves and files.
Bot / inauthentic behavior detection. Patterns across accounts (timing, content templates) flagged; LLM helps classify intent.
Dating / community grooming detection. Sequences of messages between accounts checked against grooming patterns; immediate human escalation for high-confidence.
Customer-support fraud / abuse triage. Inbound tickets attempting refund fraud, social engineering, or account takeover routed to specialized queues.
Live-stream moderation. Audio + video frames sampled; near-realtime decisions on violations.
Election / health misinformation labeling. Tiered policy with public-interest exceptions, journalist exceptions, satire handling.

Similar Scenarios

Internal data loss prevention (DLP) — same shape, the "abuse" is exfiltration; counterparty is sometimes internal.
AI-generated content provenance / watermark checks — same pipeline, the policy is "made by humans or properly disclosed."
Email / messaging spam — older incarnation; LLMs now help on the nuanced ~5% the classical filters miss.
Brand-safety for ad serving — adjacent: prevent ads from running next to objectionable content.

Pitfalls & Evaluation

False positives erode user trust faster than false negatives erode platform trust. Wrongful bans are PR incidents. Calibrate aggressively, make appeals fast and visible.
Policy creep. Every incident adds a clause. After a year, policy is incoherent. Schedule policy refactors; treat policy as code.
Adversarial drift. A working detector decays in weeks. Without continuous adversarial sampling and red-team, accuracy silently rots.
Outsourced reasoning. LLM-as-judge is convenient but creates an attack surface: prompt-injection in user-controlled content can manipulate the judge. Sanitize, sandbox, never let user content carry instructions to the judge.
The worst classes aren't autonomous. CSAM, credible threats, terrorism — humans always in the loop, with appropriate jurisdictional reporting. Never put these on autopilot regardless of model quality.
Aggregate harms are invisible per-item. A single innocuous-looking post is fine; the coordinated 10,000-account version is not. Detection must operate at account / cluster level, not just content.

Useful metrics: per-policy precision/recall on red-team set, per-policy precision/recall on appealed decisions, time-to-detect for new abuse patterns, appeal-overturn rate (the most honest accuracy signal), prevalence (estimated % of bad content reaching users — the ultimate KPI), reviewer queue health.