Content Moderation

Inputs and outputs both need filtering — and the policy is harder than the technology

Any product that lets users interact with an LLM eventually has to decide what users can ask, what the model can produce, and what to do when those collide. Content moderation is the layer that enforces those decisions in practice.

Two Sides of the Filter

Input moderation — does the user's request violate policy? Block, refuse, or escalate.
Output moderation — does the model's response violate policy? Suppress, regenerate, or warn.

Both matter. An input that looks clean can still produce a problematic output (a jailbreak the input filter missed, a hallucination, leaked context). A toxic input might produce a benign refusal that is still worth logging.
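A minimal sketch of how the two checks can wrap a model call. The names `check_input`, `check_output`, and `moderated_call` are hypothetical, and the string matching is only a placeholder for whatever classifier or moderation API is actually in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(text: str) -> Verdict:
    # Placeholder rule; a real system would call a classifier here.
    if "forbidden-topic" in text.lower():
        return Verdict(False, "input matched a blocked topic")
    return Verdict(True)


def check_output(text: str) -> Verdict:
    # Placeholder rule standing in for an output classifier.
    if "INTERNAL-ONLY" in text:
        return Verdict(False, "output leaked internal context")
    return Verdict(True)


def moderated_call(prompt: str, model: Callable[[str], str],
                   max_regenerations: int = 1) -> str:
    # Input side: block before the model is ever called.
    if not check_input(prompt).allowed:
        return "[blocked: request violates policy]"
    # Output side: regenerate a limited number of times, then suppress.
    for _ in range(max_regenerations + 1):
        response = model(prompt)
        if check_output(response).allowed:
            return response
    return "[suppressed: response violates policy]"
```

Bounding the number of regenerations keeps latency and cost predictable; when every attempt fails the output check, the response is suppressed rather than shown.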

Where to Filter

Frontier providers ship built-in safety classifiers and refusal behaviors. They cover the obvious categories (CSAM, violent extremism, etc.) by default. On top of that you'll typically want:

Pre-call moderation API — OpenAI, Google, Azure, AWS, and several third parties offer fast classifiers for hate, harassment, sexual content, self-harm, violence.
Custom classifiers for product-specific policy (e.g., financial advice, medical claims, brand-unsafe topics).
Output classifiers — re-check generated content before showing it to the user.
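One way to layer these is to run each classifier and keep the highest score per category. The sketch below assumes every layer returns a dictionary of category scores between 0 and 1; `provider_moderation` and `financial_advice_classifier` are hypothetical stand-ins, not real vendor APIs.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]
Classifier = Callable[[str], Scores]


def provider_moderation(text: str) -> Scores:
    # Stand-in for a hosted moderation API returning per-category scores.
    return {"hate": 0.9 if "slur" in text else 0.01, "violence": 0.02}


def financial_advice_classifier(text: str) -> Scores:
    # Stand-in for a custom, product-specific classifier.
    return {"financial_advice": 0.8 if "guaranteed returns" in text else 0.05}


def combined_scores(text: str, classifiers: List[Classifier]) -> Scores:
    # Keep the highest score seen for each category across all layers.
    merged: Scores = {}
    for classify in classifiers:
        for category, score in classify(text).items():
            merged[category] = max(score, merged.get(category, 0.0))
    return merged
```

The same function can be applied to the generated response before it is shown, which covers the output-classifier layer with no extra plumbing.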

Policy First, Technology Second

The hard part is rarely the classifier — it's deciding the policy. Questions to settle before implementation:

What categories are out of scope? (Violence, sexual, self-harm, hate, illegal activity, professional advice, IP, others?)
What are the thresholds for each? (Strict, moderate, lenient — by category.)
What happens on a hit? (Hard block, soft warning, escalate to human, log silently.)
How do users appeal a false block?
Who in the company owns policy changes?

Policy that isn't written down ends up implicitly defined by individual engineers' intuitions, which makes enforcement inconsistent and hard to audit.
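Writing the policy down can be as simple as keeping it as reviewable data. The sketch below is one possible shape, not a standard: the category names, thresholds, owners, and the severity ordering are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Action(Enum):
    HARD_BLOCK = "hard_block"
    SOFT_WARNING = "soft_warning"
    ESCALATE = "escalate_to_human"
    LOG_ONLY = "log_silently"
    ALLOW = "allow"


@dataclass(frozen=True)
class CategoryRule:
    threshold: float  # a score at or above this counts as a hit
    action: Action    # what happens on a hit
    owner: str        # who approves changes to this rule


# Illustrative values only; real thresholds come from the written policy.
POLICY: Dict[str, CategoryRule] = {
    "self_harm": CategoryRule(0.30, Action.ESCALATE, "trust-and-safety"),
    "hate": CategoryRule(0.50, Action.HARD_BLOCK, "trust-and-safety"),
    "professional_advice": CategoryRule(0.80, Action.SOFT_WARNING, "legal"),
}

# Assumed ordering, most severe first, so the strictest matching rule wins.
SEVERITY = [Action.HARD_BLOCK, Action.ESCALATE, Action.SOFT_WARNING, Action.LOG_ONLY]


def decide(scores: Dict[str, float]) -> Action:
    # Collect the actions of every rule whose threshold was reached.
    hits = {
        POLICY[category].action
        for category, score in scores.items()
        if category in POLICY and score >= POLICY[category].threshold
    }
    for action in SEVERITY:
        if action in hits:
            return action
    return Action.ALLOW
```

Keeping thresholds, actions, and owners in one structure means a policy change is a reviewable diff with a named approver instead of a scattered code change.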

False Positives Are Real Costs

Overly aggressive moderation causes product harm of its own:

Refusing benign medical questions.
Blocking discussion of historical violence in an educational tool.
Filtering out non-English content because the classifier wasn't trained on it.

Track false positive rate alongside false negative rate. Both have real users behind them.
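Both rates can be computed from a human-labeled sample of moderation decisions. The sketch below assumes each sample is a pair of booleans (whether the filter flagged it, whether it truly violated policy) and uses the standard definitions: false positive rate is FP / (FP + TN), false negative rate is FN / (FN + TP). The function name `error_rates` is hypothetical.

```python
from typing import Iterable, Tuple


def error_rates(samples: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each sample is (flagged_by_filter, truly_violating), taken from a
    human-labeled evaluation set.
    """
    fp = fn = tp = tn = 0
    for flagged, violating in samples:
        if flagged and violating:
            tp += 1
        elif flagged and not violating:
            fp += 1
        elif not flagged and violating:
            fn += 1
        else:
            tn += 1
    # Guard against empty classes so the function never divides by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

Computing the rates per category and per language, rather than only in aggregate, is what surfaces problems like a classifier that over-blocks non-English content.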

User-Facing Refusals

When you do refuse, the refusal itself is a UX surface:

Say which category of content was blocked rather than giving a vague "I can't help with that."
Suggest alternatives when possible.
Provide a feedback path so genuine false positives get fixed.
Don't reveal your policy in detail — over-specific error messages turn into a tutorial for bypassing filters.
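A small sketch of a refusal builder that follows these points: it names the category, offers an alternative where one exists, points to a feedback path, and never exposes rules or scores. All wording and the names `REFUSAL_COPY`, `ALTERNATIVES`, and `build_refusal` are hypothetical.

```python
# Hypothetical category-level wording; nothing here exposes rules or thresholds.
REFUSAL_COPY = {
    "self_harm": (
        "This request touches on self-harm, which this assistant "
        "can't discuss in detail."
    ),
    "professional_advice": (
        "This looks like a request for individual financial or medical "
        "advice, which is out of scope here."
    ),
}
ALTERNATIVES = {
    "professional_advice": (
        "General educational information on the topic is available instead."
    ),
}
DEFAULT_COPY = "This request falls outside what this assistant can help with."
FEEDBACK_HINT = "If this looks like a mistake, use the feedback link to report it."


def build_refusal(category: str) -> str:
    # Category-level explanation, falling back to generic copy if unmapped.
    parts = [REFUSAL_COPY.get(category, DEFAULT_COPY)]
    # Offer an alternative only where one has been written for the category.
    if category in ALTERNATIVES:
        parts.append(ALTERNATIVES[category])
    # Always include the feedback path so false positives can be reported.
    parts.append(FEEDBACK_HINT)
    return " ".join(parts)
```

The function takes only the category, not the classifier score or the matched rule, so detailed policy information cannot leak into the message by accident.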

Logging and Audit

Every moderation decision should be logged:

What the input and output were.
What classifier flagged it and at what confidence.
What action was taken.
The user, session, and request ID.

Trust & Safety, Legal, and Engineering all rely on this audit trail when something serious happens.
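The fields above map naturally onto a structured log record. The sketch below is one possible layout using only the Python standard library; the class and function names are hypothetical, and the `timestamp` field is an added assumption not listed above.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ModerationLogEntry:
    request_id: str
    user_id: str
    session_id: str
    input_text: str
    output_text: str
    classifier: str    # which classifier flagged the content
    category: str
    confidence: float  # classifier score for the flagged category
    action: str        # action taken, e.g. "hard_block"
    timestamp: str     # assumption: UTC, ISO 8601


def make_entry(user_id: str, session_id: str, input_text: str,
               output_text: str, classifier: str, category: str,
               confidence: float, action: str) -> ModerationLogEntry:
    return ModerationLogEntry(
        request_id=str(uuid.uuid4()),
        user_id=user_id,
        session_id=session_id,
        input_text=input_text,
        output_text=output_text,
        classifier=classifier,
        category=category,
        confidence=confidence,
        action=action,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_json_line(entry: ModerationLogEntry) -> str:
    # One JSON object per line is easy to ship to most log pipelines.
    return json.dumps(asdict(entry), sort_keys=True)
```

A fixed schema makes the audit trail queryable by user, session, or request ID when an incident needs to be reconstructed.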

When Compliance Joins the Conversation

Regulated industries (finance, healthcare, education involving minors) layer additional requirements on top of normal moderation: data retention rules, mandatory reporting, age verification, jurisdiction-aware policies. These belong in the design from the start, not bolted on later.
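Jurisdiction-aware policy can be sketched as a base configuration plus per-jurisdiction overrides. The region names and every value below are placeholders invented for illustration; real values have to come from legal and compliance review.

```python
from typing import Dict

# Placeholder defaults, not recommendations.
BASE_POLICY: Dict[str, object] = {
    "retention_days": 30,
    "age_verification": False,
    "mandatory_reporting": False,
}

# Hypothetical regions and values, for illustration only.
JURISDICTION_OVERRIDES: Dict[str, Dict[str, object]] = {
    "region_a": {"retention_days": 365, "mandatory_reporting": True},
    "region_b": {"age_verification": True},
}


def policy_for(jurisdiction: str) -> Dict[str, object]:
    # Start from the base policy, then apply jurisdiction-specific overrides.
    return {**BASE_POLICY, **JURISDICTION_OVERRIDES.get(jurisdiction, {})}
```

Resolving the effective policy in one function from the start makes it cheaper to add a new jurisdiction later than retrofitting conditionals throughout the codebase.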

A Pragmatic Default

For most consumer-facing products: built-in provider safety + a third-party moderation API on inputs + a lightweight output classifier + clear policy doc + human review queue for edge cases. That's the floor. The ceiling depends on your domain.
