Steven's Knowledge

Red-Teaming

Pay people to find the failures users will eventually find for free

Red-teaming is the practice of deliberately attacking your own LLM system to surface failures before adversaries — or normal users — do. For any LLM feature with material risk (financial, reputational, regulatory), it's not optional.

What You're Looking For

Red-teaming covers a broader surface than penetration testing:

  • Policy violations — getting the model to produce content it shouldn't.
  • Jailbreaks — bypassing safety training through clever prompting.
  • Prompt injection — embedded payloads that hijack the agent.
  • Data leakage — exfiltrating system prompts, training data, other users' content.
  • Tool misuse — abusing function-calling capabilities to do harm.
  • Hallucinated capabilities — convincing the model it can do something dangerous it normally refuses.
  • Bias and harm — outputs that are technically allowed but disproportionately harmful to specific groups.

Who Should Do It

Three layers, all useful:

  • Internal team — engineers, T&S, security. Cheap, continuous, deeply familiar with the system.
  • Cross-functional volunteers — people from outside the AI team who think differently. Catches blind spots.
  • External experts — specialized red-team firms or independent researchers. Bring novel attack patterns and an outside perspective.

Each layer finds different failures. A mature program runs all three.

Methodology

A repeatable process:

  1. Define the threat model. What categories matter for this product? What's out of scope?
  2. Build attack libraries. Catalogue known patterns: jailbreak prompts, injection payloads, sensitive topic probes.
  3. Run structured attacks. Hit every category, document results, score severity.
  4. Run unstructured attacks. Let creative humans wander. They find things checklists miss.
  5. Triage and fix. Prioritize by exploitability and impact.
  6. Add to regression suite. Every fixed issue becomes a test that runs forever.

The last step is what compounds. A red-teaming program without regression coverage just keeps finding the same things.

Automation

LLMs are also attackers. Automated red-teaming uses one model to generate attacks against another, scales further than humans alone:

  • Adversarial generation — a model trained or prompted to produce jailbreaks.
  • Mutation strategies — start from a known attack, mutate to find variants.
  • Closed-loop attack/defense — attacker model and defender model iterate against each other.

Automation finds easy attacks fast. Humans still dominate at finding the unusual, contextual ones.

Severity and Triage

Not every finding is a P0. A useful axis:

  • How easy is it to trigger? (Trivially, with effort, only by experts.)
  • What's the impact? (Embarrassing, harmful to users, legally serious, life-safety.)
  • How widely does it apply? (Single specific input, whole category of inputs.)

Combine these into a severity score and triage accordingly.

Disclosure and Response

For systems with real exposure, set up the muscle ahead of time:

  • Internal channel for reporters (employees, contractors, beta testers).
  • External channel — a clear path for outside researchers to report findings.
  • Acknowledge / fix / disclose workflow with target timelines.
  • Post-mortems for serious incidents.

Treat it like product security has always been treated, because functionally it is.

What Counts as Success

Not "no findings." Findings are the goal — they're what you wanted to discover. Success looks like: a steady flow of issues identified and fixed, declining severity over time, regressions caught before launch, no major surprises in the wild. Zero findings means your red team isn't trying hard enough.

On this page