Steven's Knowledge

Confidence & Uncertainty

Models that always sound certain are users' biggest hazard — communicate uncertainty deliberately

LLMs sound confident even when they shouldn't. The default tone of most models is calm, declarative authority, even on questions they have no real basis to answer. Building features that let users sense how much to trust the output is one of the highest-leverage UX investments in AI.

The Confidence Problem

A model that hedges on everything is annoying. A model that asserts everything is dangerous. The right design lets the system express different levels of confidence based on what it actually knows — and surfaces that to the user in a way they can act on.

Three main sources of unwarranted confidence:

  • Training distribution. Models are trained on text from confident humans; they imitate confidence by default.
  • Generation dynamics. Once a model commits to a claim early in its response, it tends to defend it for the rest of the output.
  • Evaluation pressure. Many benchmarks score a hedged or abstaining answer no better than a wrong one, so optimizing for them tends to train hedging out.

The solution is partly prompting, partly post-processing, and largely UI.

Where to Express Uncertainty

A few patterns, in increasing investment:

Refusals. When the model genuinely doesn't have an answer, refusing — clearly, with a reason — beats a hallucinated guess. This is the floor.

Hedging language. "I think...", "Based on the available information...", "It's likely that..." Trained or prompted into the model's tone. Costs verbosity; improves calibration only when the hedges track real uncertainty.

Confidence flags. Explicit "low confidence" labels on specific claims or whole responses. Shown to the user as an unobtrusive cue that is still hard to miss.

Score-based UI. Numeric confidence (a 0–1 score, percentage, or tier) shown alongside the answer. Useful for advanced users; usually opaque to typical ones.

Differential rendering. High-confidence claims rendered normally; low-confidence claims rendered in italics, gray, or with a marker.

The right level depends on user sophistication and the stakes of the decision.
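Confidence flags and differential rendering can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the Claim type, the 0.5 and 0.8 thresholds, and the bracketed text markers are all assumptions, and the scores are taken to be already calibrated.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float  # hypothetical calibrated score in [0, 1]


def tier(confidence: float, low: float = 0.5, high: float = 0.8) -> str:
    """Map a score to a coarse tier. The thresholds are illustrative."""
    if confidence >= high:
        return "high"
    if confidence >= low:
        return "medium"
    return "low"


def render(claims: list[Claim]) -> str:
    """Render high-confidence claims plainly and mark the rest."""
    parts = []
    for claim in claims:
        level = tier(claim.confidence)
        if level == "high":
            parts.append(claim.text)
        elif level == "medium":
            parts.append(f"{claim.text} [unverified]")
        else:
            parts.append(f"[low confidence] {claim.text}")
    return " ".join(parts)


# Made-up claims, for illustration only.
answer = [
    Claim("The API is rate limited.", 0.95),
    Claim("The limit is 100 requests per minute.", 0.4),
]
print(render(answer))
```

In a real UI the markers would more likely become styling (gray text, italics, a badge) than bracketed text; the tiering step stays the same.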

Estimating Confidence

LLMs don't natively output reliable confidence. Approaches to extract it:

  • Self-report. Ask the model to rate its own confidence. Cheap, weakly correlated with accuracy, susceptible to overconfidence bias.
  • Token logprobs. The model's per-token probabilities. Useful signal, especially for short, fact-shaped outputs.
  • Sampling agreement. Generate multiple responses and measure agreement. High agreement suggests the answer is likely correct, though a model can also be consistently wrong.
  • Verification step. After producing an answer, run a separate prompt that critiques it and rates how well supported it is. Adds cost; can meaningfully improve calibration.
  • Retrieval signals. For RAG, the quality of retrieved chunks (similarity scores, count of relevant matches) is a proxy for confidence in the answer's grounding.
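Two of the signals above are easy to compute once you have the raw material. The sketch below assumes you already hold the per-token log probabilities for an answer and a list of independently sampled answers; how you obtain them depends on your provider. The function names and the normalization rule are hypothetical.

```python
import math
from collections import Counter


def mean_token_probability(logprobs: list[float]) -> float:
    """Geometric mean of token probabilities: exp of the average logprob."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(logprobs) / len(logprobs))


def normalize(answer: str) -> str:
    """Crude normalization so trivially different strings compare equal."""
    return " ".join(answer.lower().split()).rstrip(".")


def agreement(samples: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that gave it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


top_answer, share = agreement(["Paris", "paris.", "Lyon"])
print(top_answer, round(share, 2))
```

Exact string matching only works for short, fact-shaped answers. Longer outputs need a semantic comparison, which this sketch does not attempt.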

The strongest confidence signal usually combines several of these.
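One simple way to combine signals is a weighted average over whichever ones are available for a given answer. This is only a sketch: the signal names and weights are assumptions, each signal is taken to be already scaled to [0, 1], and the combined score still needs to be checked against an eval set before anyone sees it.

```python
def combine(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the available signals, each assumed to be in [0, 1].

    Signals without a configured weight are ignored, and weights for
    signals that are missing from this answer do not count.
    """
    used = [name for name in signals if name in weights]
    total = sum(weights[name] for name in used)
    if total <= 0:
        raise ValueError("no weighted signals available")
    return sum(signals[name] * weights[name] for name in used) / total


# Hypothetical weights; in practice, fit them on labeled eval data.
WEIGHTS = {"agreement": 1.0, "retrieval": 1.0, "self_report": 0.5}

score = combine({"agreement": 1.0, "retrieval": 0.5}, WEIGHTS)
print(score)
```

A more conservative alternative is to take the minimum of the signals, so that a single weak signal is enough to lower the displayed confidence.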

When Refusals Are the Right Answer

Refusals are unfashionable — they feel like the model is unhelpful. They're often correct. Cases where refusal beats best-effort:

  • High-stakes decisions where wrong answers have real consequences.
  • No retrieval signal. RAG returned nothing useful.
  • Out-of-scope questions the system isn't designed for.
  • Insufficient context. The user's question is genuinely under-specified.

Refusals should be specific, not generic:

  • Bad: "I can't help with that."
  • Better: "I don't have data on Q4 sales — only Q1–Q3 are in the documents I have access to."

A specific refusal is more actionable than a vague answer.
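The four refusal cases and the "specific, not generic" rule can be sketched as one function that returns either a reason to refuse or None. Everything here is hypothetical: the RequestState fields, the 0.9 threshold for high-stakes questions, and the message wording.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestState:
    topic: str
    in_scope: bool = True
    missing_details: list[str] = field(default_factory=list)
    retrieved_chunks: int = 0
    coverage: str = ""  # what the available documents do cover
    high_stakes: bool = False
    confidence: float = 0.0  # assumed calibrated, in [0, 1]


def refusal_reason(state: RequestState, min_high_stakes: float = 0.9) -> Optional[str]:
    """Return a specific refusal message, or None if answering is fine."""
    if not state.in_scope:
        return f"Questions about {state.topic} are outside what this assistant is built for."
    if state.missing_details:
        details = ", ".join(state.missing_details)
        return f"I need more detail before I can answer: {details}."
    if state.retrieved_chunks == 0:
        return (
            f"I don't have data on {state.topic}; only {state.coverage} "
            "is in the documents I have access to."
        )
    if state.high_stakes and state.confidence < min_high_stakes:
        return (
            f"I'm not confident enough to answer a question about {state.topic} "
            "where a wrong answer is costly. Please check an authoritative source."
        )
    return None


state = RequestState(topic="Q4 sales", retrieved_chunks=0, coverage="Q1-Q3")
print(refusal_reason(state))
```

Each branch names what is missing, so the user knows what to change: the question, the scope, or the source they consult.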

Calibration

A well-calibrated confidence signal means that of all the answers the model gives at 80% confidence, about 80% are correct. Most LLMs are not well calibrated by default; they tend to be systematically overconfident. Calibration steps:

  • Measure on your eval set. Bin predictions by reported confidence; check accuracy in each bin.
  • Recalibrate post-hoc. Adjust the displayed confidence to match observed accuracy.
  • Prefer calibration over precision. A coarse but calibrated signal ("high / medium / low") is more useful than a precise but miscalibrated one (87.3%).
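The measure and recalibrate steps can be sketched with equal-width bins. This assumes an eval set where each prediction has a reported confidence in [0, 1] and a correct/incorrect label. Replacing a raw score with the observed accuracy of its bin is often called histogram binning; the function names and the choice of five bins are illustrative.

```python
def bin_index(confidence: float, n_bins: int) -> int:
    """Equal-width bin for a confidence in [0, 1]; 1.0 falls in the last bin."""
    return min(int(confidence * n_bins), n_bins - 1)


def reliability_table(confidences: list[float], correct: list[bool], n_bins: int = 5) -> dict[int, dict]:
    """Per-bin count, mean reported confidence, and observed accuracy."""
    grouped: dict[int, list[tuple[float, bool]]] = {}
    for conf, ok in zip(confidences, correct):
        grouped.setdefault(bin_index(conf, n_bins), []).append((conf, ok))
    table = {}
    for idx, items in sorted(grouped.items()):
        table[idx] = {
            "count": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        }
    return table


def recalibrate(confidence: float, table: dict[int, dict], n_bins: int = 5) -> float:
    """Swap a raw score for the observed accuracy of its bin, if measured."""
    row = table.get(bin_index(confidence, n_bins))
    return row["accuracy"] if row else confidence


# Made-up eval results: the model reports about 0.9 but is right half the time.
table = reliability_table([0.9, 0.9, 0.95, 0.85], [True, False, True, False])
print(table[4]["accuracy"], recalibrate(0.92, table))
```

Bins with few examples give noisy accuracy estimates, which is one more reason to display coarse tiers rather than the recalibrated number itself.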

Communicating Uncertainty Without Killing UX

Hedging language gets old fast. Better patterns:

  • Confidence as a visual layer. Use color, weight, or badges rather than more text.
  • Default to confident when the system has reason to be; surface uncertainty only when it's meaningful.
  • Make it actionable. "I'm not sure about this — want me to search again?" beats "I'm not sure about this."
  • Avoid false precision. "92.4% confident" implies a precision your system doesn't have.

Done well, the user notices uncertainty exists when it matters and doesn't notice it otherwise.
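A minimal sketch of the "surface only when meaningful" and "make it actionable" patterns: high-confidence answers get no cue at all, and low-confidence ones get a cue with a next step. The tier names, the can_search_again flag, and the message wording are assumptions.

```python
from typing import Optional


def uncertainty_cue(tier: str, can_search_again: bool = False) -> Optional[str]:
    """Pick the message shown next to an answer, or None for no cue."""
    if tier not in ("high", "medium", "low"):
        raise ValueError(f"unknown tier: {tier}")
    if tier == "high":
        return None  # stay quiet when the system has reason to be confident
    if tier == "medium":
        return "Worth double-checking before you rely on it."
    if can_search_again:
        return "I'm not sure about this. Want me to search again?"
    return "I'm not sure about this. Please verify it before acting on it."


print(uncertainty_cue("low", can_search_again=True))
```

The function takes a tier rather than a raw score, so a number like 92.4% never reaches the user.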

The Hardest Case: Subtle Wrongness

The dangerous failure isn't the model being wrong with low confidence. It's the model being wrong with high confidence, sounding correct, on a question the user doesn't have the expertise to verify. No confidence UI fully solves this — it requires:

  • Domain-specific guardrails. Validators that check answers against authoritative sources or rules.
  • Citations users can verify. External truth, not internal confidence.
  • Critical user education. Making clear what the system can and can't do.

Confidence UI is necessary but not sufficient. The honest framing: it reduces the hazard but doesn't eliminate it.
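As one example of a guardrail built on external truth rather than internal confidence, the sketch below checks that every quote the model cites appears verbatim in the source it names. It assumes citations arrive as (source_id, quote) pairs and that the source texts are available as a dict; both shapes are hypothetical.

```python
def unsupported_citations(
    citations: list[tuple[str, str]],
    sources: dict[str, str],
) -> list[tuple[str, str]]:
    """Return citations whose quote is not found in the named source."""

    def norm(text: str) -> str:
        # Ignore case and whitespace differences when matching.
        return " ".join(text.lower().split())

    problems = []
    for source_id, quote in citations:
        source_text = sources.get(source_id)
        if source_text is None or norm(quote) not in norm(source_text):
            problems.append((source_id, quote))
    return problems


# Made-up data for illustration.
sources = {"report": "Revenue grew 12% in Q3."}
citations = [
    ("report", "revenue grew 12%"),
    ("report", "revenue grew 15%"),
    ("memo", "headcount doubled"),
]
print(unsupported_citations(citations, sources))
```

A verbatim match only proves the quote exists, not that the answer interprets it correctly, so a check like this narrows the problem rather than closing it.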

A Quality Bar

A feature handles uncertainty well when:

  • It refuses cleanly when it should.
  • It expresses calibrated, actionable uncertainty when it should.
  • It avoids overclaiming on low-evidence answers.
  • It points users toward verification when stakes are high.

Most AI features today don't meet this bar. Building one that does is a real differentiator.
