LLM Project Anti-Patterns

The recurring ways LLM projects underdeliver, written down so you can recognize them in your own

This page is the synthesis of the "Pitfalls & Evaluation" sections scattered across the rest of the Solutions catalog. None of these are theoretical — they are the patterns I see repeatedly in real teams shipping real LLM features. If a project is in trouble, the failure is almost always somewhere on this list.

Scoping & Framing

The hammer-and-nail launch

A team decides "we should use AI" and then goes looking for a problem. The result is a feature that demos well and ships into a use case where the model's strengths weren't actually the bottleneck. Fix: start from a workflow that is painful today, identify what makes it painful, and only then ask which technique (LLM or otherwise) fits.

Skipping the autonomy decision

Stakeholders pitch "the AI will do it" without specifying whether it means suggest, draft, propose-with-approval, or act. The team builds for ambient expectation and ships at whatever tier happens. Fix: decide the launch tier explicitly using the Autonomy Ladder. The graduation path is a deliverable, not a hope.

Demo-driven product

The roadmap reflects what the model can do in a demo, not what users need in production. Six months later the product is impressive in screenshots and unused in practice. Fix: evaluate against real workflows, real volumes, real edge cases — before committing.

Evaluation

No gold set, no project

The team iterates prompts based on vibes. Whether each change helps or hurts cannot be measured. After three months, nobody can tell whether the system is better than it was at launch. Fix: invest in a labeled eval set before the first prompt is "tuned." Protect it from contamination.

Demoing on training-time examples

The system aces the example set because every prompt was tuned against it. Production reveals the long tail. Fix: hold out a fresh eval set the team has never seen. Refresh it periodically with samples from production.

Single-number evaluation on a multi-failure system

"Accuracy 87%" hides which classes fail. The 13% is almost always concentrated on the most important categories. Fix: per-class, per-rule, per-segment metrics. Track the worst slice, not the average.

LLM-as-judge calibrated against itself

The model that does the work also grades the work. Both can be wrong in the same direction and nobody knows. Fix: calibrate the judge against domain-expert labels regularly; track judge-vs-human agreement; rotate models for the judge role.

Grounding & Trust

Confident hallucination

The model invents facts (citations that don't exist, action items nobody proposed, fields not present in the source). Fix: require grounding — every claim cites a region of the source; reject claims without citations; build the pipeline to drop unsupported outputs by default.

Ungrounded math

The LLM reports a "total" that's not the sum of the line items it just extracted. Fix: never trust LLM-generated numbers from source documents. Re-compute totals deterministically from extracted fields.

Citation that doesn't resolve

URLs / paper titles / file paths that look real but aren't. Common in research-style and code-aware workflows. Fix: every citation is re-fetched and verified before display.

"Acknowledge what you don't know" is not training

Telling the model "say I don't know if unsure" in the prompt is necessary and insufficient. Without refusal examples in eval and explicit policy enforcement, models default to confident answers on out-of-scope questions. Fix: include refusal cases in the eval; reject answers that aren't supported.

Production & Operations

Premature Tier 4

Going from "demo looks great" to "auto-submit / auto-send / auto-act" without observed agreement on real volumes. Recovery is public. Fix: see Autonomy Ladder graduation criteria. Promotion is earned, not declared.

Permanent Tier 1

The opposite: a working copilot that never graduates because nobody owns the graduation work. Capability is real, business value compounds slowly or not at all. Fix: assign owner + timeline + criteria for each promotion step.

No observability

No logs, no traces, no per-event capture of "what did the model see, what did it say, what tools did it call." Debugging is impossible; improving is guessing. Fix: structured trace per interaction from day one; treat it like an APM stack.

Silent format drift

Vendor changes invoice template, content provider changes web layout, partner changes schema. Accuracy quietly slides. Fix: monitor per-template / per-source metrics; alert on output-distribution change.

Tool that lies

A create_X tool returns success when nothing happened, or success when the wrong thing happened. The agent reports "done." Fix: verify critical side effects post-hoc with a read; treat tool returns as untrusted for state-changing operations.

Unbounded agent loops

The agent runs out of context, or burns through budget, or repeats itself. Fix: hard caps on steps, tokens, wall-clock; treat exceeding the cap as a failure that escalates to a human.

Cost & Latency

Big model for every step

The system uses the most expensive model for every call, including ones that a smaller model handles fine. Costs become a blocker. Fix: tier models by step difficulty; reserve the strong model for the hard parts; cache aggressively.

Synchronous when async is fine

A user is staring at a spinner for an operation they would happily run in the background. Fix: batch APIs and async patterns where the user doesn't need an immediate answer. Latency is for interactive; throughput is for backfill.

No caching on stable input

The same prompt with the same input is sent on every render. Prompt caching, retrieval caching, and embedding caching are all available and unused. Fix: cache at the layer where input is most stable.

UX

Streaming as a feature, not as a fix

Streaming output makes a 10-second response feel acceptable. It doesn't make a wrong answer right. Teams sometimes ship streaming and consider latency "solved." Fix: profile actual time-to-useful, not time-to-first-token.

The blank-page trap

User opens the AI feature, sees an empty prompt, freezes, closes the tab. Fix: seed concrete starting points the user can edit. The cheapest way to look smart is to start the conversation.

Conversational by default for non-conversational tasks

Some tasks are forms, not chats. Forcing a dialog adds friction and removes affordances. Fix: match interaction shape to task shape. Reach for chat only when the user's intent is genuinely open-ended.

Unbounded helpfulness

The model promises things the system can't deliver (prices it doesn't know, eligibility it can't check, dates without a calendar tool). Fix: ground concrete claims in tool calls; refuse claims without a source.

Safety & Security

Treating untrusted input as instructions

User-uploaded documents, inbound emails, web pages, ticket comments — all can contain prompt injection. A system that flattens them into the prompt is vulnerable. Fix: separate channels for system instructions vs. content to be processed; never let content carry authority over the agent.

Permission leakage in retrieval

A knowledge assistant returns content the user couldn't otherwise read. Fix: index per ACL; enforce permissions at retrieval, never via prompt filter.

Sensitive data in logs

Full prompt logs in plaintext, customer data in eval sets. Fix: redact at log time; treat eval sets as production data.

Autonomous handling of catastrophic classes

CSAM, credible threats, financial fraud at scale, infrastructure destruction — never autonomous regardless of model quality. Fix: human-in-the-loop with appropriate reporting obligations. Model quality is irrelevant to this requirement.

Organization

One team, one model, one prompt

A single team owns the LLM call for every use case; prompts are shared across unrelated workflows; one model regression breaks ten features. Fix: treat each use case as its own product surface; isolate prompts, evals, and model choice per workflow.

Engineering ships, nobody evaluates

The engineering team ships the feature and moves on; no analyst, ops manager, or SME owns ongoing quality. Quality regresses silently. Fix: ongoing evaluation has an owner; review cadence is on the calendar.

Excitement window expires

Six months in, the original sponsor moves on, the project's slot in roadmap reviews vanishes, the team is reassigned. The system stays in production, unmaintained. Fix: name an operational owner separate from the launch team. Budget for maintenance, not just launch.

How to Use This Page

When a project feels stuck or under-delivering, run down this list. The diagnosis is usually one of: no eval, no grounding, wrong autonomy tier, untrustworthy tool, or no owner of ongoing quality. Almost everything else is symptom.

On this page