Steven's Knowledge

Cost Optimization

The token bill scales with success — design for it from the start

LLM features have a cost curve unlike most software: every active user adds direct, marginal, per-call cost. A feature that's cheap to build can become expensive to run if usage takes off. Cost engineering isn't optional once you cross any meaningful traffic threshold.

Where Spend Goes

For most systems, cost breaks down roughly like:

  • Input tokens — system prompts, tool definitions, retrieved context, conversation history.
  • Output tokens — the model's response. Usually priced several times higher per token than input, so it can be the largest line item when responses are long.
  • Repeated work — re-running expensive prompts that could have been cached.
  • Wasted capacity — frontier models doing tasks a smaller model would handle.

Each maps to a different lever.
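
As a rough illustration of how the first two line items add up, the sketch below estimates per-request cost from token counts. The prices are placeholder values rather than any provider's actual rates, and `Prices` and `estimate_cost` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per million tokens. Placeholder values, not real provider rates."""
    input_per_mtok: float = 3.00
    output_per_mtok: float = 15.00


def estimate_cost(input_tokens: int, output_tokens: int, prices: Prices = Prices()) -> dict:
    """Split one request's cost into its input and output components."""
    input_cost = input_tokens / 1_000_000 * prices.input_per_mtok
    output_cost = output_tokens / 1_000_000 * prices.output_per_mtok
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


# Example: a 4,000-token prompt that produces a 500-token answer.
per_request = estimate_cost(input_tokens=4_000, output_tokens=500)

# The same request repeated one million times in a month.
monthly = per_request["total"] * 1_000_000
```

Running the numbers per endpoint like this shows which lever matters most: a long prompt with a short answer is dominated by input cost, while a short prompt with a long answer is dominated by output cost.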

Prompt Caching

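Every major provider now offers prompt caching: a stable prefix that repeats across calls is billed at a steep discount (up to roughly 10x cheaper for cached reads, depending on the provider). Caches match on the exact prefix, so any change early in the prompt invalidates everything after it. The structural rule: put stable content first, variable content last.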

[SYSTEM PROMPT]        ← stable, cached
[TOOL DEFINITIONS]     ← stable, cached
[RETRIEVED CONTEXT]    ← session-stable, cached
[CONVERSATION]         ← partially cached
[CURRENT TURN]         ← not cached

For chat-shaped or agentic workloads, caching alone often cuts spend by more than half.
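
The ordering rule can be checked locally before any provider is involved. The sketch below is a minimal illustration, not a provider integration: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and real caching happens on the provider side (some providers also require explicit cache markers). It only verifies that the stable segments stay byte-identical across turns.

```python
import hashlib


def build_prompt(system: str, tools: str, context: str, history: list[str], turn: str) -> list[str]:
    """Order segments from most stable to most variable."""
    return [system, tools, context, *history, turn]


def prefix_fingerprint(segments: list[str], stable_count: int) -> str:
    """Hash the first `stable_count` segments; equal hashes mean an identical prefix."""
    joined = "\x1e".join(segments[:stable_count])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


SYSTEM = "You are a support assistant."
TOOLS = '[{"name": "lookup_order"}]'
CONTEXT = "Refund policy: 30 days."

first = build_prompt(SYSTEM, TOOLS, CONTEXT, [], "Where is my order?")
second = build_prompt(SYSTEM, TOOLS, CONTEXT, ["Where is my order?", "It shipped."], "Can I return it?")

# The first three segments are identical across turns, so a prefix cache can reuse them.
same_prefix = prefix_fingerprint(first, 3) == prefix_fingerprint(second, 3)

# A timestamp at the front changes the prefix on every call and defeats caching.
bad = build_prompt("Now: 12:00:01. " + SYSTEM, TOOLS, CONTEXT, [], "Where is my order?")
broken_prefix = prefix_fingerprint(bad, 3) != prefix_fingerprint(first, 3)
```

A check like this in a test suite can catch accidental cache-busting changes, such as a request ID or current time injected into the system prompt.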

Model Routing

Don't pay frontier prices for tasks a fast model handles. A common pattern:

  1. Classify the request with a small, fast model.
  2. Send simple cases to a cheap model.
  3. Escalate to frontier only for hard ones.

Even crude routing — "if the input is under 200 tokens and the task is classification, use the small model" — saves real money. The classifier itself is part of the cost; make sure the savings exceed it.
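
A minimal sketch of both ideas follows: the crude rule quoted above, and a break-even check for the classifier step. The model labels `"small"` and `"frontier"` and the per-request costs are illustrative assumptions, not real model names or prices.

```python
def route(task: str, input_tokens: int) -> str:
    """Crude rule from the text: short classification inputs go to the small model."""
    if task == "classification" and input_tokens < 200:
        return "small"
    return "frontier"


def routing_saves_money(share_simple: float, cost_small: float,
                        cost_frontier: float, cost_classifier: float) -> bool:
    """Compare average cost per request with a classifier step against always using frontier."""
    routed = cost_classifier + share_simple * cost_small + (1 - share_simple) * cost_frontier
    return routed < cost_frontier


# 60% of traffic is simple: the classifier pays for itself.
worth_it = routing_saves_money(share_simple=0.6, cost_small=0.001,
                               cost_frontier=0.02, cost_classifier=0.0005)

# Only 2% of traffic is simple: the classifier costs more than it saves.
not_worth_it = routing_saves_money(share_simple=0.02, cost_small=0.001,
                                   cost_frontier=0.02, cost_classifier=0.001)
```

The rule-based `route` needs no classifier call at all, which is why crude heuristics are often a good first step before adding a model-based classifier.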

Output Length Control

Output tokens are usually the most expensive per token and dominate latency. Controls that help:

  • Max tokens — set it. Don't leave it unbounded.
  • Structured output — schemas constrain output and prevent rambling.
  • "Be concise" really does work — explicit length instructions reduce output by 20–40% on average.
  • Stop sequences — terminate generation as soon as the structural marker appears.
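
The controls above usually end up as a handful of request parameters. The sketch below is provider-neutral and makes assumptions: the field names `max_tokens` and `stop` follow a common convention but differ between providers, and `build_generation_params` and `cut_at_stop` are hypothetical helpers. `cut_at_stop` only mimics locally what a stop sequence does; the actual saving comes from the provider halting generation at the marker.

```python
def build_generation_params(max_tokens: int = 300, stop: tuple[str, ...] = ("###",)) -> dict:
    """Bundle the output-length controls for one request (field names are assumptions)."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return {
        "max_tokens": max_tokens,  # hard ceiling on output tokens
        "stop": list(stop),        # markers that end generation
        "instruction": "Answer in at most three sentences.",  # explicit length guidance
    }


def cut_at_stop(text: str, stops: list[str]) -> str:
    """Cut text at the earliest stop marker, mimicking server-side stop sequences."""
    cut = len(text)
    for marker in stops:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


params = build_generation_params()

# Without a stop sequence the model may keep explaining after the structured answer.
raw = '{"label": "spam"}\n###\nLet me explain my reasoning in detail...'
kept = cut_at_stop(raw, params["stop"])
```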

Batching

For non-interactive workloads (offline evals, bulk processing, embedding generation), use batch APIs. Anthropic, OpenAI, and Google all offer ~50% discounts in exchange for asynchronous processing, with results typically returned within about 24 hours rather than in real time.
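
Deciding what goes to a batch queue is mostly a matter of asking whether anyone is waiting on the result. The sketch below is a minimal planning helper under stated assumptions: `Job`, `plan`, and the cost figures are hypothetical, and `BATCH_DISCOUNT` encodes the ~50% figure cited above, which should be checked against current provider pricing.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.5  # assumption: the ~50% discount cited in the text


@dataclass
class Job:
    job_id: str
    interactive: bool  # True if a user is waiting on the result
    est_cost: float    # estimated cost in USD if run synchronously


def plan(jobs: list[Job]) -> tuple[list[Job], list[Job], float]:
    """Split jobs into synchronous and batch queues and estimate total cost."""
    sync = [j for j in jobs if j.interactive]
    batch = [j for j in jobs if not j.interactive]
    cost = sum(j.est_cost for j in sync) + sum(j.est_cost for j in batch) * (1 - BATCH_DISCOUNT)
    return sync, batch, cost


jobs = [
    Job("chat-1", interactive=True, est_cost=0.02),
    Job("eval-1", interactive=False, est_cost=0.10),
    Job("eval-2", interactive=False, est_cost=0.30),
]
sync_jobs, batch_jobs, planned_cost = plan(jobs)
all_sync_cost = sum(j.est_cost for j in jobs)
```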

Trim the Prompt

Periodically audit prompts for things that grew over time:

  • Examples that no longer help.
  • Tool definitions for tools the model never calls.
  • Boilerplate instructions that don't change behavior.
  • Retrieved chunks that don't get cited.

A 30% reduction in input tokens for a heavily-used endpoint compounds quickly.
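
Part of this audit can be automated from request logs. The sketch below assumes a log of tool names the model actually called; `unused_tools` and `tokens_saved` are hypothetical helpers, and the token counts are made-up inputs for illustration.

```python
from collections import Counter


def unused_tools(defined: list[str], call_log: list[str]) -> list[str]:
    """Return defined tools that never appear in the call log."""
    counts = Counter(call_log)
    return sorted(name for name in defined if counts[name] == 0)


def tokens_saved(tokens_before: int, reduction: float, calls: int) -> int:
    """Input tokens saved over `calls` requests after trimming by `reduction`."""
    return round(tokens_before * reduction * calls)


defined = ["search", "calendar", "weather"]
call_log = ["search", "search", "weather"]
candidates = unused_tools(defined, call_log)

# A 30% trim of a 2,000-token prompt across one million calls.
saved = tokens_saved(tokens_before=2_000, reduction=0.30, calls=1_000_000)
```

A tool that never appears in the log is a candidate for removal, not an automatic deletion: check that the log window is long enough to cover rare but important calls.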

Embedding the Right Things

If you're building retrieval, embedding cost can rival generation cost. Reduce by:

  • Embedding only what's queried — don't embed your whole database if 5% gets queried.
  • Reusing embeddings — when documents change, re-embed only the chunks whose content changed, not the whole corpus.
  • Choosing dimensions — many embedding models support truncation; smaller dimensions cost less to store and search.
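
Two of these ideas can be sketched in a few lines: keying embeddings by content hash so unchanged chunks are never re-embedded, and truncating vectors to fewer dimensions. Everything here is an assumption for illustration: `fake_embed` stands in for a real embedding call, and truncation is only meaningful for models that were trained to support it.

```python
import hashlib
import math


def content_key(text: str) -> str:
    """Stable key for a chunk: identical text always maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_changed(chunks: list[str], cache: dict, embed_fn) -> int:
    """Embed only chunks whose content hash is not cached. Returns the number of new embeddings."""
    new = 0
    for chunk in chunks:
        key = content_key(chunk)
        if key not in cache:
            cache[key] = embed_fn(chunk)
            new += 1
    return new


def truncate(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and renormalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call; returns a deterministic 4-d vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


cache: dict = {}
first_run = embed_changed(["intro", "pricing", "faq"], cache, fake_embed)
second_run = embed_changed(["intro", "pricing (updated)", "faq"], cache, fake_embed)
short = truncate([3.0, 4.0, 12.0], 2)
```

On the second run only the edited chunk is embedded; the other two are served from the cache.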

When Cost Caps Become Product Decisions

Hard caps eventually inform product design: rate limits per user, paywalls on heavy features, model-tier selection exposed to the user. Better to plan for these early than retrofit them when the bill arrives.
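
A per-user cap can start as something very small. The sketch below is an in-memory illustration only: `TokenBudget` is a hypothetical class, the limit is an arbitrary number, and a real system would need a daily reset and shared storage across servers.

```python
from collections import defaultdict


class TokenBudget:
    """Track tokens used per user against a fixed daily limit (no reset logic here)."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self.used: dict = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the limit."""
        if self.used[user_id] + tokens > self.daily_limit:
            return False
        self.used[user_id] += tokens
        return True


budget = TokenBudget(daily_limit=1_000)
first_ok = budget.try_spend("u1", 800)    # within the limit
second_ok = budget.try_spend("u1", 300)   # would exceed the limit, so it is rejected
other_ok = budget.try_spend("u2", 300)    # a different user has a separate budget
```

What happens on a rejected spend is the product decision: show an upgrade prompt, fall back to a cheaper model, or queue the request for later.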
