The cheapest path is not always the right path. LLM cost optimization is about cost per accepted outcome, not lowest-possible spend. A cheaper model that causes repair loops or review debt can cost more in total than the model it replaced.
What LLM cost optimization actually means
LLM cost optimization is not a race to the cheapest model. It means reducing cost per accepted output — the combination of model cost, retry cost, review burden, and the engineering time spent on prompts and routing. A workflow that uses a premium model and produces clean, accepted output on the first attempt can be cheaper in total than one that routes to the cheapest option and triggers repeated repair loops.
- Measure cost per accepted task, not cost per API call.
- Include retry cost, review time, and rework in the full picture.
- The cheapest model is the right default only when you have measured its quality as equivalent.
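As a rough illustration of the metric, the sketch below folds model spend, retry spend, and review time into one cost per accepted task. The `RouteStats` fields and every number in the example are hypothetical, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    """Hypothetical per-route totals for one reporting period."""
    accepted: int        # tasks whose output was accepted
    model_cost: float    # API spend on first attempts
    retry_cost: float    # API spend on retries and repair loops
    review_hours: float  # human time spent reviewing or reworking output
    hourly_rate: float   # assumed loaded cost of one reviewer hour


def cost_per_accepted_task(s: RouteStats) -> float:
    """Total cost (model + retries + review) divided by accepted tasks."""
    if s.accepted == 0:
        return float("inf")
    total = s.model_cost + s.retry_cost + s.review_hours * s.hourly_rate
    return total / s.accepted


# Illustrative numbers only: the cheap route has the lower API bill
# but more retries and far more review time.
premium = RouteStats(accepted=950, model_cost=300.0, retry_cost=20.0,
                     review_hours=5.0, hourly_rate=80.0)
cheap = RouteStats(accepted=700, model_cost=60.0, retry_cost=90.0,
                   review_hours=25.0, hourly_rate=80.0)

print(cost_per_accepted_task(premium), cost_per_accepted_task(cheap))
```

With these made-up inputs the cheap route spends less on API calls but more per accepted task, which is the pattern the metric is meant to expose.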
Model routing: match capability to task risk
Routing is the highest-leverage lever for most teams. The principle is simple: not every step needs the strongest model. Classification, extraction, formatting, rewriting, and low-risk planning are common candidates for cheaper models. Judgment-heavy, customer-visible, or high-stakes steps get the stronger route. The saving compounds in agent workflows where dozens of small calls accumulate.
- Start with a two-tier policy: cheap default, expensive on failure or uncertainty.
- Route by task risk and past failure mode, not by habit or convenience.
- Log the routing decision so cost and quality changes are explainable.
- Measure acceptance rate by route before declaring the cheaper model equivalent.
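A two-tier policy with logged decisions might look like the sketch below. It assumes each model call returns a text plus a confidence score between 0 and 1, which real provider APIs do not return in this form. The `HIGH_RISK_TASKS` labels, the `route` function, and the 0.8 threshold are all hypothetical.

```python
import json
import logging
from typing import Callable, Tuple

logger = logging.getLogger("routing")

# Assumed shape of a model call: prompt in, (text, confidence in [0, 1]) out.
ModelFn = Callable[[str], Tuple[str, float]]

# Hypothetical task labels that always take the stronger route.
HIGH_RISK_TASKS = {"customer_reply", "contract_review"}


def route(task_type: str, prompt: str, cheap: ModelFn, strong: ModelFn,
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (output, route). Cheap by default, strong on risk or uncertainty."""
    if task_type in HIGH_RISK_TASKS:
        decision, reason = "strong", "high_risk_task"
        output, _ = strong(prompt)
    else:
        output, confidence = cheap(prompt)
        if confidence >= min_confidence:
            decision, reason = "cheap", "confident"
        else:
            # The cheap attempt is discarded; its cost still counts as retry cost.
            decision, reason = "strong", "low_confidence_escalation"
            output, _ = strong(prompt)
    # Log the decision so later cost and quality shifts can be explained.
    logger.info(json.dumps(
        {"task_type": task_type, "route": decision, "reason": reason}))
    return output, decision
```

Logging the reason next to the route is what lets you later group acceptance rate by route and see whether escalations are doing useful work.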
Receipts: OpenRouter model pricing; Anthropic model overview
Prompt caching: stop paying for repeated context
Providers including Anthropic and OpenAI offer prompt caching, which serves repeated prefix context (system prompts, documents, instruction blocks) from cache at a fraction of the input token price. If your workflow sends the same large prompt prefix across many calls, caching can reduce input costs by 50–90% on those prefixes. Caches match on the prompt prefix, so the practical requirement is that the cached content sits at the start of the prompt, stays identical from call to call, and is safe to reuse across requests.
- Identify large, stable system prompts and instruction blocks first — these are the highest-value cache targets.
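- Check whether the provider's cache TTL fits your request volume: if calls arrive less often than the TTL, most requests miss the cache and the prefix is billed at full price again, or more where cache writes carry a premium.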
- Do not cache content that is personalized, session-specific, or stale-sensitive.
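To see why hit rate matters, here is a back-of-envelope estimate of what a shared prefix costs with and without caching. The read and write multipliers are illustrative assumptions, not any provider's quoted prices; check the current pricing page before relying on them.

```python
def uncached_input_cost(prefix_tokens: int, calls: int,
                        base_price_per_mtok: float) -> float:
    """Cost of sending the same prefix at full input price on every call."""
    return calls * prefix_tokens / 1_000_000 * base_price_per_mtok


def cached_input_cost(prefix_tokens: int, calls: int, hit_rate: float,
                      base_price_per_mtok: float,
                      read_multiplier: float = 0.1,
                      write_multiplier: float = 1.25) -> float:
    """Estimated prefix cost when a fraction `hit_rate` of calls hit the cache.

    Assumption: hits are billed at read_multiplier x base price, and misses
    are billed at write_multiplier x base price (cache write).
    """
    per_call = prefix_tokens / 1_000_000 * base_price_per_mtok
    hits = calls * hit_rate
    misses = calls - hits
    return hits * per_call * read_multiplier + misses * per_call * write_multiplier


# Illustrative: 10k-token prefix, 1,000 calls, $3 per million input tokens.
busy = cached_input_cost(10_000, 1_000, hit_rate=0.9, base_price_per_mtok=3.0)
sparse = cached_input_cost(10_000, 1_000, hit_rate=0.1, base_price_per_mtok=3.0)
full = uncached_input_cost(10_000, 1_000, base_price_per_mtok=3.0)
print(full, busy, sparse)
```

Under these assumed multipliers, a high hit rate cuts the prefix cost sharply, while a low hit rate can cost more than not caching at all.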
Context trimming: cut what the model does not need
Most waste in prompts comes from context that felt safer to include than to think about. Whole file contents, long conversation histories, irrelevant retrieved documents — each adds input tokens on every call. Retrieval replaces bulk loading with targeted chunks. Task decomposition splits a giant context into smaller, cheaper steps. Summarization compresses stale history before it becomes expensive to carry.
- Use retrieval to send targeted chunks rather than every document in the directory.
- Summarize or prune conversation history before it grows beyond the task's actual need.
- Split complex tasks rather than loading a single context window with everything.
- Test whether the model actually uses the context you are sending before assuming it is needed.
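One simple form of history pruning keeps the system message and only the most recent turns that fit a token budget. The sketch below assumes chat messages are dictionaries with `role` and `content` keys. The four-characters-per-token estimate is a crude stand-in for the provider's real tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); an assumption only."""
    return max(1, len(text) // 4)


def prune_history(messages: List[Dict[str, str]],
                  budget: int) -> List[Dict[str, str]]:
    """Keep system messages plus the newest turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Dict[str, str]] = []
    # Walk backwards from the newest turn and stop at the first one that
    # would exceed the budget, so the kept history stays contiguous.
    for m in reversed(turns):
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Dropping old turns outright is the bluntest option. Replacing them with a summary preserves more information at a small extra cost, and the same budget check applies to the summary.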
Output-length control: pay only for what you need
Output tokens cost more per token than input tokens on most provider pricing. If your workflow asks for longer outputs than it actually uses, output-length constraints are a direct cost lever. Structured output formats (JSON, short fields, templates) reduce the model's tendency to fill space with explanation. Explicit length instructions in the prompt set the target; a hard output-token cap is a backstop rather than a substitute, because it truncates the response instead of shortening it.
- Add explicit length guidance to prompts where long outputs are not needed.
- Use structured output schemas to prevent verbose filler in extraction and classification tasks.
- Compare output token counts before and after constraint changes using the same eval set.
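The before-and-after comparison can be as small as the sketch below. It assumes you have recorded output token counts for the same eval inputs under both prompt versions; the function name and the sample counts are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def output_token_change(before: List[int], after: List[int]) -> Dict[str, float]:
    """Compare mean output tokens for the same eval inputs, before vs. after."""
    if not before or len(before) != len(after):
        raise ValueError("both runs must cover the same non-empty eval set")
    mean_before, mean_after = mean(before), mean(after)
    return {
        "mean_before": mean_before,
        "mean_after": mean_after,
        "reduction_pct": round(100 * (mean_before - mean_after) / mean_before, 1),
    }


# Illustrative counts for three eval inputs.
report = output_token_change(before=[400, 600, 500], after=[200, 300, 250])
print(report)
```

A token reduction only counts as a saving if acceptance on the same eval set holds, so read this number next to the quality metrics, not instead of them.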
Batching: defer non-urgent work
Several providers offer batch APIs that process requests asynchronously at a lower per-token price; Anthropic's and OpenAI's batch endpoints are priced at roughly 50% off their synchronous rates. Batch processing is a good fit for classification, enrichment, embedding, summarization, and evaluation jobs where latency does not matter. It is a poor fit for interactive workflows where users wait for a response.
- Route non-interactive jobs — enrichment, classification, eval scoring, report generation — through batch APIs.
- Batch endpoints return results within a completion window measured in hours, not seconds (up to 24 hours is common). Confirm your workflow tolerates this.
- Combine batching with routing: send batch jobs to cheaper models where quality allows.
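Deciding which jobs can go through a batch endpoint is mostly a latency-tolerance check. The sketch below assumes a 24-hour completion window and a 50% discount as defaults. The `Job` fields and the example job names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    interactive: bool      # True if a user is waiting on the response
    max_wait_hours: float  # how long the result can be delayed
    sync_cost: float       # estimated cost at synchronous prices


def split_for_batch(jobs: List[Job],
                    batch_window_hours: float = 24.0) -> Tuple[List[Job], List[Job]]:
    """Return (batch_jobs, sync_jobs) based on latency tolerance."""
    batch, sync = [], []
    for job in jobs:
        eligible = (not job.interactive
                    and job.max_wait_hours >= batch_window_hours)
        (batch if eligible else sync).append(job)
    return batch, sync


def estimated_total(batch: List[Job], sync: List[Job],
                    batch_discount: float = 0.5) -> float:
    """Total cost assuming batch jobs get `batch_discount` off sync prices."""
    return (sum(j.sync_cost for j in batch) * (1 - batch_discount)
            + sum(j.sync_cost for j in sync))


jobs = [
    Job("nightly_enrichment", interactive=False, max_wait_hours=48, sync_cost=100.0),
    Job("eval_scoring", interactive=False, max_wait_hours=24, sync_cost=40.0),
    Job("chat_reply", interactive=True, max_wait_hours=0, sync_cost=60.0),
    Job("hourly_alert_summary", interactive=False, max_wait_hours=1, sync_cost=20.0),
]
batch_jobs, sync_jobs = split_for_batch(jobs)
print(estimated_total(batch_jobs, sync_jobs))
```

Note that a non-interactive job can still be a poor batch candidate if its deadline is shorter than the completion window, as with the hourly summary here.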
Retrieval to cut context costs
Retrieval-augmented generation (RAG) is one of the most direct ways to reduce context costs: instead of loading a large document corpus into every prompt, the retrieval layer finds and sends only the relevant chunks. The savings compound across agent workflows where tool calls would otherwise reload the same documents on every step. The trade-off is added retrieval latency and a dependence on the quality of the retrieval system itself.
- Use vector search or hybrid retrieval to send targeted passages rather than full documents.
- Evaluate retrieval recall before relying on it for quality-sensitive tasks.
- Cache retrieval results for repeated queries to avoid redundant vector lookups.
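The sketch below shows the shape of top-k retrieval with cached results. To stay self-contained it scores passages by keyword overlap, a deliberately simple stand-in for vector or hybrid search. The `PASSAGES` corpus is hypothetical.

```python
import re
from functools import lru_cache
from typing import Set, Tuple

# Hypothetical corpus; in practice these would be chunks in a vector store.
PASSAGES: Tuple[str, ...] = (
    "Refund policy: a refund is issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days within the EU.",
    "Support hours are 9am to 5pm on weekdays.",
)


def _words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query: str, passage: str) -> int:
    """Keyword overlap; a stand-in for embedding similarity."""
    return len(_words(query) & _words(passage))


@lru_cache(maxsize=1024)
def retrieve(query: str, k: int = 2) -> Tuple[str, ...]:
    """Return the k best-scoring passages; repeated queries hit the cache."""
    ranked = sorted(PASSAGES, key=lambda p: overlap_score(query, p), reverse=True)
    return tuple(ranked[:k])


def build_prompt(question: str) -> str:
    """Send only the retrieved passages, not the whole corpus."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

An in-process `lru_cache` only helps with exact repeats inside one process and has to be cleared when the corpus changes, so treat it as the simplest possible version of the idea.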
Eval-gating: do not over-pay for quality you do not need
Every optimization that routes work to a cheaper model or shorter context risks quality regression. The control that makes routing safe is an eval set: a representative sample of inputs with known-good outputs that you can run before and after any routing change. Without evals, a routing decision that looks like a cost win might be hiding acceptance rate drops, more human edits, or higher escalation rates.
- Build a small eval set for each workflow before changing models or prompts.
- Track acceptance rate, edit rate, and escalation rate by route alongside cost.
- Do not approve a routing change that saves 20% on tokens but costs 30% more in rework.
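An eval gate can be reduced to one comparison: total cost per accepted task on the candidate route against the baseline, with a floor on acceptance rate. The sketch below is one way to write it. The `EvalResult` fields, the two-point tolerance, and the example figures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-task averages from one run of the eval set."""
    token_cost: float       # model and retry spend per task
    rework_cost: float      # human edit and escalation cost per task
    acceptance_rate: float  # fraction of outputs accepted, in (0, 1]


def cost_per_accepted(r: EvalResult) -> float:
    if r.acceptance_rate <= 0:
        return float("inf")
    return (r.token_cost + r.rework_cost) / r.acceptance_rate


def approve_route_change(baseline: EvalResult, candidate: EvalResult,
                         max_acceptance_drop: float = 0.02) -> bool:
    """Approve only if quality holds and total cost per accepted task falls."""
    if baseline.acceptance_rate - candidate.acceptance_rate > max_acceptance_drop:
        return False
    return cost_per_accepted(candidate) < cost_per_accepted(baseline)


baseline = EvalResult(token_cost=1.00, rework_cost=1.00, acceptance_rate=0.95)
# 20% cheaper on tokens, 30% more rework: rejected.
risky = EvalResult(token_cost=0.80, rework_cost=1.30, acceptance_rate=0.94)
# 40% cheaper on tokens with rework and acceptance unchanged: approved.
safe = EvalResult(token_cost=0.60, rework_cost=1.00, acceptance_rate=0.95)

print(approve_route_change(baseline, risky), approve_route_change(baseline, safe))
```

The gate is only as good as the eval set behind it, so the inputs should be refreshed when the workflow's traffic mix changes.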
Where to start: the two-question audit
Before implementing any optimization, answer two questions: which workflow costs the most, and what are the tokens in that workflow actually doing? The answer almost always points to one lever — bloated context, expensive routing for a low-risk step, repeated uncached calls, or an agent loop without a stop condition. Start there, measure, then consider a second lever.
- Sort workflows by total spend and look at the top three.
- Pull a trace from the most expensive workflow and label each step: context load, model call, retry, tool use, output.
- Pick the lever that addresses the biggest waste without a quality trade-off you cannot measure.
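The audit itself needs very little tooling. The sketch below assumes you can export spend per workflow and a labelled trace as plain Python data. The workflow names, step labels, and token counts are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_workflows(spend: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Question 1: which workflows cost the most?"""
    return sorted(spend.items(), key=lambda kv: kv[1], reverse=True)[:n]


def tokens_by_step(trace: List[Dict[str, object]]) -> List[Tuple[str, int]]:
    """Question 2: what are the tokens in this workflow actually doing?"""
    totals: Counter = Counter()
    for step in trace:
        totals[str(step["label"])] += int(step["tokens"])
    return totals.most_common()


# Hypothetical monthly spend per workflow, in dollars.
spend = {"support_agent": 1200.0, "doc_summarizer": 300.0,
         "classifier": 90.0, "report_gen": 450.0}

# Hypothetical labelled trace from the most expensive workflow.
trace = [
    {"label": "context load", "tokens": 9000},
    {"label": "model call", "tokens": 1500},
    {"label": "retry", "tokens": 3000},
    {"label": "context load", "tokens": 9000},
]

print(top_workflows(spend))
print(tokens_by_step(trace))
```

In this invented trace the repeated context load dominates, which would point to caching or trimming before any change of model.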
Frequently asked questions
What is LLM cost optimization?
LLM cost optimization means reducing what you spend per accepted AI output. The main levers are model routing (send work to a cheaper model where quality holds), prompt caching (stop paying for repeated context), context trimming (load only what the model needs), batching (async processing at lower rates), and eval-gating (confirm quality before committing to a cheaper route).
Does cheaper always mean better for LLM cost optimization?
No. A cheaper model that causes more retries, more human edits, or more escalations can cost more in total than the model it replaced. The right metric is cost per accepted task, not cost per token.
What is prompt caching and does it reduce LLM costs?
Prompt caching lets you reuse repeated prefix context — system prompts, large instruction blocks, document context — at a fraction of the normal input token price. Both Anthropic and OpenAI offer this. If your workflow sends the same large prompt prefix across many calls, caching can cut input costs by 50–90% on those prefixes.
What is model routing in the context of LLM cost optimization?
Model routing means choosing which model to call for each step of a workflow instead of always using the same default. Cheap models handle low-risk steps; stronger models are reserved for hard or high-stakes steps. The saving compounds in agentic workflows where dozens of calls accumulate.
How do batch APIs reduce LLM costs?
Batch APIs process requests asynchronously and charge roughly half the per-token price of synchronous calls. They are good for classification, enrichment, evaluation scoring, and report generation where latency does not matter. They are a poor fit for interactive user-facing workflows.
What does eval-gating mean for LLM cost optimization?
Eval-gating means running a quality check before committing to a cheaper routing or prompt change. It prevents a cost optimization that looks like a win on the invoice from hiding a drop in acceptance rate, an increase in human edits, or a rise in escalations.
