Token reduction is only a win when the share of output that gets accepted holds steady. The target is not smaller prompts for their own sake; it is less repeated, irrelevant, or repair-heavy work.
Reduce context before ambition
Most waste starts with context discipline. Teams send whole files, long histories, and irrelevant documents because it feels safer than retrieval or task decomposition. The result is expensive calls that are harder to inspect.
- Split tasks before sending giant context windows.
- Use retrieval to send targeted chunks rather than every document.
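As a rough illustration of sending targeted chunks instead of every document, the sketch below ranks chunks by word overlap with the query and stops adding them once a token budget is reached. The names `estimate_tokens` and `select_chunks` are hypothetical, the word-count token estimate is a crude assumption, and a real system would more likely use embeddings and the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    # A real tokenizer will give different counts.
    return len(text.split())


def select_chunks(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Return the most query-relevant chunks that fit inside token_budget."""
    query_terms = set(query.lower().split())

    scored = []
    for index, chunk in enumerate(chunks):
        overlap = len(query_terms & set(chunk.lower().split()))
        if overlap > 0:  # drop chunks that share no words with the query
            scored.append((overlap, index, chunk))

    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected: list[str] = []
    used = 0
    for _, _, chunk in scored:
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


if __name__ == "__main__":
    docs = [
        "refund policy allows returns within 30 days",
        "office parking rules",
        "refund requests need an order number",
    ]
    print(select_chunks("refund policy", docs, token_budget=10))
```

The budget is the part worth keeping even if the scoring changes: it makes the size of every call an explicit decision rather than a side effect of what happened to be available.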
Route simple work down
Not every step needs the strongest model. Classification, extraction, formatting, low-risk planning, and validation are common candidates for cheaper routes once evals prove the quality bar holds.
- Route by task risk, not by habit.
- Keep a fallback path when confidence is low.
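A minimal sketch of risk-based routing with a fallback might look like the following. Everything here is illustrative: `ModelResult`, `route`, the `LOW_RISK_TASKS` set, and the 0.8 confidence threshold are assumptions, and the two model arguments are plain callables standing in for whatever client a team actually uses. The confidence score is assumed to come from the cheaper route itself or from a separate validator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelResult:
    text: str
    confidence: float  # assumed to be in [0, 1]


# Hypothetical set of task types that evals have cleared for the cheap route.
LOW_RISK_TASKS = {"classification", "extraction", "formatting", "validation"}

Model = Callable[[str], ModelResult]


def route(
    task_type: str,
    prompt: str,
    cheap_model: Model,
    strong_model: Model,
    min_confidence: float = 0.8,
) -> tuple[str, ModelResult]:
    """Pick a route by task risk and fall back when confidence is low."""
    if task_type not in LOW_RISK_TASKS:
        return "strong", strong_model(prompt)

    result = cheap_model(prompt)
    if result.confidence < min_confidence:
        # Fallback path: the cheap answer is discarded, not patched.
        return "strong-fallback", strong_model(prompt)
    return "cheap", result
```

Returning the route name alongside the result makes it possible to count how often the fallback fires, which is the number that shows whether the cheaper route is actually saving anything.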
Stop paying for repeated work
Semantic caching, prompt normalization, deterministic pre-processing, and saved intermediate results can prevent teams from generating the same expensive answer again and again.
- Start with the most repeated expensive calls.
- Cache only where freshness and permissions are understood.
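The sketch below shows one way prompt normalization, freshness, and permissions could fit together in a cache. It is an exact-match cache on normalized prompts, not a semantic cache; semantic matching would need embeddings and a similarity threshold. `ResponseCache`, `permission_scope`, and the lowercase-and-collapse-whitespace normalization are all assumptions for illustration, and lowercasing is only safe where prompts are not case-sensitive.

```python
import hashlib
import time
from typing import Callable


class ResponseCache:
    """Exact-match cache keyed on normalized prompt plus permission scope."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so freshness can be tested
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def normalize(prompt: str) -> str:
        # Assumption: case and extra whitespace do not change the meaning.
        return " ".join(prompt.lower().split())

    def _key(self, prompt: str, permission_scope: str) -> str:
        # The scope is part of the key so one group's answer is never
        # served to another group with different access.
        raw = f"{permission_scope}\n{self.normalize(prompt)}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute(
        self, prompt: str, permission_scope: str, compute: Callable[[str], str]
    ) -> str:
        key = self._key(prompt, permission_scope)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no model call
        answer = compute(prompt)  # miss or stale: pay for the call once
        self._entries[key] = (now, answer)
        return answer
```

A short `ttl_seconds` suits answers that depend on changing data, while stable transformations can tolerate a much longer one; the right value depends on how stale an answer can be before it counts as wrong.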
Constrain agents
Agents need explicit budgets: step limits, stop conditions, retry caps, tool budgets, and escalation rules. Otherwise a vague task can become a long trace that looks busy while it burns through model calls.
- Require a stopping reason on each trace.
- Alert on retry loops and long-running tasks.
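One way to picture explicit budgets is a loop that cannot exit without recording why it stopped. In this sketch `Budget`, `Trace`, and `run_agent` are hypothetical names, and `step` is a stand-in callable that returns "done", "continue", or "error" for each step; a real agent would also track tool calls and escalation, which are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_steps: int = 10
    max_retries: int = 2


@dataclass
class Trace:
    steps: int = 0
    retries: int = 0
    stop_reason: str = ""  # always set before run_agent returns


def run_agent(step: Callable[[int], str], budget: Budget) -> Trace:
    """Run step(n) until it finishes or a budget is exhausted."""
    trace = Trace()
    while True:
        if trace.steps >= budget.max_steps:
            trace.stop_reason = "step_limit"
            return trace

        trace.steps += 1
        outcome = step(trace.steps)

        if outcome == "done":
            trace.stop_reason = "completed"
            return trace
        if outcome == "error":
            # Retries are counted across the whole run, not per step.
            trace.retries += 1
            if trace.retries > budget.max_retries:
                trace.stop_reason = "retry_cap"
                return trace
        # Any other outcome is treated as "continue".
```

Because every exit path sets `stop_reason`, traces that end in "step_limit" or "retry_cap" can be counted and alerted on separately from ones that end in "completed".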