Guide

Agent Token Burn Explained

Why AI agents can spend tokens unpredictably, and how teams can control long-running coding, research, and tool-using workflows.

Updated 2026-05-12
agents / coding-agents / token-consumption
Desk note

Agent spend is different because the model is not called once. It plans, reads, calls tools, retries, summarizes, and sometimes loops. The trace is the unit of accountability.

Agents multiply calls

A normal prompt may be one request and one response. An agent may plan, inspect files, call tools, revise, retry, and summarize. Each step adds its own token cost, and later steps often resend the context accumulated by earlier ones.

  • Count calls per task, not only tokens per call.
  • Preserve step order so the trace is reviewable.
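
The two points above can be sketched as a small per-task trace record. This is a minimal illustration, not a reference implementation: `Step`, `TaskTrace`, the step kinds, and the token figures are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int          # position in the trace, assigned in call order
    kind: str           # hypothetical labels: "plan", "tool", "summarize", ...
    input_tokens: int
    output_tokens: int


@dataclass
class TaskTrace:
    task_id: str
    steps: list[Step] = field(default_factory=list)

    def record(self, kind: str, input_tokens: int, output_tokens: int) -> Step:
        # Appending in call order keeps the trace reviewable step by step.
        step = Step(len(self.steps), kind, input_tokens, output_tokens)
        self.steps.append(step)
        return step

    @property
    def calls(self) -> int:
        # Calls per task, not only tokens per call.
        return len(self.steps)

    @property
    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.steps)


# Illustrative numbers only.
trace = TaskTrace("example-task")
trace.record("plan", 1200, 300)
trace.record("tool", 1800, 150)
trace.record("summarize", 2400, 400)
print(trace.calls, trace.total_tokens)  # 3 6250
```

Keeping the call count and the ordered steps on the same object means a reviewer can see both how many calls a task took and where in the sequence the tokens went.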

Context grows over time

When agents carry too much history or irrelevant file context, every new step becomes more expensive before the model writes a useful answer. Context hygiene matters more as the trace gets longer.

  • Summarize or prune trace state deliberately.
  • Retrieve files by task instead of loading broad directories.
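
A rough arithmetic sketch shows why pruning matters. It assumes a simplified model in which every step resends the full context and adds a fixed number of tokens to it; real agents vary, and the function name, parameters, and figures below are hypothetical.

```python
from typing import Optional


def cumulative_input_tokens(
    steps: int,
    base: int,
    added_per_step: int,
    cap: Optional[int] = None,
) -> int:
    """Total input tokens sent across all steps of one task.

    Assumption: each step sends the whole current context, then the
    context grows by `added_per_step`. If `cap` is set, the context is
    pruned back to at most `cap` tokens before each call.
    """
    total = 0
    context = base
    for _ in range(steps):
        if cap is not None:
            context = min(context, cap)
        total += context
        context += added_per_step
    return total


# Illustrative numbers: 20 steps, 2,000-token start, 1,500 tokens added per step.
unpruned = cumulative_input_tokens(20, 2000, 1500)
pruned = cumulative_input_tokens(20, 2000, 1500, cap=8000)
print(unpruned, pruned)  # 325000 145000
```

Under these assumptions the unpruned total grows roughly with the square of the step count, while the capped total grows linearly once the cap is reached.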

Retries are hidden spend

A failed edit, malformed tool call, ambiguous instruction, or flaky API can trigger repeated attempts. From the outside it looks like progress; inside the trace it is often a cost leak.

  • Alert on repeated tool errors.
  • Cap retries and require a new plan after failure.
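
One way to express both controls is a wrapper that caps attempts, raises an alert when the same error repeats, and hands control back to the planner once the cap is hit. This is a sketch under stated assumptions: `call_with_retry_cap`, `RetryBudgetExceeded`, and the alert callback are hypothetical names, not part of any particular agent framework.

```python
from typing import Callable


class RetryBudgetExceeded(Exception):
    """Raised when the retry cap is hit; the caller should replan, not retry."""


def call_with_retry_cap(
    tool: Callable[[], str],
    max_attempts: int = 3,
    on_alert: Callable[[str], None] = print,
) -> str:
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except Exception as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
            # The same error twice in a row is treated as a cost leak signal.
            if len(errors) >= 2 and errors[-1] == errors[-2]:
                on_alert(f"repeated tool error on attempt {attempt}: {errors[-1]}")
    raise RetryBudgetExceeded(
        f"{max_attempts} attempts failed; a new plan is required"
    )


# Usage with a hypothetical tool that always fails the same way.
alerts: list[str] = []


def always_fails() -> str:
    raise ValueError("malformed arguments")


try:
    call_with_retry_cap(always_fails, max_attempts=3, on_alert=alerts.append)
except RetryBudgetExceeded as exc:
    print(exc)
```

Raising a distinct exception rather than returning a failure value makes it harder for the calling loop to retry silently.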

Controls that work

The useful controls are concrete: task budgets, step limits, model routing, trace review, evals, and human-in-the-loop checkpoints for high-risk work. A controlled agent should be able to explain why it continued, why it stopped, and what output was accepted.

  • Use cheaper models for low-risk subtasks only after evals.
  • Report accepted task rate alongside spend.
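
A task budget with a step limit, plus a report that pairs spend with accepted task rate, might look like the sketch below. The class, function, and field names are hypothetical, and the limits and token counts are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskBudget:
    max_tokens: int
    max_steps: int
    tokens_used: int = 0
    steps_used: int = 0

    def charge(self, tokens: int) -> None:
        # Call once per model or tool step.
        self.tokens_used += tokens
        self.steps_used += 1

    def stop_reason(self) -> Optional[str]:
        # Returning a reason lets the trace record why the agent stopped.
        if self.steps_used >= self.max_steps:
            return "step limit reached"
        if self.tokens_used >= self.max_tokens:
            return "token budget exhausted"
        return None


def spend_report(tasks: list[tuple[int, bool]]) -> dict:
    """Summarize (tokens_spent, accepted) pairs, one pair per task."""
    total = sum(tokens for tokens, _ in tasks)
    accepted = sum(1 for _, ok in tasks if ok)
    return {
        "total_tokens": total,
        "accepted_task_rate": accepted / len(tasks) if tasks else 0.0,
        "tokens_per_accepted_task": total / accepted if accepted else None,
    }


budget = TaskBudget(max_tokens=10_000, max_steps=5)
budget.charge(4_000)
print(budget.stop_reason())  # None
budget.charge(7_000)
print(budget.stop_reason())  # token budget exhausted

report = spend_report([(4_000, True), (11_000, False), (5_000, True)])
print(report["tokens_per_accepted_task"])  # 10000.0
```

Dividing total spend by accepted tasks, rather than by all tasks, keeps rejected work visible as cost instead of hiding it in an average.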
Source trail

Current feed records connected to this guide

news

5 Best Model Routing Platforms for AI Agent Systems

Augment Code rounds up model routing options for agent systems - tools that decide which model to call per step to balance quality, latency, and cost.

tokenmaxxing / agents / token-consumption

guide

Multi-Agent Cost Compounding: Why 3 Agents Cost 10x

Augment Code breaks down why adding agents can explode costs: orchestration overhead, context handoffs, retries, and verification loops often dominate raw model pricing.

tokenmaxxing / agents / token-consumption

news

Anthropic tightens limits on Claude subscriptions - Axios

Axios reports Anthropic is tightening what paid Claude subscribers can do, shifting heavy third-party agent usage behind a separate credit meter.

tokenmaxxing / coding-agents / agents

Project layer

Tools that make the guide operational

#4 In spirit
Agents

LangGraph

langchain-ai/langgraph

A framework for building resilient stateful agents with explicit graphs, persistence, human-in-the-loop flows, and controllable execution.

32.6K / 5.5K / MIT
agents / state / workflows

#15 In spirit
Agents

Zep

getzep/zep

A memory layer and integration collection for AI agents and knowledge-graph-backed language-model applications.

4.6K / 627 / Apache-2.0
memory / agents / knowledge-graph

#1 Direct
Routing

LiteLLM

BerriAI/litellm

An OpenAI-compatible gateway and SDK for calling many model providers with budgets, logging, load balancing, guardrails, and cost tracking.

47.8K / 8.2K / Source-available
gateway / cost-tracking / routing