The bill arrived and nobody could explain it. We had shipped one agent into one workflow, the demo cost a few cents a run, and three weeks later finance was asking why the line item had grown a digit. The usual suspects were innocent. Traffic was flat. The model price hadn’t changed. What changed was that the thing was now doing real work, and real work means loops.
Here is the part nobody tells you when they sell you on agents. A single prompt has a cost you can hold in your head: tokens in, tokens out, done. An agent has a cost that compounds. It reads context, decides, calls a tool, reads the result, decides again, and every one of those turns drags the entire conversation so far back through the model. The transcript is the unit of work, and the transcript grows every step. You did not buy a more expensive prompt. You bought a different cost shape.
The model that actually predicts the bill
Forget per-token pricing for a minute. It tells you almost nothing about an agent. The number that matters is cost per task, and a task is a loop.
# Cost of one agent task, written the way I reason about it on a whiteboard.
# Not the way a billing dashboard slices it (per-token), which hides the loop.
PRICE_IN = 3.00 / 1_000_000    # $ per input token, whatever your model charges
PRICE_OUT = 15.00 / 1_000_000  # $ per output token, output is the expensive side

def cost_per_task(steps, base_context, growth_per_step, out_per_step):
    total = 0.0
    context = base_context  # the system prompt + tools + task. paid EVERY step.
    for _ in range(steps):
        # the whole transcript so far is the input on every single turn.
        # this is the line people forget. step 8 pays for steps 1 through 7 again.
        total += context * PRICE_IN
        total += out_per_step * PRICE_OUT
        context += growth_per_step + out_per_step  # tool results + the model's own reply
    return total

# a "small" agent: 6 steps, a fat 12k-token system prompt and tool schema,
# each tool dump adds ~2k tokens, model writes ~400 tokens a turn.
print(cost_per_task(steps=6, base_context=12_000,
                    growth_per_step=2_000, out_per_step=400))
Run that and the number is bigger than your intuition: about 36 cents at these example prices. Not because any single call is expensive, but because of the accumulation inside the loop. The base context (the system prompt, the tool definitions, and the task) is paid on every step. A twelve-thousand-token tool schema is cheap once and brutal six times. The transcript growth makes it worse: each step appends the tool’s output and the model’s reasoning, so the last step pays for everything that came before it. Total input is the base context times the number of steps, plus a growth term that scales with the square of the steps. At six steps the base re-reads still dominate. As the loop gets longer the squared term takes over, so the cost of a task ends up closer to quadratic in steps than linear, and that is the whole surprise in one sentence.
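To see where the squared term comes from, the same loop can be split into the part that grows linearly with steps and the part that grows with the square. This is a minimal sketch using the same illustrative prices as the function above. The names `input_tokens_split` and `cost_closed_form` are mine, and the 30-step run is a made-up comparison point.

```python
# Illustrative prices only; substitute your model's real rates.
PRICE_IN = 3.00 / 1_000_000    # $ per input token
PRICE_OUT = 15.00 / 1_000_000  # $ per output token


def input_tokens_split(steps, base_context, growth_per_step, out_per_step):
    """Split a task's total input tokens into a linear and a quadratic part.

    Every step re-sends the base context (linear in steps) plus everything
    appended so far (a triangular sum, so quadratic in steps).
    """
    per_step_growth = growth_per_step + out_per_step
    linear = steps * base_context
    quadratic = per_step_growth * steps * (steps - 1) // 2
    return linear, quadratic


def cost_closed_form(steps, base_context, growth_per_step, out_per_step):
    """Same total as the step-by-step loop, computed without iterating."""
    linear, quadratic = input_tokens_split(
        steps, base_context, growth_per_step, out_per_step)
    output_tokens = steps * out_per_step
    return (linear + quadratic) * PRICE_IN + output_tokens * PRICE_OUT


for steps in (6, 30):
    linear, quadratic = input_tokens_split(steps, 12_000, 2_000, 400)
    cost = cost_closed_form(steps, 12_000, 2_000, 400)
    print(f"{steps} steps: base re-reads={linear:,} tokens, "
          f"accumulated transcript={quadratic:,} tokens, cost=${cost:.2f}")
```

With these inputs, five times the steps costs roughly twelve times as much, and by thirty steps the accumulated transcript is almost three times the size of the base re-reads.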
Then multiply. Cost per task times tasks per day is your run rate, and that part is honest arithmetic. The dishonest part lives upstream, in two multipliers people leave out of the estimate entirely.
The two multipliers that wreck the estimate
Retries. Agents fail mid-loop: a tool times out, the model returns malformed arguments, a guardrail rejects an action. The sensible design retries. Fine. But a retry is not a cheap re-poke of the failed step. Depending on how you built it, a retry can replay a chunk of the transcript or restart the loop, and now your six-step task is a nine-step task that pays the accumulation tax again. I have seen a “ten percent failure rate” turn into a forty percent cost premium because the retries were the expensive steps, the late ones, with the fattest context.
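A rough way to see why late retries hurt more is to price a single failure at different points in the loop. This is a sketch under stated assumptions: the prices are illustrative, the failed attempt is billed in full, and `step_cost` and `task_cost_with_retry` are hypothetical helpers rather than anything from a real library.

```python
PRICE_IN = 3.00 / 1_000_000    # illustrative $ per input token
PRICE_OUT = 15.00 / 1_000_000  # illustrative $ per output token


def step_cost(step_index, base_context, growth_per_step, out_per_step):
    """Cost of running one step, given how much transcript precedes it."""
    context = base_context + step_index * (growth_per_step + out_per_step)
    return context * PRICE_IN + out_per_step * PRICE_OUT


def task_cost_with_retry(steps, failed_step, restart, **shape):
    """Cost of a task in which `failed_step` (0-based) fails exactly once.

    restart=False: only the failed step is re-run, at its own context size.
    restart=True: the loop starts over, so steps 0..failed_step are paid twice.
    """
    clean = sum(step_cost(i, **shape) for i in range(steps))
    if restart:
        wasted = sum(step_cost(i, **shape) for i in range(failed_step + 1))
    else:
        wasted = step_cost(failed_step, **shape)
    return clean + wasted


shape = dict(base_context=12_000, growth_per_step=2_000, out_per_step=400)
clean = sum(step_cost(i, **shape) for i in range(6))
for failed in (0, 5):
    for restart in (False, True):
        cost = task_cost_with_retry(6, failed, restart, **shape)
        print(f"fail at step {failed}, restart={restart}: "
              f"+{(cost / clean - 1) * 100:.0f}% over a clean run")
```

In this toy model, re-running the last step costs nearly twice what re-running the first one does, and a restart triggered on the last step doubles the task. Which of those your agent does is a design decision worth checking.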
Re-reading. This is the quiet one. An agent that loses the thread re-fetches a document it already saw, re-lists a directory, re-asks for the same data because the relevant turn scrolled out of its working attention or got summarized away. Every re-read is fresh tokens in and a longer transcript out. You are paying the model to recover state it already had. (We once found an agent re-reading the same config file eleven times in a single task because nothing told it that it already knew the answer.)
So the real model is uglier than tasks times cost-per-task. It is tasks, times cost-per-task, times a retry multiplier, times a re-read multiplier. Both multipliers are silent in the demo and load-bearing in production.
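Written out, the uglier model is still one line of arithmetic. The sketch below assumes both multipliers are measured from production traces. The task volume and the re-read multiplier are invented for illustration, and `daily_run_rate` is a hypothetical name.

```python
def daily_run_rate(tasks_per_day, cost_per_task, retry_multiplier=1.0,
                   reread_multiplier=1.0):
    """Daily spend = tasks x cost-per-task x retry x re-read multipliers.

    A multiplier of 1.0 means no waste at all, which is what a demo
    implicitly assumes. Values below 1.0 make no sense and are rejected.
    """
    if retry_multiplier < 1.0 or reread_multiplier < 1.0:
        raise ValueError("multipliers must be >= 1.0")
    return tasks_per_day * cost_per_task * retry_multiplier * reread_multiplier


# Made-up volume; 0.36 is the six-step example task priced earlier.
demo = daily_run_rate(tasks_per_day=5_000, cost_per_task=0.36)
prod = daily_run_rate(tasks_per_day=5_000, cost_per_task=0.36,
                      retry_multiplier=1.4, reread_multiplier=1.25)
print(f"demo estimate: ${demo:,.0f}/day, with multipliers: ${prod:,.0f}/day")
```

The point of writing it down is that both multipliers become numbers someone has to go and measure, instead of staying invisible.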
The levers, and the one that gets people fired
Now the useful part. You can bring this down a lot, and most of the room is in architecture, not in waiting for prices to drop.
Caching the stable prefix is the single highest-return move and the easiest to miss. Your system prompt and tool schema are identical on every step of every task. If your provider supports prompt caching, that fat unchanging prefix gets billed at a fraction of the normal input rate on the repeated reads, and given how many times the loop re-reads it, that is a large share of your input cost handed back to you. Structure the prompt so the stable part is genuinely stable and the variable part is at the end. People put a timestamp at the top of the system prompt and quietly destroy their own cache hit rate.
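To size the caching lever for your own agent, it helps to price the input side with and without a cached prefix. This is a simplified sketch. The `cached_read_fraction` of 0.1 is an assumption, not any provider's published rate. It also assumes the first step pays full price to populate the cache, every later step hits it, and nothing but the stable prefix is cached.

```python
PRICE_IN = 3.00 / 1_000_000  # illustrative $ per input token


def input_cost(steps, stable_prefix, variable_start, growth_per_step,
               cached_read_fraction=None):
    """Input-side cost of one task, with or without a cached stable prefix.

    cached_read_fraction is the share of the normal input price charged for
    a cache hit on the prefix. None means no caching at all.
    """
    total = 0.0
    variable = variable_start
    for step in range(steps):
        if cached_read_fraction is None or step == 0:
            prefix_rate = PRICE_IN  # full price: no cache, or first write
        else:
            prefix_rate = PRICE_IN * cached_read_fraction
        total += stable_prefix * prefix_rate + variable * PRICE_IN
        variable += growth_per_step
    return total


uncached = input_cost(6, stable_prefix=12_000, variable_start=0,
                      growth_per_step=2_400)
cached = input_cost(6, stable_prefix=12_000, variable_start=0,
                    growth_per_step=2_400, cached_read_fraction=0.1)
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}, "
      f"saved {(1 - cached / uncached) * 100:.0f}%")
```

A timestamp at the top of the prefix is equivalent to passing `cached_read_fraction=None`: every step looks like a new prefix, and the discount never applies.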
Cheaper models for sub-steps. The orchestrating model does not need to be the model that summarizes a tool result or classifies which tool to call next. Route the cheap, narrow, high-frequency sub-steps to a small fast model and keep the expensive frontier model for the reasoning that actually needs it. This is the same instinct as putting your hot path on cheaper hardware, and after caching it is one of the best returns per hour of work.
# Pick the model by the shape of the step, not by habit.
# The reasoning step earns the expensive model. The plumbing does not.
def model_for(step_kind: str) -> str:
    if step_kind in ("plan", "decide_next_tool", "final_answer"):
        return "frontier"  # judgment lives here. pay for it.
    if step_kind in ("summarize_tool_output", "classify", "extract_field"):
        return "small"     # a narrow task a cheap model does fine. and fast.
    return "frontier"      # when unsure, don't get clever with correctness

def summarize(turns):
    # placeholder so the sketch runs: in practice this is a call to the small
    # model that boils old turns down to their decisions and results.
    return [f"[summary of {len(turns)} earlier turns]"]

def trim_context(transcript, keep_last=8):
    # the agent does not need a verbatim replay of every tool dump from 30 steps ago.
    # keep recent turns whole, compress the old ones to their decisions and results.
    # this is where the quadratic gets bent back toward linear.
    head = summarize(transcript[:-keep_last]) if len(transcript) > keep_last else []
    return head + transcript[-keep_last:]
Context trimming is how you fight the accumulation directly. The agent rarely needs the verbatim text of a tool result from twenty steps ago. It needs the decision it made and the fact it learned. Summarize the old turns down to their conclusions, keep the recent turns intact, and the per-step context stops climbing toward the ceiling. Done carelessly this is also how you lobotomize the agent: trim the wrong thing and it forgets why it is doing the task, starts over, and re-reads everything, which costs more than you saved. Trim by relevance, never blindly by age.
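One way to trim by relevance, not by age alone, is to let some turns opt out of compression. The sketch below assumes each turn is a dict with a hypothetical `pinned` flag marking what the agent must never lose, such as the task statement or a key decision. The default summarizer is a crude placeholder for a cheap-model call.

```python
def _placeholder_summary(turns):
    """Stand-in for a real summarizer (for example a small-model call)."""
    return f"[{len(turns)} older turns condensed]"


def trim_by_relevance(transcript, keep_last=8, summarize=_placeholder_summary):
    """Keep pinned and recent turns whole; compress the rest into one turn.

    Each turn is a dict like {"text": str, "pinned": bool}. Pinned turns
    survive verbatim no matter how old they are. Unpinned turns older than
    the last `keep_last` are replaced by a single summary turn.
    """
    cutoff = max(len(transcript) - keep_last, 0)
    old, recent = transcript[:cutoff], transcript[cutoff:]
    pinned = [turn for turn in old if turn.get("pinned")]
    droppable = [turn for turn in old if not turn.get("pinned")]
    summary = ([{"text": summarize(droppable), "pinned": False}]
               if droppable else [])
    return pinned + summary + recent


transcript = [{"text": "TASK: migrate the config", "pinned": True}]
transcript += [{"text": f"tool dump {i}", "pinned": False} for i in range(20)]
trimmed = trim_by_relevance(transcript, keep_last=4)
print(len(transcript), "turns ->", len(trimmed), "turns")
```

The pinned turns are moved ahead of the summary, which changes their order slightly. If order matters to your agent, that is worth handling differently. Deciding what gets pinned is the real work, and the flag only gives that decision a place to live.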
Step limits are the seatbelt. A hard cap on steps per task does not make the agent smarter; it makes the failure bounded. Without it, a confused agent will loop until something external stops it, and the steps it burns while confused are the most expensive ones it will ever run. Cap it, and when it hits the cap, fail loudly to a human instead of silently to the invoice.
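The cap itself is a few lines. This is a minimal sketch, assuming a hypothetical `step_fn` that takes the current state and returns `(new_state, done)`. The default of 12 steps is arbitrary, and what "escalate to a human" means in practice is up to your system.

```python
class StepLimitExceeded(RuntimeError):
    """Raised when a task burns its step budget without finishing."""


def run_with_cap(step_fn, state, max_steps=12):
    """Drive an agent loop, but never past max_steps.

    Returns (final_state, steps_used) if the agent reports done in time.
    Raises StepLimitExceeded otherwise, so the failure is loud and bounded.
    """
    for step in range(1, max_steps + 1):
        state, done = step_fn(state)
        if done:
            return state, step
    raise StepLimitExceeded(
        f"task did not finish within {max_steps} steps; escalate to a human")


def confused_agent(state):
    """A toy agent that never reports done."""
    return state + 1, False


try:
    run_with_cap(confused_agent, state=0, max_steps=5)
except StepLimitExceeded as err:
    print(f"escalating: {err}")
```

Raising an exception, not returning a partial result, is deliberate here: a capped run that quietly returns something looks like success to the caller.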
Batching is the lever everyone forgets because it lives at the fleet level, not the task level. If a hundred tasks each call the same classification sub-step, you do not have to run a hundred separate calls. Many of those independent sub-steps batch, and batched throughput is cheaper than the same work done one request at a time. This only shows up when you are running agents at volume, which is exactly when the bill is large enough to care.
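At the fleet level, the first step is collecting sub-steps that look alike. The sketch below only plans the batches and sends nothing. The `(task_id, step_kind, payload)` tuple shape and the `max_batch_size` of 50 are assumptions for illustration. Real size limits and any batch discount depend on your provider.

```python
from collections import defaultdict


def plan_batches(pending, max_batch_size=50):
    """Group independent sub-steps by kind, then chunk each group.

    `pending` is a list of (task_id, step_kind, payload) tuples collected
    across many running tasks. The task_id travels with each payload so the
    results can be routed back to the task that asked.
    """
    by_kind = defaultdict(list)
    for task_id, step_kind, payload in pending:
        by_kind[step_kind].append((task_id, payload))

    batches = []
    for step_kind, items in by_kind.items():
        for start in range(0, len(items), max_batch_size):
            batches.append((step_kind, items[start:start + max_batch_size]))
    return batches


pending = [(f"task-{i}", "classify", f"ticket text {i}") for i in range(100)]
pending += [(f"task-{i}", "extract_field", f"invoice {i}") for i in range(30)]
batches = plan_batches(pending)
print(f"{len(pending)} sub-steps -> {len(batches)} batched requests")
```

This only works for sub-steps that do not depend on each other and can tolerate waiting for the batch to fill, which is why it suits classification and extraction better than the planning steps.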
Where this actually lands
Years ago, owning a large infrastructure P&L, I argued that FinOps is just capacity planning with a better hat. The cost was never a dashboard problem. It was an engineering-culture problem: the architecture decided the bill, and the dashboard only reported the verdict after the fact. Agents are the same idea with the dial turned up. The model price is a line item you cannot negotiate much. The loop, the context growth, the retries, the re-reads, the model routing: those are architecture, and architecture is yours.
So when the bill surprises someone, the honest answer is rarely “the model got expensive.” It is that the agent was allowed to loop without a cap, re-read without a memory, and replay a twelve-thousand-token prefix it could have cached. You can fix every one of those. None of them require a cheaper model.
The agents that survive the budget review are not the ones running on the cheapest tokens. They are the ones whose creators could draw the cost model on a whiteboard and point to exactly which step the money goes to. If you cannot draw yours, that is the first thing to build, before the next feature. What does one task of your agent actually cost, and which step are you paying for twice?