9.5 Cost intuition: tokens, context size, and repeated calls
Why cost surprises happen
Cost surprises usually come from multiplication, not from one big request:
- long prompts repeated across many calls,
- retry storms on failures,
- large context windows used by default,
- chat history ballooning quietly,
- high-volume usage once a prototype becomes a product.
This page gives you a practical intuition so you can predict cost before it bites you.
Stop thinking “one call.” Start thinking “calls per success” and “tokens per success.”
Tokens: the unit you’re really paying for
Most LLM pricing and quotas are denominated in tokens: chunks of text ranging from a single character to a whole word. You typically pay for both:
- input tokens: prompt + context,
- output tokens: model response.
Even without exact prices, you can reason about relative cost by measuring token sizes and call counts.
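If you want to measure rather than guess, you can count tokens locally before sending anything. A minimal sketch using tiktoken, OpenAI's tokenizer library; the right encoding depends on the model, and other providers ship their own tokenizers:

```python
# Sketch: count tokens locally with tiktoken (pip install tiktoken).
# The encoding below is illustrative; pick the one that matches your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return how many tokens this text encodes to."""
    return len(enc.encode(text))

prompt = "You are a helpful assistant. Summarize the following document..."
print(count_tokens(prompt))  # relative sizes matter more than exact prices
```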
The main cost drivers
1) Context size
Stuffing large documents or long chat history into every call is the fastest way to inflate cost. If you include the same 20k tokens in every request, you pay for them on every request: across 1,000 calls, that one context alone is 20 million input tokens.
2) Output length
Long outputs cost more and usually take longer. Tight schemas and concise outputs reduce both cost and latency.
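One concrete way to enforce this is a hard cap on output tokens at request time. A minimal sketch, assuming the OpenAI Python SDK (v1); the model name is illustrative and the parameter name varies by provider:

```python
# Sketch: cap billed output tokens per request (assumes the OpenAI Python SDK
# v1 and an OPENAI_API_KEY in the environment; adapt to your provider).
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice for a routine task
    messages=[
        {"role": "system", "content": "Answer in at most three bullet points."},
        {"role": "user", "content": "Summarize the deployment checklist."},
    ],
    max_tokens=200,  # hard ceiling on output tokens you pay for
)
print(resp.choices[0].message.content)
```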
3) Retries and failures
Retries can double or triple cost if you don’t cap them and back off. Rate limits and timeouts create “invisible cost” via repeated attempts.
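A minimal sketch of the capping pattern: a bounded attempt count, exponential backoff, and jitter. `call_model` is a hypothetical stand-in for whatever provider call you make:

```python
# Sketch: capped retries with exponential backoff and full jitter, so a
# transient failure costs at most `max_attempts` calls, not a retry storm.
import random
import time

def call_with_backoff(call_model, max_attempts=3, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: total cost is bounded at max_attempts calls
            # exponential backoff with full jitter to avoid synchronized retries
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            time.sleep(delay)
```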
4) Model choice
Smarter models are often more expensive per token and may encourage longer reasoning outputs. Route tasks deliberately.
5) Prompt verbosity
Overly verbose prompts and long “roleplay” instructions are cost multipliers. For code, concise constraints plus acceptance criteria are cheaper and more effective.
A simple cost model you can use
You can model cost with a simple formula:
tokens_per_call = input_tokens + output_tokens
tokens_per_success = tokens_per_call * calls_per_success
monthly_tokens = tokens_per_success * successful_tasks_per_month
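The same model as runnable code, with illustrative numbers; the function name is ours, not from any library:

```python
# Sketch: the cost model above as code. All numbers are illustrative.
def monthly_tokens(input_tokens, output_tokens, calls_per_success,
                   successful_tasks_per_month):
    tokens_per_call = input_tokens + output_tokens
    tokens_per_success = tokens_per_call * calls_per_success
    return tokens_per_success * successful_tasks_per_month

# Example: 20k-token context, 500-token answers, 1.5 calls per success
# (retries and validation passes), 10k successful tasks a month.
print(monthly_tokens(20_000, 500, 1.5, 10_000))  # 307,500,000 tokens
```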
You don’t need exact prices to make good decisions. You need to notice what multiplies tokens and what increases calls per success.
Cost-safe habits (that don’t slow you down)
- Summarize state: don’t paste entire histories; keep a stable state block.
- Constrain outputs: schemas, checklists, short responses.
- Cache and deduplicate: if inputs repeat, don’t pay for them twice (see the sketch after this list, and Part II 4.3 for reliability patterns).
- Cap retries: exponential backoff, a maximum attempt count, and jitter (see the backoff sketch above).
- Route tasks: use cheaper models for batch and routine work.
- Measure “cost per success,” not “cost per call.”
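Here is the cache-and-deduplicate habit as a minimal sketch: an in-memory memo keyed on everything that changes the answer. `call_model` is again a hypothetical stand-in; for a cache shared across processes, swap the dict for Redis or a database table:

```python
# Sketch: deduplicate identical requests with a content-addressed cache,
# so repeated inputs are paid for once. In-memory only; not for production.
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(call_model, model: str, prompt: str) -> str:
    # Key on everything that changes the answer: model + exact prompt text.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # pay for these tokens once
    return _cache[key]
```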
When costs spike, apps slow down, time out, and start failing. Cost and reliability are the same problem from two angles.
Cost and latency are linked
In many systems, the same factors increase both cost and latency:
- larger prompts,
- larger outputs,
- more retries,
- higher-concurrency bursts.
That’s why “optimize prompt size” is both a performance optimization and a cost optimization.