9.5 Cost intuition: tokens, context size, and repeated calls
Why cost surprises happen
Cost surprises usually come from multiplication, not from one big request:
- long prompts repeated across many calls,
- retry storms on failures,
- large context windows used by default,
- chat history ballooning quietly,
- high-volume usage once a prototype becomes a product.
This page gives you a practical intuition so you can predict cost before it bites you.
Stop thinking “one call.” Start thinking “calls per success” and “tokens per success.”
Tokens: the unit you’re really paying for
Most LLM pricing and quotas are denominated in tokens: chunks of text ranging from a single character to a whole word. You typically pay for both:
- input tokens: prompt + context,
- output tokens: model response.
Even without exact prices, you can reason about relative cost by measuring token sizes and call counts.
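If you want to measure rather than guess, you can count tokens locally before sending anything. A minimal sketch using tiktoken, OpenAI's tokenizer library; the right encoding depends on the model, and other providers ship their own tokenizers:

```python
# Sketch: count tokens locally with tiktoken (pip install tiktoken).
# The encoding below is illustrative; pick the one that matches your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return how many tokens this text encodes to."""
    return len(enc.encode(text))

prompt = "You are a helpful assistant. Summarize the following document..."
print(count_tokens(prompt))  # relative sizes matter more than exact prices
```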
The main cost drivers
1) Context size
Stuffing large documents or long chat history into every call is the fastest way to inflate cost. If you include the same 20k tokens in every request, you pay for them on every request: across 1,000 calls, that one context alone is 20 million input tokens.
2) Output length
Long outputs cost more and usually take longer. Tight schemas and concise outputs reduce both cost and latency.
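One concrete way to enforce this is a hard cap on output tokens at request time. A minimal sketch, assuming the OpenAI Python SDK (v1); the model name is illustrative and the parameter name varies by provider:

```python
# Sketch: cap billed output tokens per request (assumes the OpenAI Python SDK
# v1 and an OPENAI_API_KEY in the environment; adapt to your provider).
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice for a routine task
    messages=[
        {"role": "system", "content": "Answer in at most three bullet points."},
        {"role": "user", "content": "Summarize the deployment checklist."},
    ],
    max_tokens=200,  # hard ceiling on output tokens you pay for
)
print(resp.choices[0].message.content)
```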
3) Retries and failures
Retries can double or triple cost if you don’t cap them and back off. Rate limits and timeouts create “invisible cost” via repeated attempts.
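A minimal sketch of the capping pattern: a bounded attempt count, exponential backoff, and jitter. `call_model` is a hypothetical stand-in for whatever provider call you make:

```python
# Sketch: capped retries with exponential backoff and full jitter, so a
# transient failure costs at most `max_attempts` calls, not a retry storm.
import random
import time

def call_with_backoff(call_model, max_attempts=3, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: total cost is bounded at max_attempts calls
            # exponential backoff with full jitter to avoid synchronized retries
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            time.sleep(delay)
```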
4) Model choice
Smarter models are often more expensive per token and may encourage longer reasoning outputs. Route tasks deliberately.
5) Prompt verbosity
Overly verbose prompts and long “roleplay” instructions are cost multipliers. For code, concise constraints plus acceptance criteria are cheaper and more effective.
A simple cost model you can use
You can model cost with a simple formula:
tokens_per_call = input_tokens + output_tokens
tokens_per_success = tokens_per_call * calls_per_success
monthly_tokens = tokens_per_success * successful_tasks_per_month
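The same model as runnable code, with illustrative numbers; the function name is ours, not from any library:

```python
# Sketch: the cost model above as code. All numbers are illustrative.
def monthly_tokens(input_tokens, output_tokens, calls_per_success,
                   successful_tasks_per_month):
    tokens_per_call = input_tokens + output_tokens
    tokens_per_success = tokens_per_call * calls_per_success
    return tokens_per_success * successful_tasks_per_month

# Example: 20k-token context, 500-token answers, 1.5 calls per success
# (retries and validation passes), 10k successful tasks a month.
print(monthly_tokens(20_000, 500, 1.5, 10_000))  # 307,500,000 tokens
```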
You don’t need exact prices to make good decisions. You need to notice what multiplies tokens and what increases calls per success.
Cost-safe habits (that don’t slow you down)
- Summarize state: don’t paste entire histories; keep a stable state block.
- Constrain outputs: schemas, checklists, short responses.
- Cache and deduplicate: if inputs repeat, don’t pay for them twice (see the sketch after this list, and Part II 4.3 for reliability patterns).
- Cap retries: exponential backoff, a maximum attempt count, and jitter (see the backoff sketch above).
- Route tasks: use cheaper models for batch and routine work.
- Measure “cost per success,” not “cost per call.”
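Here is the cache-and-deduplicate habit as a minimal sketch: an in-memory memo keyed on everything that changes the answer. `call_model` is again a hypothetical stand-in; for a cache shared across processes, swap the dict for Redis or a database table:

```python
# Sketch: deduplicate identical requests with a content-addressed cache,
# so repeated inputs are paid for once. In-memory only; not for production.
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(call_model, model: str, prompt: str) -> str:
    # Key on everything that changes the answer: model + exact prompt text.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # pay for these tokens once
    return _cache[key]
```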
When costs spike, apps slow down, time out, and start failing. Cost and reliability are the same problem from two angles.
Cost and latency are linked
In many systems, the same factors increase both cost and latency:
- larger prompts,
- larger outputs,
- more retries,
- higher-concurrency bursts.
That’s why “optimize prompt size” is both a performance optimization and a cost optimization.