4.3 Rate limits and quotas (why your prototype suddenly fails)

This section covers why prototypes run into rate limits and quotas, how to tell the two apart, and the handful of mitigations that keep an AI-assisted app stable.

Why prototypes “suddenly fail”

Early prototypes often work once or twice and then start failing “randomly.” This is usually not a code bug. It’s the system pushing back with limits: rate limiting, quotas, timeouts, or cost constraints.

AI-assisted apps are especially susceptible because:

  • Calls can be expensive (large prompts, large outputs).
  • Latency can be variable (model load, contention, network).
  • Developers accidentally create bursts (refresh loops, retries, parallel requests).
  • Prototypes often lack timeouts, retries, and backoff.

The healthy mindset

Assume limits will happen. Design your app so limits are visible, handled, and recoverable—not surprising and catastrophic.

Rate limits vs quotas (clear definitions)

These terms are often conflated. Separating them makes troubleshooting easier.

Rate limits

Rate limits constrain how quickly you can make requests. They’re usually defined over time windows:

  • requests per second/minute,
  • tokens per minute,
  • concurrent requests.

Rate-limit failures tend to be bursty: you exceed a short window and get throttled.

Quotas

Quotas constrain total usage over longer windows (or by account/project policy):

  • daily/monthly usage caps,
  • spend limits,
  • project-level limits.

Quota failures tend to persist until the window resets or the quota is increased.

Quick diagnosis

If it fails only under bursts and later works again, suspect rate limiting. If it fails consistently until a reset, suspect quota.

What to measure (the few metrics that matter)

You don’t need a full observability stack to understand limits. You need a few basic measurements:

  • Request rate: how many requests are you making per minute?
  • Token usage: prompt tokens + output tokens (approximate is fine early).
  • Latency: p50/p95 request time; spikes are important.
  • Error rate: how often requests fail, and with what error codes?
  • Concurrency: how many requests are in flight at once?

Once you can see these, most “mystery failures” become obvious.

The first observability win

Log (1) a request ID, (2) latency, (3) outcome, (4) error category. That alone is enough for most early debugging.
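
A minimal sketch of that logging, assuming you already wrap whatever client you use in a call_model function of your own:

import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_calls")

def logged_call(call_model, prompt):
    # Log the four fields that matter early: request ID, latency, outcome, error category.
    request_id = uuid.uuid4().hex[:8]
    started = time.monotonic()
    outcome, error_category = "ok", None
    try:
        return call_model(prompt)          # call_model is your own wrapper (assumed)
    except Exception as exc:
        outcome, error_category = "error", type(exc).__name__
        raise
    finally:
        latency_ms = int((time.monotonic() - started) * 1000)
        log.info("request_id=%s latency_ms=%d outcome=%s error_category=%s",
                 request_id, latency_ms, outcome, error_category)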

Common symptoms and what they mean

Symptom: lots of 429 / “too many requests” errors

  • Likely cause: rate limiting (requests per minute, tokens per minute, concurrency).
  • Fix: reduce concurrency, add backoff retries, add client-side throttling.

Symptom: timeouts or hanging requests

  • Likely cause: no timeout configured, slow model response, network issues.
  • Fix: add explicit timeouts; reduce prompt size; implement retries for transient failures.

Symptom: works once, then fails after a few calls

  • Likely cause: rate limits, quota exhaustion, or cost caps.
  • Fix: reduce call frequency, shrink prompts, cache, and check quotas.

Symptom: suddenly slow (but not failing)

  • Likely cause: large prompts, long outputs, contention, or fallback behavior.
  • Fix: reduce prompt size; stream output (if supported); cache; parallelize only what’s safe.

Symptom: cost spike

  • Likely cause: repeated calls with large context or “retry storms.”
  • Fix: cache/deduplicate, cap retries, reduce context, measure tokens per success.

Retry storms are real

If you retry blindly on 429/timeouts, you can make the problem worse and increase cost. Retries must include backoff and limits.

Mitigation patterns that actually work

These patterns are boring—and extremely effective.

1) Reduce prompt size

  • Remove irrelevant context.
  • Summarize state instead of pasting long history.
  • Use schemas and concise instructions.
  • Move long docs into retrieval (later) instead of stuffing into prompts.
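
One small, concrete step in this direction is capping how much conversation history goes into the prompt at all. Characters are only a rough proxy for tokens (about 4 characters per token is a common rule of thumb), and the message format below is an assumption:

def trim_history(messages, max_chars=8000):
    # Keep only the most recent messages that fit a rough character budget.
    kept, used = [], 0
    for message in reversed(messages):     # newest first
        if used + len(message["text"]) > max_chars:
            break
        kept.append(message)
        used += len(message["text"])
    return list(reversed(kept))            # restore chronological order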

2) Limit concurrency

Many prototypes accidentally create parallel calls (UI double submits, background refresh, batching). Cap concurrency explicitly.

3) Add retries with exponential backoff

Retries help for transient failures, but only when implemented safely (see below).

4) Add timeouts

Without timeouts, your app can hang and create cascading failures.

5) Cache and deduplicate

If the same prompt + context repeats, caching is often the biggest cost/latency win.

6) Build fallbacks

  • Use a smaller model when the main model is overloaded.
  • Provide partial results if full completion fails.
  • Return a “try again” state with a clear reason and next step.
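
A minimal sketch of that fallback chain. call_primary, call_smaller, and RateLimitError are placeholders for whatever your client actually exposes:

class RateLimitError(Exception):
    """Placeholder; real client libraries raise their own throttling errors."""

def generate_with_fallback(prompt, call_primary, call_smaller):
    # Try the main model first; degrade gracefully instead of failing outright.
    try:
        return {"status": "ok", "text": call_primary(prompt)}
    except (RateLimitError, TimeoutError):
        pass                               # primary overloaded or too slow: fall back
    try:
        return {"status": "degraded", "text": call_smaller(prompt)}
    except Exception:
        return {"status": "retry_later",
                "message": "The model is busy right now. Please try again in a minute."}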

Order of operations

First: reduce prompt size and concurrency. Then add retries/timeouts. Then consider caching and fallbacks.

Retries: how to do them safely

Safe retries require three constraints:

  • Only retry retryable failures: throttling, transient network issues, some timeouts.
  • Backoff: wait longer between attempts (exponential backoff).
  • Cap attempts: stop after N tries and return a clear error/fallback.

What not to do

  • Don’t retry instantly.
  • Don’t retry forever.
  • Don’t retry non-retryable errors (bad auth, invalid request).
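
A minimal retry sketch under those constraints. ModelCallError and its category field are assumptions standing in for your client's real exception types:

import random
import time

class ModelCallError(Exception):
    # Hypothetical wrapper error; real clients expose their own exception types.
    def __init__(self, category, message=""):
        super().__init__(message)
        self.category = category           # e.g. "rate_limit", "timeout", "auth"

RETRYABLE = {"rate_limit", "timeout", "network"}

def call_with_retries(call_model, prompt, max_attempts=3, base_delay=0.25):
    # Exponential backoff with jitter, a hard attempt cap, no retries for non-retryable errors.
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(prompt)      # call_model is your own wrapper (assumed)
        except ModelCallError as exc:
            if exc.category not in RETRYABLE or attempt == max_attempts:
                raise                      # bad auth, invalid request, or out of attempts
            delay = base_delay * 2 ** (attempt - 1)       # 0.25s, 0.5s, 1s, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries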

Idempotency matters

If a call triggers side effects (tool calls, payments, writes), retries can duplicate actions. Design idempotency or avoid automatic retries for side-effecting operations.
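
A minimal sketch of that idea. The in-memory store and perform_side_effect are placeholders; a real app needs a persistent store with an atomic check-and-set:

_completed = {}                            # idempotency_key -> result of the operation

def run_once(idempotency_key, perform_side_effect):
    # A retry that re-sends the same key gets the recorded result instead of acting twice.
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = perform_side_effect()
    _completed[idempotency_key] = result
    return result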

Timeouts: how to avoid hanging your app

Timeouts are a core reliability primitive. Without them:

  • requests pile up,
  • your app becomes slow or unresponsive,
  • you hit concurrency limits faster,
  • users spam refresh (creating bursts).

Practical guidance

  • Set a reasonable timeout for model calls.
  • Differentiate between “hard timeout” (stop) and “soft timeout” (fallback).
  • Expose timeout failures clearly to users (and to your logs).
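
A minimal timeout wrapper, assuming a call_model function of your own; it bounds how long the app waits even if the underlying client has no timeout setting:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(call_model, prompt, timeout_s=20):
    # Give up after timeout_s seconds instead of hanging the app.
    future = _executor.submit(call_model, prompt)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()                    # best effort; an already-running call keeps running
        raise TimeoutError(f"model call exceeded {timeout_s}s")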

Timeouts improve UX

A fast failure with a clear retry message is better than a spinning UI that never finishes.

Caching and deduplication (save cost + latency)

Caching is one of the highest-leverage optimizations for LLM apps, but it needs care.

What to cache

  • Identical prompt + identical inputs.
  • Deterministic or near-deterministic tasks (lower temperature).
  • Intermediate results (summaries, extracted fields) when safe.

What not to cache

  • Requests containing sensitive user data unless you have a clear privacy policy and secure storage.
  • Highly variable tasks where diversity is the point (brainstorming).
  • Outputs that depend on time-sensitive context (unless you include time in the cache key).

Deduplication

If two identical requests arrive simultaneously, run one and share the result. This prevents burst amplification.
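
A minimal in-memory sketch of both ideas: a cache keyed by a hash of the request (so the key itself doesn't store raw, possibly sensitive text) plus single-flight deduplication so identical concurrent requests trigger only one model call. call_model is assumed to be your own wrapper:

import hashlib
import json
import threading

_cache = {}                                # key -> completed result
_inflight = {}                             # key -> Event for a request already running
_lock = threading.Lock()

def cache_key(model, prompt_version, prompt):
    payload = json.dumps({"m": model, "v": prompt_version, "p": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(model, prompt_version, prompt, call_model):
    key = cache_key(model, prompt_version, prompt)
    while True:
        with _lock:
            if key in _cache:
                return _cache[key]         # cache hit: no model call
            event = _inflight.get(key)
            if event is None:
                event = threading.Event()
                _inflight[key] = event     # we own this request
                break
        event.wait()                       # identical request already in flight: wait for it
    try:
        result = call_model(prompt)
        with _lock:
            _cache[key] = result
        return result
    finally:
        with _lock:
            _inflight.pop(key, None)
        event.set()                        # wake any duplicate callers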

Caching can leak data if done carelessly

Cache keys and stored outputs can contain sensitive info. Treat caching as a security feature, not just a performance feature.

Burst control and concurrency limits

Most throttling issues come from bursts, not steady-state usage.

  • Client-side throttling: limit how often the UI can submit.
  • Server-side queueing: cap concurrent calls and queue overflow.
  • Backpressure: when overloaded, return a clear “try again” response.

If you do only one thing, cap concurrency. It’s often the biggest stability win.
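
A minimal concurrency cap using a semaphore; the limit of 2, the wait time, and the call_model wrapper are assumptions to tune for your app:

import threading

MAX_CONCURRENT_CALLS = 2                   # start small; raise only after measuring
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)

def call_with_cap(call_model, prompt, wait_s=10):
    # Wait briefly for a free slot; if none opens up, return backpressure instead of piling on.
    if not _slots.acquire(timeout=wait_s):
        raise RuntimeError("overloaded: too many requests in flight, please try again shortly")
    try:
        return call_model(prompt)
    finally:
        _slots.release()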

A good default

Start with a small concurrency limit and increase only after measuring success rate and latency.

Designing your UX for failures

Your app should assume LLM calls will sometimes fail. Good UX makes failures non-catastrophic:

  • Show progress and allow cancellation.
  • Explain failures clearly (timeout, rate limit, invalid output).
  • Offer a retry button that respects backoff (don’t encourage spam clicks).
  • Provide partial output if possible (or a safe fallback response).
  • Log enough detail for debugging without leaking sensitive info.
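
One simple way to keep failure copy consistent is a single mapping from the logged error category to a user-facing message; the categories mirror the logging template later in this section, and the wording is only an example:

USER_MESSAGES = {
    "rate_limit": "We're sending requests too quickly. Please wait a moment and try again.",
    "quota": "We've hit today's usage limit. Please try again later.",
    "timeout": "This request took too long. Try again, or shorten your input.",
    "invalid_output": "The response didn't come back in the expected format. Please retry.",
    "auth": "Something is wrong with the app's credentials; retrying won't help.",
}

def user_message(error_category):
    # Fall back to a generic, non-alarming message for anything unrecognized.
    return USER_MESSAGES.get(error_category,
                             "Something went wrong. Please try again in a moment.")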

Refusal-aware and failure-aware UX

Safety blocks, invalid JSON, timeouts, and rate limits are normal. Design your UI so they don’t feel like bugs.

A debugging checklist (fast diagnosis)

When your prototype starts failing, run this checklist:

  • Reproduce with a single request: remove concurrency and loops.
  • Check error codes/messages: especially 429s and quota-related errors.
  • Measure latency and token size: is your prompt/output huge?
  • Check retries: are you retrying too aggressively?
  • Check concurrency: are multiple requests in flight?
  • Check caching: are you making identical calls repeatedly?
  • Confirm credentials/project: quota issues can look like auth issues after switching projects/accounts.

Don’t debug in the dark

If you aren’t logging error categories and latency, you’re guessing. Add minimal logging and the problem usually becomes obvious.

Copy-paste templates

Template: retry policy statement

Retry policy:
- Retryable: 429, transient network errors, timeouts
- Max attempts: 3
- Backoff: exponential (e.g. 250ms, 500ms, 1s) + jitter
- Non-retryable: auth errors, invalid requests, schema validation failures
- Log: request id, attempt count, error category, latency

Template: minimal request logging fields

Log fields:
- request_id
- model
- prompt_version
- input_size (approx)
- latency_ms
- outcome (ok/error)
- error_category (rate_limit/quota/timeout/auth/invalid_output/unknown)

Template: failure-aware UX copy

We couldn’t complete this request (rate limit / timeout).
Please wait a moment and try again.
If this keeps happening, reduce input size or try later.

Where to go next