32.4 Batch processing vs interactive mode


Interactive Mode: Optimize for Speed

When a human is waiting (chatbot, autocomplete), latency is king. You pay a premium for immediate availability.

  • Use streaming so the user sees the first tokens immediately (see the sketch below).
  • Prefer smaller, faster models when quality allows.
  • Keep context windows tight to reduce prefill time.
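
As a sketch of the streaming point, here is what it looks like with the google-genai Python SDK; the model name and prompt are placeholders, and the client assumes a GEMINI_API_KEY environment variable:

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Stream the response so the user sees the first tokens immediately,
# instead of waiting for the entire completion to finish.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",  # a smaller, faster model keeps latency low
    contents="Autocomplete this sentence: The quickest way to cut latency is",
):
    print(chunk.text, end="", flush=True)
```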

Batch Mode: Optimize for Throughput

Many AI tasks don't need to happen now. They just need to happen today.

  • Summarizing yesterday's meeting logs.
  • Tagging a backlog of 1,000 support tickets.
  • Generating unit tests for an entire legacy codebase.

For these, use Batch Processing. You send a file with 10,000 requests, go to sleep, and wake up with the results.
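Concretely, "a file with 10,000 requests" is usually a JSONL file with one request per line. The exact schema varies by provider, so treat the field names below (`key`, `request`) as an illustrative sketch of the Gemini batch format and check your provider's docs:

```python
import json

# One request per line; each needs a unique key so results can be
# matched back to their inputs after the batch completes.
tickets = ["Printer on fire", "Password reset loop", "App crashes on login"]

with open("batch_requests.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        line = {
            "key": f"ticket-{i}",
            "request": {
                "contents": [{"parts": [{"text": f"Tag this support ticket: {ticket}"}]}]
            },
        }
        f.write(json.dumps(line) + "\n")
```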

Using the Batch API

The Gemini API and Vertex AI both offer a Batch API (a deferred, asynchronous processing mode). The benefits are massive:

  1. 50% Lower Cost: Batch requests are typically priced at about half the real-time rate, because the provider can schedule them during off-peak capacity instead of serving them instantly.
  2. Higher Rate Limits: You can queue up far more tokens than your per-minute quota would allow.
  3. Reliability: The platform manages retries and queueing for you.
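
A minimal submit-and-poll sketch, assuming the google-genai SDK's batch interface and the JSONL file built above; the model name is a placeholder, and the exact config fields and job states are worth verifying against the current docs:

```python
import time
from google import genai

client = genai.Client()

# Upload the request file, then hand the whole thing to the batch queue.
src = client.files.upload(
    file="batch_requests.jsonl",
    config={"mime_type": "jsonl"},  # JSONL type is not inferred automatically
)
job = client.batches.create(model="gemini-2.0-flash", src=src.name)

# Poll occasionally; batch jobs are queued, so this can take hours.
while job.state.name not in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED"):
    time.sleep(60)
    job = client.batches.get(name=job.name)

print("Batch finished with state:", job.state.name)
```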

Night Shift

If you are building a "Vibe Coding" tool that refactors code, don't make the developer watch it write. Design a "nightly refactor" agent that runs in batch mode and opens a PR in the morning.
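
As a sketch of that design, the morning half of the agent could be a scheduled script that commits whatever the overnight batch produced and opens a PR. This assumes the batch results have already been applied to the working tree, and that the GitHub CLI (gh) is installed and authenticated; the branch name and messages are placeholders:

```python
import subprocess

BRANCH = "nightly-refactor"

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Commit the batch-generated changes on a fresh branch and open a PR
# for a human to review in the morning.
run("git", "checkout", "-b", BRANCH)
run("git", "commit", "-am", "Nightly automated refactor")
run("git", "push", "-u", "origin", BRANCH)
run("gh", "pr", "create",
    "--title", "Nightly automated refactor",
    "--body", "Opened by the batch refactor agent. Review before merging.")
```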
