29.3 Streaming responses and partial rendering
Goal: improve perceived latency without sacrificing safety
Users care about “time to first useful signal,” not just total completion time.
Streaming improves perceived latency by showing partial progress. But streaming also creates new risks: you might show incorrect or unsafe content before it has been validated.
The goal is to stream carefully:
- stream safe partial output when appropriate,
- validate final outputs,
- handle mid-stream failures gracefully,
- avoid leaking unverified claims.
What streaming is good for
- Perceived speed: first tokens arrive quickly.
- Progress visibility: users see that the system is working.
- Long answers: users can start reading immediately.
- Interruptibility: users can cancel when they’ve seen enough.
Streaming risks (and where teams get burned)
- No early validation: you can’t validate output against a JSON schema until the stream ends.
- Unsafe partial output: you might display disallowed content before filters have run.
- Wrong-but-confident early text: users may act on the first thing they read.
- Mid-stream failures: network drops, timeouts, or rate limits leave users with half an answer.
If users can see partial output, you need clear states: “draft,” “final,” “needs verification,” and “failed.” Otherwise users treat partial output as truth.
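One way to keep these states honest is to model them explicitly and reject illegal moves, for example going from “final” back to “draft.” The sketch below is a minimal illustration; the names `DisplayState`, `ALLOWED_TRANSITIONS`, and `transition`, and the specific transition rules, are assumptions rather than part of any particular framework.

```python
from enum import Enum


class DisplayState(Enum):
    DRAFT = "draft"
    NEEDS_VERIFICATION = "needs verification"
    FINAL = "final"
    FAILED = "failed"


# Assumed rules for illustration: "final" and "failed" are terminal states.
ALLOWED_TRANSITIONS = {
    DisplayState.DRAFT: {
        DisplayState.NEEDS_VERIFICATION,
        DisplayState.FINAL,
        DisplayState.FAILED,
    },
    DisplayState.NEEDS_VERIFICATION: {DisplayState.FINAL, DisplayState.FAILED},
    DisplayState.FINAL: set(),
    DisplayState.FAILED: set(),
}


def transition(current: DisplayState, target: DisplayState) -> DisplayState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target
```

The UI can then label content from the state value alone, so partial output is never rendered without a visible “draft” marker.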
Streaming patterns that work
Three practical patterns:
Pattern A: stream a “draft,” then confirm “final”
- Stream text as it arrives.
- When complete, run validation/post-processing.
- Mark as “final” only after validation passes.
This is simplest for chat-like experiences where strict structure is not required.
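Pattern A can be sketched as a generator that emits the accumulated text as a draft and only labels it “final” once a validator passes. This is a minimal sketch: `stream_draft_then_finalize` and the `validate` callback are hypothetical names, and `chunks` stands in for whatever iterator your model client returns.

```python
from typing import Callable, Iterable, Iterator, Tuple


def stream_draft_then_finalize(
    chunks: Iterable[str],
    validate: Callable[[str], bool],
) -> Iterator[Tuple[str, str]]:
    """Yield (state, text) pairs: drafts while streaming, then one verdict."""
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
            # Everything shown before validation is explicitly a draft.
            yield ("draft", "".join(parts))
    except Exception:
        # Mid-stream failure: report it instead of leaving a silent half answer.
        yield ("failed", "".join(parts))
        return
    full_text = "".join(parts)
    # Validation and post-processing run only on the complete text.
    state = "final" if validate(full_text) else "needs verification"
    yield (state, full_text)
```

A caller renders each pair as it arrives and switches the label when the last pair comes through.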
Pattern B: two-phase output (plan first, then answer)
- Phase 1 (stream): show a short plan or outline (safe, high-level).
- Phase 2 (non-streamed or delayed stream): show the final answer after validation.
This gives early progress without streaming the riskiest content.
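A rough sketch of the two phases, assuming the plan and the answer arrive as two separate chunk iterators (`plan_chunks` and `answer_chunks` are hypothetical, as is the `validate` callback): plan chunks are forwarded immediately, while the answer is buffered and released only if it validates.

```python
from typing import Callable, Iterable, Iterator, Tuple


def two_phase_response(
    plan_chunks: Iterable[str],
    answer_chunks: Iterable[str],
    validate: Callable[[str], bool],
) -> Iterator[Tuple[str, str]]:
    """Stream the plan as it arrives; emit the answer once, after validation."""
    # Phase 1: the plan is high-level, so it is shown chunk by chunk.
    for chunk in plan_chunks:
        yield ("plan", chunk)
    # Phase 2: the answer is collected in full and never shown piecemeal.
    answer = "".join(answer_chunks)
    if validate(answer):
        yield ("final", answer)
    else:
        yield ("failed", "")
```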
Pattern C: stream progress events, not content
For high-risk systems, stream status:
- “retrieving sources…”
- “ranking sources…”
- “generating answer…”
- “validating…”
Then show the final validated answer. This is often the best default for grounded systems.
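As an illustration, the pipeline below emits only status events until the answer has been validated. The four stage callbacks (`retrieve`, `rank`, `generate`, `validate`) and the event shape are assumptions made for this sketch; substitute your own pipeline steps.

```python
from typing import Any, Callable, Dict, Iterator, List


def run_with_progress(
    question: str,
    retrieve: Callable[[str], List[str]],
    rank: Callable[[List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    validate: Callable[[str, List[str]], bool],
) -> Iterator[Dict[str, Any]]:
    """Yield status events for each stage, then one final or failed event."""
    yield {"type": "status", "message": "retrieving sources..."}
    sources = retrieve(question)
    yield {"type": "status", "message": "ranking sources..."}
    ranked = rank(sources)
    yield {"type": "status", "message": "generating answer..."}
    answer = generate(question, ranked)
    yield {"type": "status", "message": "validating..."}
    # No answer text reaches the user before this check.
    if validate(answer, ranked):
        yield {"type": "final", "answer": answer}
    else:
        yield {"type": "failed", "message": "The answer did not pass validation."}
```

Because the generator is lazy, each status event is delivered before the corresponding stage runs, so the user sees progress even when a stage is slow.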
Streaming with structured output (hard mode)
Streaming strict JSON is hard because partial JSON is usually invalid.
Practical strategies:
- Don’t stream JSON: stream progress and return JSON at the end.
- Stream line-delimited JSON events: each event is valid JSON (requires strict format control).
- Stream an outline, then JSON: show a human-readable outline early, return machine output last.
If your system requires schema validation and citations, the simplest safe approach is: stream progress + final validated JSON.
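For the line-delimited option, the client has to buffer incoming text and parse only complete lines, because a network chunk can end in the middle of an event. Here is a minimal sketch using only the standard library; `parse_ndjson_stream` is a hypothetical helper, and the event fields in the usage example are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_ndjson_stream(chunks: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per complete line in the stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only text up to a newline is a complete event; keep the rest buffered.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # Leftover text without a trailing newline is parsed as a last event;
    # if the stream was cut off mid-event, json.loads raises JSONDecodeError.
    if buffer.strip():
        yield json.loads(buffer)


chunks = [
    '{"type": "status", "mes',
    'sage": "validating"}\n{"type": ',
    '"final", "data": {"ok": true}}\n',
]
events = list(parse_ndjson_stream(chunks))
```

In this usage example, `events` holds two dictionaries even though neither event arrived in a single chunk. Schema validation would still run on the final event before it is shown or acted on.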
UX: progress, cancellation, and partial failures
Streaming UX needs explicit states:
- In progress: show status and partial content (if used).
- Cancelable: the user can stop the request (and you cancel upstream work).
- Completed: validated output is displayed.
- Failed: show a safe fallback and next actions.
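Cancellation is only real if it stops upstream work, not just rendering. The sketch below checks a cancel flag between chunks and closes the source when the flag is set. It assumes the stream is a Python generator (or any iterator with a `close()` method); `consume_stream` and `on_chunk` are hypothetical names.

```python
import threading
from typing import Callable, Iterator


def consume_stream(
    chunks: Iterator[str],
    cancel: threading.Event,
    on_chunk: Callable[[str], None],
) -> str:
    """Render chunks until done or cancelled; return the terminal state."""
    for chunk in chunks:
        if cancel.is_set():
            # Close the source so upstream work stops as well.
            close = getattr(chunks, "close", None)
            if callable(close):
                close()
            return "cancelled"
        on_chunk(chunk)
    return "completed"
```

A “stop” button handler would call `cancel.set()`. With a real HTTP client, closing the iterator should also close the underlying response or connection; how that is done depends on the client library.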
Design partial failure behavior:
- Retry: one automatic retry if safe and within budget.
- Fallback: show sources/links or a minimal response.
- Explain: tell users what went wrong, for example by distinguishing “connection lost” from “rate limited.”
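These three behaviors can be combined in one small wrapper. The sketch below is illustrative only: `RateLimitedError` is a hypothetical exception standing in for whatever your client raises, `fetch` is any zero-argument callable that returns the full answer, and the choice not to auto-retry rate limits is an assumption.

```python
from typing import Callable, Dict


class RateLimitedError(Exception):
    """Hypothetical error type for an upstream rate limit."""


def answer_with_fallback(
    fetch: Callable[[], str],
    fallback_text: str,
    max_retries: int = 1,
) -> Dict[str, str]:
    """Try fetch, retry dropped connections, else return a safe fallback."""
    message = "The request failed."
    for _ in range(max_retries + 1):
        try:
            return {"state": "final", "text": fetch(), "message": ""}
        except RateLimitedError:
            # Assumption: retrying immediately would likely hit the limit again.
            message = "Rate limited: too many requests right now. Please try again shortly."
            break
        except ConnectionError:
            message = "Connection lost. Check your network and try again."
    return {"state": "failed", "text": fallback_text, "message": message}
```

The fallback text might be a list of source links or a minimal response, so a failed request still leaves the user with something safe and a clear next action.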