13.3 Basic request/response wrapper architecture
On this page
- Goal: isolate model calls behind one boundary
- Why wrapper architecture matters
- A practical layering model
- Design the wrapper interface (inputs/outputs)
- Prompt packaging (versions, metadata, structure)
- Validation: treat outputs as untrusted
- Testing: dependency injection and fakes
- Recommended file structure
- Copy-paste templates
- Where to go next
Goal: isolate model calls behind one boundary
As soon as your app makes model calls, you need a boundary so the rest of your code doesn’t turn into “prompt glue.”
The goal of this chapter is to build a single module (or small package) that is the only place that:
- knows how to call the model API,
- knows about retries/timeouts,
- knows about schemas and output validation,
- knows how to log model-call metadata safely.
Everything else should treat the model call like a normal dependency with a small interface.
If model calls are scattered across handlers and controllers, your app becomes hard to debug, hard to test, and expensive to change. A wrapper boundary prevents that.
Why wrapper architecture matters
LLM calls behave differently from normal library functions:
- they can be slow,
- they can fail transiently (rate limits, timeouts),
- they can fail “semantically” (invalid JSON, wrong schema),
- they can be blocked/refused (safety behavior),
- they can be expensive.
If you don’t isolate those behaviors, they leak into every part of your codebase.
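These failure modes are easier to contain when each one has an explicit name in your code instead of living in scattered try/except blocks. A minimal sketch (the `CallStatus` name and the exact set of values are illustrative, not from any library):

```python
# Sketch: give each failure mode an explicit, named status so callers can
# branch on it instead of parsing exception messages.
from enum import Enum


class CallStatus(str, Enum):
    OK = "ok"
    TIMEOUT = "timeout"                 # the call took too long
    RATE_LIMIT = "rate_limit"           # transient; usually worth a retry
    BLOCKED = "blocked"                 # safety refusal from the provider
    INVALID_OUTPUT = "invalid_output"   # invalid JSON / schema mismatch
    ERROR = "error"                     # anything else
```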
A practical layering model
A simple, durable architecture:
- UI / entrypoints: CLI, HTTP handlers, jobs. They handle input/output and UX.
- Domain layer: your product logic (what the app is trying to do).
- LLM adapter: the wrapper client that calls the model and returns structured results.
- Infrastructure: config, logging, persistence, caching.
The key is that domain logic should not contain raw prompt strings or provider-specific API calls.
When prompts are embedded inline across the app, you can’t version them, test them, or evolve them safely. Centralize prompt usage.
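To make the boundary concrete, here is a minimal sketch of what "no provider imports in the domain layer" looks like in Python. The `Summarizer` interface and `summarize_ticket` function are illustrative names, not part of any SDK:

```python
# Sketch: domain code depends on a small interface, never on a provider SDK.
from typing import Protocol


class Summarizer(Protocol):
    """The only thing domain code knows about the model boundary."""

    def summarize(self, text: str) -> str: ...


def summarize_ticket(ticket_text: str, llm: Summarizer) -> str:
    """Domain logic decides *what* to summarize; the adapter decides *how*."""
    cleaned = ticket_text.strip()
    return llm.summarize(cleaned)
```

Any object with a matching `summarize` method satisfies the interface, which is what makes the fake-client testing pattern later in this chapter work.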
Design the wrapper interface (inputs/outputs)
Your wrapper should expose a small, typed interface. The interface you choose is more important than the provider API details.
Inputs you typically want
- task name / prompt id: which prompt template to use,
- task inputs: the user data (text, params),
- output mode: free text vs structured (schema),
- options: timeout, retries, temperature overrides, model override.
Outputs you typically want
Prefer returning a structured response object rather than raw text. A practical response includes:
- status: ok / blocked / invalid_output / timeout / rate_limit / error
- result: parsed structured data (or text)
- raw_text: optional (for debugging in dev)
- metadata: request id, prompt version, model, latency, token estimates
- error: categorized error info safe to log and show
Define an explicit request/response contract for your model boundary. That’s what makes the system testable and resilient.
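A minimal sketch of that contract as Python dataclasses, with fields taken from the lists above (the field names and types are a starting point, not a fixed API):

```python
# Sketch of the request/response contract for the model boundary.
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LLMRequest:
    prompt_id: str                          # which prompt template to use
    inputs: dict[str, Any]                  # task data injected into the template
    prompt_version: Optional[str] = None    # pin a template version (see prompt packaging below)
    output_schema: Optional[dict] = None    # expected structure, if any
    options: dict[str, Any] = field(default_factory=dict)  # timeout, retries, model override


@dataclass
class LLMResponse:
    status: str                             # "ok" | "blocked" | "invalid_output" | "timeout" | "rate_limit" | "error"
    result: Optional[Any] = None            # parsed structured data (or text)
    raw_text: Optional[str] = None          # keep for debugging in dev, drop from prod logs
    metadata: dict[str, Any] = field(default_factory=dict)  # request_id, prompt_version, model, latency, token estimate
    error: Optional[str] = None             # categorized error info safe to log and show
```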
Prompt packaging (versions, metadata, structure)
Your wrapper should not build prompts ad hoc in scattered places across the codebase. It should:
- load prompt templates from files,
- inject task inputs into a template safely,
- include prompt version ids in logs and responses,
- separate “system/house rules” from “task spec” (Part IV Section 10).
This is how you prevent prompt drift and make outputs reproducible.
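A minimal sketch of template loading, assuming templates live next to the wrapper (as in the file structure later in this chapter) and follow a hypothetical `{prompt_id}_{version}.md` naming convention with `$placeholder` fields:

```python
# Sketch: load a versioned prompt template from disk and fill in task inputs.
# The directory layout and naming convention here are assumptions; the point
# is that the version string travels with the rendered prompt into logs.
from pathlib import Path
from string import Template

PROMPTS_DIR = Path(__file__).parent / "prompts"


def render_prompt(prompt_id: str, version: str, inputs: dict[str, str]) -> tuple[str, str]:
    """Return (prompt_text, prompt_version) so callers can log the version."""
    template_path = PROMPTS_DIR / f"{prompt_id}_{version}.md"
    template = Template(template_path.read_text(encoding="utf-8"))
    # string.Template uses $-style placeholders; safe_substitute leaves
    # unknown placeholders in place instead of raising.
    return template.safe_substitute(inputs), version
```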
Validation: treat outputs as untrusted
Even if the model is “usually right,” outputs are untrusted input:
- validate JSON if you asked for JSON,
- validate schemas (required fields, enums, types),
- handle partial/invalid output gracefully (retry or fallback),
- never assume “it will always follow instructions.”
Structured output and validation get a full chapter later (Part V Section 15), but your wrapper should be designed to support it from day one.
Don’t push validation into every caller. Do it once in the wrapper and return a typed, validated result or a clear error.
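A minimal sketch of that single validation point, using plain `json` plus a required-field check (swap in a real schema validator and the typed response object sketched earlier if you use them):

```python
# Sketch: treat model output as untrusted input and validate it once,
# inside the wrapper. Returns a plain dict here for brevity.
import json


def parse_structured_output(raw_text: str, required_fields: list[str]) -> dict:
    """Parse model output, then check its shape before anyone else sees it."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return {"status": "invalid_output", "error": f"not valid JSON: {exc}", "raw_text": raw_text}

    missing = [f for f in required_fields if f not in data]
    if missing:
        return {"status": "invalid_output", "error": f"missing fields: {missing}", "raw_text": raw_text}

    return {"status": "ok", "result": data}
```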
Testing: dependency injection and fakes
The fastest way to make AI apps testable is to make the LLM adapter injectable.
- Your domain logic depends on an interface like `Summarizer`, not on an API client.
- In tests, you use a fake summarizer that returns deterministic outputs.
- In production, you bind the real LLM client implementation.
This lets you test your app without making real model calls (faster, cheaper, deterministic).
Real model calls are slow, flaky, and expensive in tests. Save them for manual testing and evaluation harnesses.
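A minimal sketch of such a test with a fake adapter, reusing the illustrative `summarize_ticket` and `Summarizer` names from the layering sketch above (the `app.summarize` import path is hypothetical and should match your own layout):

```python
# Sketch: a fake client satisfies the same interface as the real one,
# so domain logic can be tested without any network calls.
from app.summarize import summarize_ticket  # hypothetical module path


class FakeSummarizer:
    def summarize(self, text: str) -> str:
        return f"FAKE SUMMARY ({len(text)} chars)"


def test_summarize_ticket_uses_injected_client():
    result = summarize_ticket("Customer cannot log in after password reset.", llm=FakeSummarizer())
    assert result.startswith("FAKE SUMMARY")
```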
Recommended file structure
A small, scalable structure:
src/
  app/                       # domain logic (no provider imports)
    summarize.py
  llm/                       # model boundary
    client.py
    prompts/
      system.md
      summarize_v1.md
    schemas/
      summarize_v1.json
  config.py
  logging.py
tests/
  test_app_summarize.py      # uses a fake LLM client
  test_llm_schema.py         # validates schema enforcement
Copy-paste templates
Template: wrapper interface sketch
LLMRequest:
- prompt_id
- prompt_version
- inputs (task data)
- output_schema (optional)
- options (timeout, retries, model override)
LLMResponse:
- status (ok/blocked/invalid_output/timeout/rate_limit/error)
- result (typed) or null
- raw_text (optional)
- metadata (request_id, model, latency, token_estimate)
- error (category + message)
Template: boundary rule (paste into prompts)
Architecture rule:
- All model/API calls must go through `src/llm/client.*`.
- No other modules may call the provider SDK directly.
- Callers must handle `status != ok` outcomes explicitly.