13.5 Versioning prompts (treat prompts like code)

Why prompts must be versioned

In AI apps, prompt changes are behavior changes. If you don’t version prompts, you will experience:

  • silent regressions (“it used to work”),
  • inconsistent outputs across environments,
  • inability to reproduce good runs,
  • debugging that turns into guessing.

Versioning prompts makes AI behavior manageable the same way versioning code makes software manageable.

Prompts are product logic

Even if your “code” is unchanged, a prompt update can change results. Treat prompts as first-class artifacts.

What to version (prompts, schemas, settings)

Versioning only the prompt text is not enough. You want a complete “behavior bundle”:

  • Prompt templates: system/house rules and task prompts.
  • Schemas: JSON schemas or structured-output contracts.
  • Model settings: key parameters that affect output (temperature, max output tokens).
  • Model choice: model name/version used for a given prompt version.

At minimum, prompts + schemas should be versioned in git.
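One way to keep the whole bundle together is a small immutable record that travels with every call. This is a minimal sketch: the class name `BehaviorBundle`, its field names, and the model name `example-model` are hypothetical placeholders, not part of any library.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BehaviorBundle:
    """Everything that determines model behavior for one task version."""

    prompt_id: str          # e.g. "summarize"
    prompt_version: str     # e.g. "v2"
    schema_version: str     # version of the structured-output contract
    model: str              # model name/version pinned for this prompt version
    temperature: float
    max_output_tokens: int


# Illustrative values only; "example-model" is a placeholder, not a real model.
SUMMARIZE_V2 = BehaviorBundle(
    prompt_id="summarize",
    prompt_version="v2",
    schema_version="v2",
    model="example-model",
    temperature=0.2,
    max_output_tokens=512,
)

# asdict() gives a plain dict that is easy to log or store next to results.
print(asdict(SUMMARIZE_V2))
```

Freezing the dataclass means a bundle cannot be mutated at runtime, so a change in behavior has to show up as a new bundle in a reviewed diff.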

A practical repo structure for prompts

A simple structure that scales:

src/
  llm/
    prompts/
      system.md
      tasks/
        summarize/
          v1.md
          v2.md
        extract_fields/
          v1.md
    schemas/
      summarize/
        v1.json
        v2.json
      extract_fields/
        v1.json

This makes it obvious:

  • which tasks exist,
  • which versions are available,
  • which schema matches which prompt.

Keep system rules separate

Put stable “house rules” in a shared system.md and keep task prompts focused. This reduces duplication and drift.
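With that layout, loading a prompt is mostly path arithmetic. The sketch below assumes the directory structure shown above and that the shared `system.md` is simply prepended to the task prompt; the function names `load_prompt` and `load_schema` are hypothetical.

```python
import json
from pathlib import Path

# Assumed locations, matching the repo structure in this section.
PROMPT_ROOT = Path("src/llm/prompts")
SCHEMA_ROOT = Path("src/llm/schemas")


def load_prompt(task: str, version: str, prompt_root: Path = PROMPT_ROOT) -> str:
    """Return the shared system rules followed by one versioned task prompt."""
    system = (prompt_root / "system.md").read_text(encoding="utf-8")
    task_path = prompt_root / "tasks" / task / f"{version}.md"
    task_prompt = task_path.read_text(encoding="utf-8")
    return f"{system.strip()}\n\n{task_prompt.strip()}\n"


def load_schema(task: str, version: str, schema_root: Path = SCHEMA_ROOT) -> dict:
    """Return the JSON schema that pairs with the given task version."""
    schema_path = schema_root / task / f"{version}.json"
    return json.loads(schema_path.read_text(encoding="utf-8"))
```

A call such as `load_prompt("summarize", "v2")` raises `FileNotFoundError` if the version does not exist, which is usually what you want: a missing version should fail loudly, not fall back silently.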

Prompt IDs and versions (how to name things)

Pick a naming scheme you can log and search easily. A practical scheme:

  • Prompt ID: summarize, extract_fields, answer_with_sources
  • Prompt version: v1, v2, … (or semantic versions if you prefer)
  • Full identifier: summarize@v2

Then your app can log: prompt_id=summarize, prompt_version=v2, schema_version=v2.
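A full identifier like `summarize@v2` is easy to parse and validate in one place. This is a minimal sketch assuming lowercase snake_case IDs and simple `vN` versions; the names `PromptRef` and `parse_prompt_ref` are hypothetical, and the pattern would need adjusting for semantic versions.

```python
import re
from typing import NamedTuple


class PromptRef(NamedTuple):
    prompt_id: str
    version: str

    def __str__(self) -> str:
        return f"{self.prompt_id}@{self.version}"


# Assumed format: snake_case id, then "@", then "v" plus a positive integer.
_REF_PATTERN = re.compile(r"([a-z][a-z0-9_]*)@(v[1-9][0-9]*)")


def parse_prompt_ref(text: str) -> PromptRef:
    """Parse 'summarize@v2' into its id and version, rejecting anything else."""
    match = _REF_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid prompt identifier: {text!r}")
    return PromptRef(prompt_id=match.group(1), version=match.group(2))


ref = parse_prompt_ref("summarize@v2")
print(ref.prompt_id, ref.version)
```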

Version bumps should mean something

If you changed behavior, bump the version. If you only fixed formatting with no behavior change, you may not need a bump, but be honest about whether behavior really is unchanged.

Changing prompts safely (review + verification)

Prompt changes should follow a mini engineering workflow:

  1. Write acceptance criteria: what must remain true?
  2. Update prompt version: create v2 rather than editing v1 in place (safer early on).
  3. Update schema if needed: keep schema/prompt aligned.
  4. Run a small eval set: 10–25 examples that matter (manual checking is fine early on).
  5. Review diffs: prompt diffs are behavior diffs.
  6. Roll out deliberately: use a config flag to choose v1 vs v2.

Never “edit v1” in production systems

If you overwrite the prompt that produced yesterday’s behavior, you lose the ability to reproduce and debug. Add v2 instead.
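The eval step in the workflow above can start as a few lines of code. The sketch below assumes each eval case is an input paired with a check function encoding one acceptance criterion. `fake_v1` and `fake_v2` are hypothetical stand-ins for "call the model with prompt v1/v2", so the example runs without any API.

```python
from typing import Callable, List, Tuple

# One eval case: an input plus a check that encodes an acceptance criterion.
EvalCase = Tuple[str, Callable[[str], bool]]


def pass_rate(run: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output satisfies its check."""
    passed = sum(1 for text, check in cases if check(run(text)))
    return passed / len(cases)


def safe_to_roll_out(run_old, run_new, cases: List[EvalCase]) -> bool:
    """The new version must do at least as well as the old one on the eval set."""
    return pass_rate(run_new, cases) >= pass_rate(run_old, cases)


# Hypothetical stand-ins for real model calls with each prompt version.
def fake_v1(text: str) -> str:
    return text[:20]


def fake_v2(text: str) -> str:
    return text[:20] if text.strip() else "NO_INPUT"


CASES: List[EvalCase] = [
    ("A long article about prompt versioning.", lambda out: len(out) <= 20),
    ("", lambda out: out == "NO_INPUT"),  # empty input must be handled explicitly
]

print(pass_rate(fake_v1, CASES), pass_rate(fake_v2, CASES))
print(safe_to_roll_out(fake_v1, fake_v2, CASES))
```

A pass-rate comparison is a coarse gate. It does not replace reading the individual failures, but it does catch the "it used to work" regressions before rollout.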

Logging prompt versions in your app

Prompt version logging is the difference between “we can debug this” and “we’re guessing.” Log:

  • prompt id and version,
  • schema version (if structured),
  • model name,
  • key settings (e.g., temperature),
  • outcome category and latency.

This connects production behavior back to a specific artifact in git.
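The fields listed above fit naturally into one structured log line per model call. This is a minimal sketch using the standard `logging` and `json` modules. The function name `log_llm_call`, the logger name `llm`, and the outcome label `"ok"` are assumptions for illustration.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("llm")


def log_llm_call(
    *,
    prompt_id: str,
    prompt_version: str,
    schema_version: Optional[str],
    model: str,
    temperature: float,
    outcome: str,
    latency_ms: float,
) -> dict:
    """Emit one JSON log line per model call and return the logged record."""
    record = {
        "prompt_id": prompt_id,
        "prompt_version": prompt_version,
        "schema_version": schema_version,  # None when output is unstructured
        "model": model,
        "temperature": temperature,
        "outcome": outcome,                # hypothetical category label
        "latency_ms": round(latency_ms, 1),
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Keyword-only arguments make it hard to swap `prompt_version` and `schema_version` by accident, and JSON lines keep the fields searchable in most log tooling.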

Migration and backwards compatibility

Prompt and schema changes can break consumers. Reduce breakage by:

  • keeping old versions available for a while,
  • introducing new fields as optional before making them required,
  • supporting multiple schema versions in the parser/validator temporarily,
  • rolling out behind a feature flag (or environment config).

Even small apps benefit from this discipline, because it keeps iteration safe.
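Supporting two schema versions at once can be as simple as a per-version table of required fields plus normalization to the newest shape. The sketch below is hand-rolled for illustration, and the field names `summary`, `bullets`, and `caveats` are hypothetical. A real project would more likely validate against the versioned JSON schema files.

```python
import json

# Hypothetical contracts: v2 adds a required "bullets" field and an
# optional "caveats" field on top of v1's "summary".
REQUIRED_FIELDS = {
    "v1": {"summary"},
    "v2": {"summary", "bullets"},
}


def parse_summary(raw: str, schema_version: str) -> dict:
    """Validate model output against its schema version, then normalize it."""
    if schema_version not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported schema version: {schema_version!r}")
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS[schema_version] - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Normalize to the newest shape so consumers handle a single format.
    return {
        "summary": data["summary"],
        "bullets": data.get("bullets", []),
        "caveats": data.get("caveats"),  # optional; None when absent
    }


print(parse_summary('{"summary": "Short."}', "v1"))
```

Once no logged traffic uses `v1`, its entry can be removed from the table and the old version retired.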

Copy-paste templates

Template: prompt file header

# Prompt: summarize@v2

Purpose:
- Summarize an article into structured bullets.

Inputs:
- article_text: string

Output:
- JSON matching schema summarize/v2.json

Constraints:
- No hallucinated facts; stay grounded in provided text
- If text is missing/ambiguous, say so
- Be concise and structured

Template: prompt changelog entry

summarize@v2
- Changed: added explicit grounding rules
- Changed: tightened schema (required fields)
- Added: handling for empty input
- Notes: expect fewer hallucinations; slightly more refusals on ambiguous inputs

Template: prompt version config

Config:
- PROMPT_SUMMARIZE_VERSION=v2
- SCHEMA_SUMMARIZE_VERSION=v2
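These settings can be read from environment variables so that switching between v1 and v2 is a config change, not a code change. This is a minimal sketch: the helper names, the `KNOWN_VERSIONS` set, and the choice of `v1` as the fallback default are assumptions.

```python
import os
from typing import Mapping

KNOWN_VERSIONS = {"v1", "v2"}  # assumption: the versions that exist in the repo


def read_version(name: str, default: str, env: Mapping[str, str] = os.environ) -> str:
    """Read one version setting, failing fast on values that do not exist."""
    value = env.get(name, default)
    if value not in KNOWN_VERSIONS:
        raise ValueError(f"{name}={value!r} is not a known version")
    return value


def summarize_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve the prompt and schema versions for the summarize task."""
    return {
        "prompt_version": read_version("PROMPT_SUMMARIZE_VERSION", "v1", env),
        "schema_version": read_version("SCHEMA_SUMMARIZE_VERSION", "v1", env),
    }


# Passing a plain dict here stands in for the process environment.
print(summarize_config({"PROMPT_SUMMARIZE_VERSION": "v2", "SCHEMA_SUMMARIZE_VERSION": "v2"}))
```

Rolling back is then a matter of setting both variables to `v1` again, which only works because the old prompt and schema files were never overwritten.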
