1.4 Temperature, randomness, and creativity knobs
On this page
- The short version
- Sampling has two jobs: explore vs exploit
- Temperature (what it does)
- Top-p / nucleus sampling
- Top-k sampling
- Multiple candidates (generate N options)
- Max output tokens and truncation
- Stop sequences
- Repetition controls (if available)
- Practical setting recipes
- Diagnose problems by symptom
- A calibration method that actually works
- Where to go next
The short version
When a model generates text, it does not “choose the answer.” It chooses one next token at a time from a probability distribution. The randomness knobs control how it chooses.
- Lower randomness → more consistent, more conservative, fewer surprises.
- Higher randomness → more diverse, more creative, more variance, more mistakes.
For vibe coding, your goal is to pick settings that match the task: exploration for ideation, exploitation for precise diffs and correctness.
When correctness matters, lower randomness and increase constraints. When you need options, raise randomness and request multiple candidates.
Sampling has two jobs: explore vs exploit
Think of generation settings as a dial between two modes:
- Exploration: “show me many plausible approaches,” “give me alternatives,” “help me brainstorm.”
- Exploitation: “do the safe, most likely thing,” “make a minimal diff,” “be consistent and deterministic.”
Most of the time, vibe coding alternates between these modes:
- Explore for a plan and a few options.
- Exploit for implementation, refactors, and fixes.
- Exploit + verify to lock in correctness (tests/evals).
If the model keeps missing requirements, the fix is usually: clearer constraints, smaller scope, stronger acceptance criteria—not higher temperature.
Temperature (what it does)
Temperature changes how “peaked” or “flat” the next-token probabilities are before sampling.
- Lower temperature makes the model favor the most likely tokens strongly (more predictable).
- Higher temperature flattens the distribution so lower-probability tokens become more likely (more variety).
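To make the mechanism concrete, here is a rough sketch (not any provider's actual implementation) of how temperature rescales a model's raw scores before sampling; the logit values are invented for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into a probability distribution, rescaled by temperature."""
    scaled = np.array(logits, dtype=float) / temperature
    scaled -= scaled.max()              # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 3.0, 1.0, 0.5]           # invented scores for four candidate tokens

print(softmax_with_temperature(logits, 0.2))  # peaked: the top token dominates
print(softmax_with_temperature(logits, 1.0))  # the model's unscaled distribution
print(softmax_with_temperature(logits, 1.8))  # flatter: tail tokens gain probability
```

At 0.2 nearly all of the probability sits on the top token; at 1.8 the distribution flattens and the tail tokens become live options.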
Practical intuition
- Temperature is not creativity. It’s variance. High variance sometimes looks creative; other times it looks messy.
- Temperature doesn’t add knowledge. It doesn’t make the model “smarter.” It makes it more willing to take less-likely paths.
- Low temperature doesn’t guarantee correctness. It just makes the model consistently wrong in the same way if your constraints are missing.
When to use low vs high
- Low: bug fixes, refactors, diffs, structured output, tool calling, “follow this spec exactly.”
- Higher: brainstorming, naming, UX ideas, alternative architectures, “give me 10 approaches.”
- Middle: first-pass scaffolding, draft docs, “good enough but not rigid.”
If your thread contains stale or conflicting instructions, higher temperature increases the chance the model will “follow the wrong thread.” Fix context first.
Top-p / nucleus sampling
Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Instead of considering every possible token, the model samples only from “the most likely tokens that together add up to p.”
Why top-p exists
Probability distributions can have a very long tail. Top-p cuts off the unlikely tail so sampling stays “plausible” even at higher temperatures.
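A minimal sketch of the filtering step, assuming you already have a next-token distribution (the probabilities are invented for illustration):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]               # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # include the token that crosses p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()              # renormalize over the kept tokens

probs = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])  # invented distribution
print(top_p_filter(probs, 0.90))   # the unlikely tail is removed before sampling
```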
Practical intuition
- Lower top-p → more conservative, less diversity.
- Higher top-p → more diverse, more surprising tokens allowed.
Top-p is often a better “diversity knob” than temperature alone, because it explicitly bounds how far into the tail the model can wander.
For reliability: keep temperature low and top-p moderate. For ideation: raise top-p and generate multiple candidates.
Top-k sampling
Top-k restricts sampling to the k most likely next tokens. If k is small, the model has fewer choices; if k is large, it has more.
- Lower top-k → tighter, more repetitive, more stable.
- Higher top-k → more variety, but also more opportunities for nonsense.
Top-k vs top-p
Both limit the candidate tokens, but in different ways:
- Top-k uses a fixed count of tokens.
- Top-p uses a probability mass threshold (dynamic count).
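For comparison, the same style of sketch for top-k (invented probabilities again):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = np.argsort(probs)[::-1][:k]   # indices of the k highest-probability tokens
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])  # invented distribution
print(top_k_filter(probs, 3))
# Top-k always keeps exactly k candidates; top-p keeps however many it takes to
# reach the mass threshold, so its candidate count shrinks when the model is
# confident and grows when it is not.
```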
If your UI exposes both, you typically adjust one primary diversity control (temperature or top-p) and keep the other at a reasonable default. The goal is predictability, not knob-maxing.
Change temperature for “more/less variance,” and use top-p/top-k only if you need tighter control over how wide the sampling can go.
Multiple candidates (generate N options)
Many systems allow generating multiple completions (“candidates”) for the same prompt. This is often the best way to get variety without making any single output chaotic.
Why it’s powerful
- You can keep settings relatively safe while still getting options.
- You can choose the best approach (or merge two) without forcing the model into high-variance mode.
- You can compare candidates against your acceptance criteria.
How to use it well
- Ask for diverse candidates explicitly: “Make each option meaningfully different.”
- Require tradeoffs: “List pros/cons and failure modes for each option.”
- Then switch to low randomness for the chosen implementation.
“Generate 3 candidates. Each must use a different approach. Then recommend one based on these constraints.”
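As one concrete example, the sketch below assumes the OpenAI Python SDK, whose Chat Completions endpoint accepts an `n` parameter; other providers expose candidates differently, and many chat UIs and coding agents don't expose this at all. The model name and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",        # illustrative model name
    temperature=0.9,       # a bit more variance is fine when you're choosing between options
    n=3,                   # ask for three candidates in one request
    messages=[
        {"role": "user", "content": (
            "Generate 3 candidate designs for a rate limiter. "
            "Each must use a different approach. "
            "List pros/cons and failure modes, then recommend one."
        )},
    ],
)

for i, choice in enumerate(response.choices, start=1):
    print(f"--- Candidate {i} ---")
    print(choice.message.content)
```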
Max output tokens and truncation
Max output tokens limits how long the model can generate. This is a safety and cost control, but it also changes behavior.
- If the limit is too low, you’ll get truncated outputs (cut off mid-thought or mid-code).
- If it’s too high, you risk over-generation (extra files, extra explanations, feature creep).
Practical guidance
- For code diffs: keep outputs smaller by asking for a minimal patch and limiting scope.
- For long explanations: prefer “outline first” then expand sections as needed.
- For structured output: keep the schema tight and output limits reasonable so the model can’t wander.
If your JSON is invalid or your code block is missing braces, check whether the output was cut off by the token limit before debugging anything else.
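A minimal version of that check, assuming the OpenAI Python SDK, where a response cut off by the token limit reports `finish_reason == "length"`; other APIs expose an equivalent flag under a different name.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",              # illustrative model name
    max_tokens=300,              # deliberately small to show the failure mode
    messages=[{"role": "user", "content": "Return the config as a single JSON object."}],
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The output was cut off by the token limit; don't bother parsing or debugging it.
    print("Truncated output; raise max_tokens or ask for a smaller piece.")
else:
    print(choice.message.content)
```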
Stop sequences
Stop sequences tell the system: “when you generate this sequence, stop generating more tokens.” They’re a simple but powerful control for preventing rambling.
When stop sequences help
- Structured output: stop after a final delimiter.
- Multi-part outputs: stop after a marker like “END”.
- Tool calling: stop after a JSON/tool call object.
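A sketch assuming the OpenAI Python SDK, which accepts a `stop` parameter (a string or a short list of strings); most provider APIs offer an equivalent, though the parameter name varies.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",           # illustrative model name
    temperature=0.2,
    stop=["\nEND"],           # generation halts as soon as this sequence appears;
                              # the leading newline makes an incidental "END" less likely to trigger it
    messages=[{"role": "user", "content": (
        "Summarize the incident in 3 bullet points, then write END on its own line."
    )}],
)

# The stop sequence itself is not included in the returned text.
print(response.choices[0].message.content)
```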
Common pitfalls
- Stopping too early: your stop string appears naturally inside content (e.g. “END” in a code comment).
- Partial structures: if the stop sequence triggers before closing braces, you’ll get invalid JSON.
- Over-reliance: stop sequences don’t enforce correctness; they only enforce stopping.
They prevent runaway outputs. They don’t replace schema validation, tests, or careful constraints.
Repetition controls (if available)
Some systems expose controls that reduce repetition (for example: repetition penalties or frequency/presence penalties). The exact names vary by provider and UI, but the intent is the same: discourage the model from repeating tokens or themes it has already produced.
When they help
- Long-form writing that loops or rephrases the same sentence.
- Brainstorming where candidates keep converging on the same idea.
- Chatty outputs that restate constraints over and over.
When to be careful
- Code and JSON: repetition controls can hurt correctness because repetition is sometimes required (brackets, keywords, consistent field names).
- Precise instructions: penalties can push the model to “avoid repeating” a key constraint and drift away from it.
For engineering tasks, you usually get better results by controlling scope and using structured output, rather than aggressively tuning repetition penalties.
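If you do reach for them, here is a sketch assuming the OpenAI Python SDK, which exposes `frequency_penalty` and `presence_penalty`; other stacks may use a single `repetition_penalty` with a different range, so treat the values as illustrative.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",            # illustrative model name
    temperature=0.8,
    frequency_penalty=0.4,     # discourage reusing tokens that already appeared often
    presence_penalty=0.4,      # discourage returning to themes already mentioned at all
    messages=[{"role": "user", "content": (
        "Brainstorm 10 names for the CLI; avoid repeating the same theme."
    )}],
)

print(response.choices[0].message.content)
```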
Practical setting recipes
Use these as starting points. The exact numeric ranges depend on the model and UI, but the relative intent is consistent.
Recipe: small, safe code diffs
- Goal: minimal changes, high consistency.
- Settings: low temperature; conservative top-p/top-k; single candidate.
- Prompting: “diff-only changes,” “do not touch these files,” “show verification commands.”
Recipe: debugging from errors
- Goal: plausible hypothesis + smallest fix.
- Settings: low temperature; single candidate.
- Prompting: “propose 2–3 hypotheses,” “ask for missing context,” “patch only what’s necessary.”
Recipe: scaffolding a new project
- Goal: fast structure, acceptable defaults.
- Settings: medium temperature; moderate top-p; optionally 2–3 candidates.
- Prompting: request a plan and a minimal runnable skeleton first; then harden.
Recipe: idea generation and alternatives
- Goal: diverse options with tradeoffs.
- Settings: higher temperature or higher top-p; multiple candidates.
- Prompting: require options to be meaningfully different and to list tradeoffs/failure modes.
Recipe: structured output / JSON
- Goal: stable schema compliance.
- Settings: low temperature; conservative sampling; single candidate; reasonable max output tokens.
- Prompting: include a schema, example output, and strict “output JSON only” constraints; validate output.
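The “validate output” step can be as small as parsing the reply and checking the fields you depend on before trusting it; a minimal sketch using only the standard library, with hypothetical field names:

```python
import json

REQUIRED_FIELDS = {"title", "priority", "owner"}   # hypothetical schema for illustration

def validate_model_output(raw_text: str) -> dict:
    """Parse the model's reply as JSON and check the fields the pipeline depends on."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    return data

print(validate_model_output('{"title": "Fix login bug", "priority": "high", "owner": "ana"}'))
```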
Smaller scope + clear acceptance criteria + verification steps usually outperform any amount of knob tuning.
Diagnose problems by symptom
Symptom: output is chaotic or ignores the spec
- Likely cause: too much variance or unclear constraints.
- Fix: reduce temperature/top-p, generate one candidate, restate constraints at the top, request a small diff.
Symptom: output is repetitive or unimaginative
- Likely cause: sampling is too conservative, or your prompt is over-constraining.
- Fix: ask for multiple candidates; increase diversity slightly; explicitly request alternatives.
Symptom: same prompt gives wildly different results
- Likely cause: high randomness settings and/or a messy, contradictory context.
- Fix: tighten context (summary + working set), lower randomness, add explicit acceptance criteria.
Symptom: incomplete code/JSON
- Likely cause: max output tokens too low, or stop sequence triggering early.
- Fix: raise max tokens slightly, adjust stop sequences, ask for output in smaller parts.
Change one thing at a time, and verify with a small test set. Otherwise you won’t know what caused the improvement.
A calibration method that actually works
If you want settings that reliably work for your tasks, calibrate like an engineer:
- Choose a task type (e.g. “small diffs,” “JSON extraction,” “brainstorming”).
- Write one canonical prompt with clear constraints and acceptance criteria.
- Run 5 trials at one setting and save the outputs.
- Change one knob (temperature or top-p) and run 5 more trials.
- Score the outputs against your criteria (did it follow constraints? did it pass tests? was it verbose?).
- Lock a default for that task type.
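A sketch of that loop; `run_prompt` and `passes_criteria` are hypothetical placeholders for your own model call and your own acceptance checks (tests, schema validation, constraint checks):

```python
import random

def run_prompt(prompt: str, temperature: float) -> str:
    """Placeholder for one model call; wire this to your provider's API."""
    return "dummy output"                 # hypothetical stand-in

def passes_criteria(output: str) -> bool:
    """Placeholder acceptance check; wire this to tests/schema/constraint checks."""
    return random.random() < 0.7          # hypothetical pass rate so the sketch runs

CANONICAL_PROMPT = "Apply the minimal change described in SPEC.md; output a unified diff only."
TRIALS = 5

def success_rate(temperature: float) -> float:
    """Run the canonical prompt TRIALS times at one setting and score it."""
    passes = sum(passes_criteria(run_prompt(CANONICAL_PROMPT, temperature))
                 for _ in range(TRIALS))
    return passes / TRIALS

# Change one knob at a time and compare before locking a default.
for temp in (0.1, 0.4, 0.8):
    print(f"temperature={temp}: success rate {success_rate(temp):.0%} over {TRIALS} trials")
```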
This turns “creativity knobs” into a measurable workflow: you pick settings that maximize success rate, not vibes.
For coding tasks, optimize for: success rate per attempt and time-to-verified, not “how impressive the first answer looked.”