1.4 Temperature, randomness, and creativity knobs
On this page
- The short version
- Sampling has two jobs: explore vs exploit
- Temperature (what it does)
- Top-p / nucleus sampling
- Top-k sampling
- Multiple candidates (generate N options)
- Max output tokens and truncation
- Stop sequences
- Repetition controls (if available)
- Practical setting recipes
- Diagnose problems by symptom
- A calibration method that actually works
- Where to go next
The short version
When a model generates text, it does not “choose the answer.” It chooses one next token at a time from a probability distribution. The randomness knobs control how it chooses.
- Lower randomness → more consistent, more conservative, fewer surprises.
- Higher randomness → more diverse, more creative, more variance, more mistakes.
For vibe coding, your goal is to pick settings that match the task: exploration for ideation, exploitation for precise diffs and correctness.
When correctness matters, lower randomness and increase constraints. When you need options, raise randomness and request multiple candidates.
Sampling has two jobs: explore vs exploit
Think of generation settings as a dial between two modes:
- Exploration: “show me many plausible approaches,” “give me alternatives,” “help me brainstorm.”
- Exploitation: “do the safe, most likely thing,” “make a minimal diff,” “be consistent and deterministic.”
Most of the time, vibe coding alternates between these modes:
- Explore for a plan and a few options.
- Exploit for implementation, refactors, and fixes.
- Exploit + verify to lock in correctness (tests/evals).
If the model keeps missing requirements, the fix is usually: clearer constraints, smaller scope, stronger acceptance criteria—not higher temperature.
Temperature (what it does)
Temperature changes how “peaked” or “flat” the next-token probabilities are before sampling.
- Lower temperature makes the model favor the most likely tokens strongly (more predictable).
- Higher temperature flattens the distribution so lower-probability tokens become more likely (more variety).
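To make the mechanism concrete, here is a rough sketch (not any provider's actual implementation) of how temperature rescales a model's raw scores before sampling; the logit values are invented for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into a probability distribution, rescaled by temperature."""
    scaled = np.array(logits, dtype=float) / temperature
    scaled -= scaled.max()              # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 3.0, 1.0, 0.5]           # invented scores for four candidate tokens

print(softmax_with_temperature(logits, 0.2))  # peaked: the top token dominates
print(softmax_with_temperature(logits, 1.0))  # the model's unscaled distribution
print(softmax_with_temperature(logits, 1.8))  # flatter: tail tokens gain probability
```

At 0.2 nearly all of the probability sits on the top token; at 1.8 the distribution flattens and the tail tokens become live options.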
Practical intuition
- Temperature is not creativity. It’s variance. High variance sometimes looks creative; other times it looks messy.
- Temperature doesn’t add knowledge. It doesn’t make the model “smarter.” It makes it more willing to take less-likely paths.
- Low temperature doesn’t guarantee correctness. It just makes the model consistently wrong in the same way if your constraints are missing.
When to use low vs high
- Low: bug fixes, refactors, diffs, structured output, tool calling, “follow this spec exactly.”
- Higher: brainstorming, naming, UX ideas, alternative architectures, “give me 10 approaches.”
- Middle: first-pass scaffolding, draft docs, “good enough but not rigid.”
If your thread contains stale or conflicting instructions, higher temperature increases the chance the model will “follow the wrong thread.” Fix context first.
Top-p / nucleus sampling
Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Instead of considering every possible token, the model samples only from “the most likely tokens that together add up to p.”
Why top-p exists
Probability distributions can have a very long tail. Top-p cuts off the unlikely tail so sampling stays “plausible” even at higher temperatures.
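A minimal sketch of the filtering step, assuming you already have a next-token distribution (the probabilities are invented for illustration):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]               # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # include the token that crosses p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()              # renormalize over the kept tokens

probs = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])  # invented distribution
print(top_p_filter(probs, 0.90))   # the unlikely tail is removed before sampling
```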
Practical intuition
- Lower top-p → more conservative, less diversity.
- Higher top-p → more diverse, more surprising tokens allowed.
Top-p is often a better “diversity knob” than temperature alone, because it explicitly bounds how far into the tail the model can wander.
For reliability: keep temperature low and top-p moderate. For ideation: raise top-p and generate multiple candidates.
Top-k sampling
Top-k restricts sampling to the k most likely next tokens. If k is small, the model has fewer choices; if k is large, it has more.
- Lower top-k → tighter, more repetitive, more stable.
- Higher top-k → more variety, but also more opportunities for nonsense.
Top-k vs top-p
Both limit the candidate tokens, but in different ways:
- Top-k uses a fixed count of tokens.
- Top-p uses a probability mass threshold (dynamic count).
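For comparison, the same style of sketch for top-k (invented probabilities again):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = np.argsort(probs)[::-1][:k]   # indices of the k highest-probability tokens
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])  # invented distribution
print(top_k_filter(probs, 3))
# Top-k always keeps exactly k candidates; top-p keeps however many it takes to
# reach the mass threshold, so its candidate count shrinks when the model is
# confident and grows when it is not.
```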
If your UI exposes both, you typically adjust one primary diversity control (temperature or top-p) and keep the other at a reasonable default. The goal is predictability, not knob-maxing.
Change temperature for “more/less variance,” and use top-p/top-k only if you need tighter control over how wide the sampling can go.
Multiple candidates (generate N options)
Many systems allow generating multiple completions (“candidates”) for the same prompt. This is often the best way to get variety without making any single output chaotic.
Why it’s powerful
- You can keep settings relatively safe while still getting options.
- You can choose the best approach (or merge two) without forcing the model into high-variance mode.
- You can compare candidates against your acceptance criteria.
How to use it well
- Ask for diverse candidates explicitly: “Make each option meaningfully different.”
- Require tradeoffs: “List pros/cons and failure modes for each option.”
- Then switch to low randomness for the chosen implementation.
“Generate 3 candidates. Each must use a different approach. Then recommend one based on these constraints.”
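As one concrete example, the sketch below assumes the OpenAI Python SDK, whose Chat Completions endpoint accepts an `n` parameter; other providers expose candidates differently, and many chat UIs and coding agents don't expose this at all. The model name and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",        # illustrative model name
    temperature=0.9,       # a bit more variance is fine when you're choosing between options
    n=3,                   # ask for three candidates in one request
    messages=[
        {"role": "user", "content": (
            "Generate 3 candidate designs for a rate limiter. "
            "Each must use a different approach. "
            "List pros/cons and failure modes, then recommend one."
        )},
    ],
)

for i, choice in enumerate(response.choices, start=1):
    print(f"--- Candidate {i} ---")
    print(choice.message.content)
```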
Max output tokens and truncation
Max output tokens limits how long the model can generate. This is a safety and cost control, but it also changes behavior.
- If the limit is too low, you’ll get truncated outputs (cut off mid-thought or mid-code).
- If it’s too high, you risk over-generation (extra files, extra explanations, feature creep).
Practical guidance
- For code diffs: keep outputs smaller by asking for a minimal patch and limiting scope.
- For long explanations: prefer “outline first” then expand sections as needed.
- For structured output: keep the schema tight and output limits reasonable so the model can’t wander.
If your JSON is invalid or your code block is missing braces, check whether the output was cut off by the token limit before debugging anything else.
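A minimal version of that check, assuming the OpenAI Python SDK, where a response cut off by the token limit reports `finish_reason == "length"`; other APIs expose an equivalent flag under a different name.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",              # illustrative model name
    max_tokens=300,              # deliberately small to show the failure mode
    messages=[{"role": "user", "content": "Return the config as a single JSON object."}],
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # The output was cut off by the token limit; don't bother parsing or debugging it.
    print("Truncated output; raise max_tokens or ask for a smaller piece.")
else:
    print(choice.message.content)
```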
Stop sequences
Stop sequences tell the system: “when you generate this sequence, stop generating more tokens.” They’re a simple but powerful control for preventing rambling.
When stop sequences help
- Structured output: stop after a final delimiter.
- Multi-part outputs: stop after a marker like “END”.
- Tool calling: stop after a JSON/tool call object.
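A sketch assuming the OpenAI Python SDK, which accepts a `stop` parameter (a string or a short list of strings); most provider APIs offer an equivalent, though the parameter name varies.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",           # illustrative model name
    temperature=0.2,
    stop=["\nEND"],           # generation halts as soon as this sequence appears;
                              # the leading newline makes an incidental "END" less likely to trigger it
    messages=[{"role": "user", "content": (
        "Summarize the incident in 3 bullet points, then write END on its own line."
    )}],
)

# The stop sequence itself is not included in the returned text.
print(response.choices[0].message.content)
```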
Common pitfalls
- Stopping too early: your stop string appears naturally inside content (e.g. “END” in a code comment).
- Partial structures: if the stop sequence triggers before closing braces, you’ll get invalid JSON.
- Over-reliance: stop sequences don’t enforce correctness; they only enforce stopping.
They prevent runaway outputs. They don’t replace schema validation, tests, or careful constraints.
Repetition controls (if available)
Some systems expose controls that reduce repetition (for example: repetition penalties or frequency/presence penalties). The exact names vary by provider and UI, but the intent is the same: discourage the model from repeating tokens or themes it has already produced.
When they help
- Long-form writing that loops or rephrases the same sentence.
- Brainstorming where candidates keep converging on the same idea.
- Chatty outputs that restate constraints over and over.
When to be careful
- Code and JSON: repetition controls can hurt correctness because repetition is sometimes required (brackets, keywords, consistent field names).
- Precise instructions: penalties can push the model to “avoid repeating” a key constraint and drift away from it.
For engineering tasks, you usually get better results by controlling scope and using structured output, rather than aggressively tuning repetition penalties.
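If you do reach for them, here is a sketch assuming the OpenAI Python SDK, which exposes `frequency_penalty` and `presence_penalty`; other stacks may use a single `repetition_penalty` with a different range, so treat the values as illustrative.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",            # illustrative model name
    temperature=0.8,
    frequency_penalty=0.4,     # discourage reusing tokens that already appeared often
    presence_penalty=0.4,      # discourage returning to themes already mentioned at all
    messages=[{"role": "user", "content": (
        "Brainstorm 10 names for the CLI; avoid repeating the same theme."
    )}],
)

print(response.choices[0].message.content)
```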
Practical setting recipes
Use these as starting points. The exact numeric ranges depend on the model and UI, but the relative intent is consistent.
Recipe: small, safe code diffs
- Goal: minimal changes, high consistency.
- Settings: low temperature; conservative top-p/top-k; single candidate.
- Prompting: “diff-only changes,” “do not touch these files,” “show verification commands.”
Recipe: debugging from errors
- Goal: plausible hypothesis + smallest fix.
- Settings: low temperature; single candidate.
- Prompting: “propose 2–3 hypotheses,” “ask for missing context,” “patch only what’s necessary.”
Recipe: scaffolding a new project
- Goal: fast structure, acceptable defaults.
- Settings: medium temperature; moderate top-p; optionally 2–3 candidates.
- Prompting: request a plan and a minimal runnable skeleton first; then harden.
Recipe: idea generation and alternatives
- Goal: diverse options with tradeoffs.
- Settings: higher temperature or higher top-p; multiple candidates.
- Prompting: require options to be meaningfully different and to list tradeoffs/failure modes.
Recipe: structured output / JSON
- Goal: stable schema compliance.
- Settings: low temperature; conservative sampling; single candidate; reasonable max output tokens.
- Prompting: include a schema, example output, and strict “output JSON only” constraints; validate output.
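The “validate output” step can be as small as parsing the reply and checking the fields you depend on before trusting it; a minimal sketch using only the standard library, with hypothetical field names:

```python
import json

REQUIRED_FIELDS = {"title", "priority", "owner"}   # hypothetical schema for illustration

def validate_model_output(raw_text: str) -> dict:
    """Parse the model's reply as JSON and check the fields the pipeline depends on."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    return data

print(validate_model_output('{"title": "Fix login bug", "priority": "high", "owner": "ana"}'))
```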
Smaller scope + clear acceptance criteria + verification steps usually outperform any amount of knob tuning.
Diagnose problems by symptom
Symptom: output is chaotic or ignores the spec
- Likely cause: too much variance or unclear constraints.
- Fix: reduce temperature/top-p, generate one candidate, restate constraints at the top, request a small diff.
Symptom: output is repetitive or unimaginative
- Likely cause: sampling is too conservative, or your prompt is over-constraining.
- Fix: ask for multiple candidates; increase diversity slightly; explicitly request alternatives.
Symptom: same prompt gives wildly different results
- Likely cause: high randomness settings and/or a messy, contradictory context.
- Fix: tighten context (summary + working set), lower randomness, add explicit acceptance criteria.
Symptom: incomplete code/JSON
- Likely cause: max output tokens too low, or stop sequence triggering early.
- Fix: raise max tokens slightly, adjust stop sequences, ask for output in smaller parts.
Change one thing at a time, and verify with a small test set. Otherwise you won’t know what caused the improvement.
A calibration method that actually works
If you want settings that reliably work for your tasks, calibrate like an engineer:
- Choose a task type (e.g. “small diffs,” “JSON extraction,” “brainstorming”).
- Write one canonical prompt with clear constraints and acceptance criteria.
- Run 5 trials at one setting and save the outputs.
- Change one knob (temperature or top-p) and run 5 more trials.
- Score the outputs against your criteria (did it follow constraints? did it pass tests? was it verbose?).
- Lock a default for that task type.
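A sketch of that loop; `run_prompt` and `passes_criteria` are hypothetical placeholders for your own model call and your own acceptance checks (tests, schema validation, constraint checks):

```python
import random

def run_prompt(prompt: str, temperature: float) -> str:
    """Placeholder for one model call; wire this to your provider's API."""
    return "dummy output"                 # hypothetical stand-in

def passes_criteria(output: str) -> bool:
    """Placeholder acceptance check; wire this to tests/schema/constraint checks."""
    return random.random() < 0.7          # hypothetical pass rate so the sketch runs

CANONICAL_PROMPT = "Apply the minimal change described in SPEC.md; output a unified diff only."
TRIALS = 5

def success_rate(temperature: float) -> float:
    """Run the canonical prompt TRIALS times at one setting and score it."""
    passes = sum(passes_criteria(run_prompt(CANONICAL_PROMPT, temperature))
                 for _ in range(TRIALS))
    return passes / TRIALS

# Change one knob at a time and compare before locking a default.
for temp in (0.1, 0.4, 0.8):
    print(f"temperature={temp}: success rate {success_rate(temp):.0%} over {TRIALS} trials")
```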
This turns “creativity knobs” into a measurable workflow: you pick settings that maximize success rate, not vibes.
For coding tasks, optimize for: success rate per attempt and time-to-verified, not “how impressive the first answer looked.”