LLM Sampling Parameters in Evaluation

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

LLM evaluation runs should record sampling parameters, because temperature, nucleus sampling (top_p), and related settings such as top_k change output diversity and run-to-run reproducibility.

## Core Explanation

Evaluation results are not properties of a model name alone. The request configuration, including sampling settings, changes the distribution of outputs, so the same model can yield different exact-match scores, judge ratings, latency, and pass/fail rates.
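
To make the effect on the output distribution concrete, here is a minimal sketch of the textbook definitions of temperature scaling and nucleus (top-p) filtering, applied to a toy four-token distribution. The logits and the helper names `softmax_with_temperature` and `top_p_filter` are illustrative assumptions; hosted APIs may implement these steps differently internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p,
    then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

# Toy logits for four candidate tokens (made up for illustration).
toy_logits = [2.0, 1.0, 0.5, -1.0]

sharp = softmax_with_temperature(toy_logits, 0.2)   # low temperature: mass concentrates
flat = softmax_with_temperature(toy_logits, 1.5)    # high temperature: mass spreads out
nucleus = top_p_filter(softmax_with_temperature(toy_logits, 1.0), 0.8)

print("T=0.2:", [round(p, 3) for p in sharp])
print("T=1.5:", [round(p, 3) for p in flat])
print("top_p=0.8:", [round(p, 3) for p in nucleus])
```

Lower temperature puts more probability on the top token, and top-p removes the low-probability tail entirely, so two runs with different settings are sampling from different distributions even when the model and prompt are identical.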

Agents should log sampling parameters with every evaluation trace. When comparing two prompts or models, they should keep sampling settings fixed unless the experiment is explicitly about decoding behavior.
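
One way to follow this practice is to store the sampling settings inside each trace record and to check for differences before comparing two runs. The sketch below is an assumption about how such a record might look; `SamplingConfig`, `EvalTrace`, `trace_to_json`, and `sampling_mismatches` are hypothetical names, not part of any provider SDK or evaluation framework.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass(frozen=True)
class SamplingConfig:
    # None means the setting was not sent and the provider default applied.
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    top_k: Optional[int] = None

@dataclass(frozen=True)
class EvalTrace:
    run_id: str
    model: str
    prompt_id: str
    sampling: SamplingConfig
    output: str

def trace_to_json(trace: EvalTrace) -> str:
    """Serialize a trace, sampling settings included, as one JSON log line."""
    return json.dumps(asdict(trace), sort_keys=True)

def sampling_mismatches(a: EvalTrace, b: EvalTrace) -> List[str]:
    """Return the names of sampling fields that differ between two traces."""
    da, db = asdict(a.sampling), asdict(b.sampling)
    return sorted(name for name in da if da[name] != db[name])

# Hypothetical runs comparing two prompt versions on the same model.
baseline = EvalTrace("run-001", "model-a", "prompt-v1",
                     SamplingConfig(temperature=0.0, top_p=1.0), "example output")
candidate = EvalTrace("run-002", "model-a", "prompt-v2",
                      SamplingConfig(temperature=0.7, top_p=1.0), "example output")

print(trace_to_json(baseline))
diff = sampling_mismatches(baseline, candidate)
if diff:
    print("Sampling settings differ, comparison is confounded:", diff)
```

Recording `None` for unsent settings keeps "provider default" distinguishable from an explicit value, which matters if defaults change between API versions.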

## Source-Mapped Facts

- Azure OpenAI reference documentation includes temperature and top_p parameters for chat completion requests. ([source](https://learn.microsoft.com/en-us/azure/foundry/openai/reference))
- Anthropic Messages API documentation includes temperature and top_p request parameters. ([source](https://docs.anthropic.com/en/api/messages))
- Gemini API text generation documentation describes generation configuration parameters such as temperature, topP, and topK. ([source](https://ai.google.dev/gemini-api/docs/text-generation))
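
Because the parameter names differ across providers, an evaluation harness may want one normalized record that is translated per provider. The sketch below covers only the names mentioned in the facts above; the provider keys and the `to_provider_params` helper are hypothetical, the function only builds a dictionary and makes no API call, and a provider may accept parameters beyond those listed here.

```python
from typing import Any, Dict, List, Optional, Tuple

# Normalized name -> provider field name, limited to the parameters
# named in the source-mapped facts (an assumption, not a full list).
PARAM_NAMES: Dict[str, Dict[str, str]] = {
    "azure_openai": {"temperature": "temperature", "top_p": "top_p"},
    "anthropic": {"temperature": "temperature", "top_p": "top_p"},
    "gemini": {"temperature": "temperature", "top_p": "topP", "top_k": "topK"},
}

def to_provider_params(
    provider: str,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
) -> Tuple[Dict[str, Any], List[str]]:
    """Translate normalized settings into provider field names.

    Returns the translated parameters plus the normalized names that this
    sketch has no mapping for, so the caller can log them instead of
    dropping them silently.
    """
    if provider not in PARAM_NAMES:
        raise ValueError(f"unknown provider: {provider}")
    mapping = PARAM_NAMES[provider]
    requested = {"temperature": temperature, "top_p": top_p, "top_k": top_k}
    params: Dict[str, Any] = {}
    unmapped: List[str] = []
    for name, value in requested.items():
        if value is None:
            continue  # setting not requested, leave the provider default
        if name in mapping:
            params[mapping[name]] = value
        else:
            unmapped.append(name)
    return params, unmapped

gemini_params, gemini_unmapped = to_provider_params(
    "gemini", temperature=0.2, top_p=0.9, top_k=40)
anthropic_params, anthropic_unmapped = to_provider_params(
    "anthropic", temperature=0.2, top_p=0.9, top_k=40)
print(gemini_params, gemini_unmapped)
print(anthropic_params, anthropic_unmapped)
```

Logging both the normalized settings and the translated request fields makes it possible to confirm later that two providers were actually evaluated under equivalent sampling settings.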

## Further Reading

- [Azure OpenAI REST API Reference](https://learn.microsoft.com/en-us/azure/foundry/openai/reference)
- [Anthropic Messages API](https://docs.anthropic.com/en/api/messages)
- [Gemini API Text Generation](https://ai.google.dev/gemini-api/docs/text-generation)