# Code Generation Evaluation with pass@k

Status: public · Confidence: medium (0.61) · Basis: verified_sources

## TL;DR

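pass@k is a code-generation evaluation metric that estimates the probability that at least one of k sampled solutions to a problem passes that problem's benchmark tests. Benchmark scores are typically the average of this value across all problems.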

## Core Explanation

Code-generation evaluation is different from text evaluation because executable tests can check behavioral correctness. Benchmarks such as HumanEval and MBPP provide prompts and tests; a model samples one or more candidate programs, then the evaluator runs those candidates in a sandbox.
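The sample-then-execute loop described above can be sketched in a few lines. This is a minimal illustration, not the harness used by any particular benchmark: the function name `run_candidate`, the file name `candidate.py`, and the 5-second default timeout are all assumptions made for the example. It also only provides process isolation and a timeout; a real sandbox would additionally restrict network, filesystem, and resource access.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate plus its tests exits with status 0 in time.

    Hypothetical helper for illustration only. A failing assertion, an
    uncaught exception, or a timeout all count as a failed candidate.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                # -I runs the interpreter in isolated mode (ignores user
                # site-packages and PYTHON* environment variables).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(run_candidate(candidate, tests))
```

Running each candidate in a fresh subprocess and temporary directory keeps one sample's crash or infinite loop from affecting the others, which is the property the evaluator needs before it can count passes.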

pass@k is useful for comparing sampling strategies and model capability, but it can overstate real engineering quality when tests are weak. Stronger evaluation adds hidden tests, mutation-style edge cases, dependency isolation, security checks, and task suites that cover multi-file or repository-level changes.
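In practice pass@k is usually not measured by drawing exactly k samples. A common approach, introduced alongside HumanEval, is to draw n ≥ k samples per problem, count the c samples that pass, and compute the unbiased estimate 1 − C(n−c, k) / C(n, k), then average over problems. The sketch below assumes that formulation; the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` tuple layout are illustrative choices, not an existing API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number of samples that passed,
    k: the k in pass@k. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative counts only: two problems, 10 samples each.
    print(mean_pass_at_k([(10, 3), (10, 0)], k=1))
```

For k = 1 the estimate reduces to c / n, the fraction of samples that passed. Note that the estimator only counts test passes, so it inherits whatever blind spots the test suite has.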

## Source-Mapped Facts

- The OpenAI HumanEval evaluation harness reports pass@1, pass@10, and pass@100 values when evaluating generated code samples. ([source](https://github.com/openai/human-eval))
- The Google Research repository includes an MBPP directory for the Mostly Basic Python Problems benchmark. ([source](https://github.com/google-research/google-research/tree/master/mbpp))
- The EvalPlus repository describes rigorous evaluation of LLM-synthesized code and provides enhanced benchmark test suites. ([source](https://github.com/evalplus/evalplus))

## Further Reading

- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)
- [EvalPlus](https://github.com/evalplus/evalplus)