# Code Generation Evaluation with pass@k

Status: public
Confidence: medium (0.61) (verified)
Last verified: 2026-06-02
Generation: ai_structured

## TL;DR

pass@k is a code-generation evaluation metric that estimates the probability that at least one of k sampled solutions passes the benchmark tests.

## Core Explanation

Code-generation evaluation differs from text evaluation because executable tests can check behavioral correctness. Benchmarks such as HumanEval and MBPP provide prompts and tests; a model samples one or more candidate programs, and the evaluator runs those candidates in a sandbox.

pass@k is useful for comparing sampling strategies and model capability, but it can overstate real engineering quality when tests are weak. Stronger evaluation adds hidden tests, mutation-style edge cases, dependency isolation, security checks, and task suites that cover multi-file or repository-level changes.

## Source-Mapped Facts

- The OpenAI HumanEval evaluation harness reports pass@1, pass@10, and pass@100 values when evaluating generated code samples. ([source](https://github.com/openai/human-eval))
- The Google Research repository includes an MBPP directory for the Mostly Basic Python Problems benchmark. ([source](https://github.com/google-research/google-research/tree/master/mbpp))
- The EvalPlus repository describes rigorous evaluation of LLM-synthesized code and provides enhanced benchmark test suites. ([source](https://github.com/evalplus/evalplus))

## Further Reading

- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)
- [EvalPlus](https://github.com/evalplus/evalplus)
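
## Example: Estimating pass@k

To make the pass@k definition concrete, here is a minimal sketch of one common way to compute it. It assumes the combinatorial estimator often used with HumanEval-style evaluation: draw n samples per problem, count the c samples that pass the tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from any of the harnesses cited above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of sampled candidate programs
    c: number of those candidates that passed all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs (assumed layout)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative counts only, not real benchmark results.
    per_problem = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(per_problem, k=1))
    print(mean_pass_at_k(per_problem, k=5))
```

Averaging over many samples per problem, rather than drawing exactly k samples once, reduces the variance of the estimate. The metric still only reflects how strong the underlying tests are.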