LLM Evaluation: Inspect AI Tasks and Scorers

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

The task definition and run logs of an Inspect AI evaluation tell agents how that evaluation was assembled: which samples were run, which solver generated the answers, and which scorer judged them.

## Core Explanation

An LLM evaluation is hard to debug when the result is just a pass rate. Inspect AI makes the operational pieces explicit: task, dataset, sample fields, solver, scorer, model, and logs. That structure helps agents distinguish prompt failures, solver failures, grading failures, and dataset coverage gaps.

Before changing prompts or declaring a model regression, agents should inspect the task code, sample IDs, input and target fields, solver chain, scorer type, model grader configuration, sandbox settings, run logs, and per-sample scores.
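
As a rough illustration of reading per-sample results instead of a single pass rate, the sketch below sorts records into the failure categories named above. The `SampleRecord` fields, the `triage` function, and the category labels are hypothetical and are not Inspect AI's log schema; a real workflow would populate such records from the run logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRecord:
    """Hypothetical per-sample record; not Inspect AI's log format."""
    sample_id: str
    output: str                  # model completion, "" if nothing was produced
    target: str                  # expected answer from the dataset
    score: float                 # 1.0 = scored correct, 0.0 = scored incorrect
    error: Optional[str] = None  # solver or sandbox error message, if any


def triage(record: SampleRecord) -> str:
    """Assign one coarse failure category to a sample (assumed heuristics)."""
    if record.error is not None:
        return "solver_or_sandbox_error"
    if not record.output.strip():
        return "empty_output"
    if record.score == 0.0 and record.target.lower() in record.output.lower():
        # The target text appears in the output but the scorer rejected it:
        # check the scorer before blaming the model.
        return "possible_grading_failure"
    if record.score == 0.0:
        return "wrong_answer"
    return "ok"


records = [
    SampleRecord("s1", "Paris", "Paris", 1.0),
    SampleRecord("s2", "The capital is Paris.", "Paris", 0.0),
    SampleRecord("s3", "", "Berlin", 0.0),
    SampleRecord("s4", "Madrid", "Rome", 0.0),
    SampleRecord("s5", "", "Oslo", 0.0, error="sandbox timeout"),
]

summary = Counter(triage(r) for r in records)
for category, count in sorted(summary.items()):
    print(f"{category}: {count}")
```

The five records give an aggregate pass rate of 20%, yet only one of the four failures (`s4`) is a plain wrong answer. Reading the per-sample records is what surfaces the other three causes.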

## Source-Mapped Facts

- Inspect AI documentation says tasks provide a recipe for an evaluation consisting minimally of a dataset, a solver, and a scorer. ([source](https://inspect.aisi.org.uk/tasks.html))
- Inspect AI documentation says tasks are returned from functions decorated with `@task`. ([source](https://inspect.aisi.org.uk/tasks.html))
- Inspect AI documentation says the `Sample` data type has a required `input` field and optional fields such as `choices`, `target`, `id`, and `metadata`. ([source](https://inspect.aisi.org.uk/datasets.html))
- Inspect AI documentation says Inspect includes both text-matching scorers and model-graded scorers. ([source](https://inspect.aisi.org.uk/standard-scorers.html))

## Further Reading

- [Inspect AI Tasks](https://inspect.aisi.org.uk/tasks.html)
- [Inspect AI Datasets](https://inspect.aisi.org.uk/datasets.html)
- [Inspect AI Standard Scorers](https://inspect.aisi.org.uk/standard-scorers.html)