LLM Evaluation: Terminal-Bench Command-Line Agent Benchmarks

Status: public · Confidence: medium (0.79) · Basis: verified_sources

## TL;DR

Terminal-Bench evaluates agents on executable terminal tasks, so it measures whether an agent can complete real command-line work, not only whether it can produce a plausible text answer.

## Core Explanation

Many agent benchmarks score only the final answer, but a command-line agent has to edit files, run commands, inspect failures, and satisfy automated tests inside an environment. In Terminal-Bench, the harness, task dataset, sandbox, and verification scripts are all part of the evaluation surface, so a change to any of them can change what a score means.
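To make the difference between "final answer" and "executable task" concrete, here is a minimal sketch of outcome-based verification. It is not Terminal-Bench's harness: `run_task` and `verify_greeting` are hypothetical names, and a temporary directory stands in for a real sandbox, which it is not (it provides no isolation). The point is only that the verifier inspects the resulting environment state, not the agent's text output.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(agent_commands, verify):
    """Run each command in a fresh scratch directory, then apply the verifier.

    Hypothetical helper: a temp directory is NOT a sandbox, it only keeps
    the example self-contained.
    """
    with tempfile.TemporaryDirectory() as workdir:
        log = []
        for argv in agent_commands:
            proc = subprocess.run(
                argv, cwd=workdir, capture_output=True, text=True, timeout=30
            )
            # Keep a per-command log so failures can be inspected later.
            log.append(
                {
                    "argv": argv,
                    "returncode": proc.returncode,
                    "stdout": proc.stdout,
                    "stderr": proc.stderr,
                }
            )
        # The verifier checks the state left behind, not what the agent said.
        passed = verify(Path(workdir))
    return passed, log


def verify_greeting(workdir):
    """Hypothetical task check: greeting.txt must exist and contain 'hello'."""
    target = workdir / "greeting.txt"
    return target.exists() and target.read_text().strip() == "hello"


# A stand-in for agent output: one command that creates the expected file.
commands = [[sys.executable, "-c", "open('greeting.txt', 'w').write('hello\\n')"]]
passed, log = run_task(commands, verify_greeting)
```

An agent that merely claims to have written the file, without running a command that does so, would fail this check, which is the property that answer-only benchmarks cannot test.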

Eval systems should record, for every run, the benchmark version, task IDs, sandbox image, adapter, model, allowed tools, per-test pass/fail results, logs, retries, and resource limits. Without that metadata, a score is hard to reproduce and cannot be compared fairly with scores from other runs.
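One way to capture this metadata is a small per-task record. The sketch below is an assumption about how such a record could look, not a format defined by Terminal-Bench: the `RunRecord` class, its field names, and the `config_fingerprint` helper are all hypothetical, and the example values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-task record; field names are illustrative only."""

    benchmark_version: str
    task_id: str
    sandbox_image: str
    adapter: str
    model: str
    allowed_tools: tuple
    passed: bool
    retries: int  # number of retries actually used for this task
    resource_limits: dict = field(default_factory=dict)
    log_path: str = ""

    def config_fingerprint(self):
        """Hash only the setup fields, so runs can be grouped by configuration."""
        config = asdict(self)
        # Drop the task identity and outcome fields; what remains is the setup.
        for key in ("task_id", "passed", "retries", "log_path"):
            config.pop(key)
        blob = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


# Placeholder values, not real task IDs, images, or model names.
record = RunRecord(
    benchmark_version="2.0",
    task_id="example-task",
    sandbox_image="example/image:tag",
    adapter="example-adapter",
    model="example-model",
    allowed_tools=("shell",),
    passed=True,
    retries=0,
    resource_limits={"timeout_seconds": 600},
    log_path="logs/example-task.log",
)
print(json.dumps(asdict(record), sort_keys=True))
print(record.config_fingerprint())
```

Two scores are only directly comparable when their fingerprints match. If the sandbox image, adapter, or tool allowance differs, the fingerprint changes and the difference in setup is visible before anyone compares pass rates.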

## Source-Mapped Facts

- The Terminal-Bench repository describes Terminal-Bench as a benchmark for testing AI agents in real terminal environments. ([source](https://github.com/harbor-framework/terminal-bench))
- The Terminal-Bench repository says Terminal-Bench consists of a task dataset and an execution harness that connects a language model to a terminal sandbox. ([source](https://github.com/harbor-framework/terminal-bench))
- The Terminal-Bench arXiv paper says Terminal-Bench 2.0 contains 89 tasks in computer terminal environments. ([source](https://arxiv.org/abs/2601.11868))

## Further Reading

- [Terminal-Bench GitHub Repository](https://github.com/harbor-framework/terminal-bench)
- [Terminal-Bench arXiv Paper](https://arxiv.org/abs/2601.11868)