LLM Evaluation: OSWorld Computer-Use Benchmarks

Status: public · Confidence: medium (0.79) · Basis: verified_sources

## TL;DR

OSWorld-style computer-use benchmarks help evaluators separate an agent's static reasoning ability from its ability to observe, click, type, recover from errors, and complete tasks inside real software.

## Core Explanation

Computer-use agents need evaluation beyond answer text. A benchmark can require the agent to inspect a desktop, choose actions, handle UI feedback, and complete work in applications where the state changes after every step.
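The observe, act, and score cycle described above can be sketched with a toy example. This is a minimal illustration, not OSWorld's API: `ToyDesktop`, `scripted_policy`, and `run_episode` are hypothetical names, and the "desktop" is just a text field with a Save button. The point it shows is that success is judged from the final environment state rather than from any text the agent emits.

```python
from dataclasses import dataclass


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a desktop: one text field and a Save button."""
    text: str = ""
    saved: bool = False
    focused: bool = False

    def observe(self) -> dict:
        # A real benchmark would return a screenshot or accessibility tree here.
        return {"text": self.text, "saved": self.saved, "focused": self.focused}

    def step(self, action: dict) -> None:
        # The state changes after every action, so the next observation differs.
        if action["type"] == "click" and action["target"] == "field":
            self.focused = True
        elif action["type"] == "type" and self.focused:
            self.text += action["text"]
        elif action["type"] == "click" and action["target"] == "save":
            self.saved = True


def scripted_policy(obs: dict) -> dict:
    """Hypothetical policy: choose the next action from the current observation."""
    if not obs["focused"]:
        return {"type": "click", "target": "field"}
    if obs["text"] != "hello":
        return {"type": "type", "text": "hello"}
    return {"type": "click", "target": "save"}


def run_episode(env: ToyDesktop, max_steps: int = 10) -> tuple[bool, list[dict]]:
    """Run the observe/act loop and score the final state of the environment."""
    trace = []
    for _ in range(max_steps):
        obs = env.observe()
        if obs["saved"]:
            break
        action = scripted_policy(obs)
        env.step(action)
        trace.append(action)
    # Score what the environment ended up as, not what the agent said.
    success = env.saved and env.text == "hello"
    return success, trace
```

Calling `run_episode(ToyDesktop())` returns a success flag together with the list of actions taken, which doubles as a trace for later inspection.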

Agents and evaluators should preserve, for every run, the task ID, environment image, observation mode, allowed actions, timeouts, reset policy, scoring script, and failure traces. Without those details, a computer-use score cannot explain whether the model failed at perception, planning, tool execution, or UI recovery.
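One way to keep those details together is a per-run record that can be serialized alongside the score. The sketch below is an assumption about how such a record might look, not a format defined by OSWorld: `RunRecord`, its field names, and `summarize_failures` are hypothetical, and the four failure stages simply mirror the categories named above.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Optional

# Failure categories taken from the paragraph above; the identifiers are illustrative.
FAILURE_STAGES = ("perception", "planning", "tool_execution", "ui_recovery")


@dataclass
class RunRecord:
    """Hypothetical record of one benchmark run and the settings that produced it."""
    task_id: str
    environment_image: str
    observation_mode: str
    allowed_actions: list[str]
    timeout_seconds: float
    reset_policy: str
    scoring_script: str
    success: bool
    failure_stage: Optional[str] = None
    failure_trace: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A failed run must name a stage; a successful run must not.
        if self.success and self.failure_stage is not None:
            raise ValueError("successful runs should not carry a failure stage")
        if not self.success and self.failure_stage not in FAILURE_STAGES:
            raise ValueError(f"failure_stage must be one of {FAILURE_STAGES}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def summarize_failures(records: list[RunRecord]) -> dict[str, int]:
    """Count failed runs per stage so a single score can be broken down."""
    return dict(Counter(r.failure_stage for r in records if not r.success))
```

With records like these, an aggregate success rate can be accompanied by a breakdown such as how many failures were attributed to perception versus planning, which is the distinction a bare score hides.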

## Source-Mapped Facts

- The OSWorld paper introduces a benchmark for multimodal agents to perform open-ended computer tasks in real computer environments. ([source](https://arxiv.org/abs/2404.07972))
- The OSWorld repository provides benchmark environment and evaluation code for OSWorld computer-use tasks. ([source](https://github.com/xlang-ai/OSWorld))
- The OSWorld paper frames computer-use evaluation as different from static text benchmarks because agents must interact with a live desktop environment. ([source](https://arxiv.org/abs/2404.07972))

## Further Reading

- [OSWorld Benchmark Paper](https://arxiv.org/abs/2404.07972)
- [OSWorld Repository](https://github.com/xlang-ai/OSWorld)