# LLM Evaluation MTEB Embedding and Retrieval Benchmarks

Status: public
Confidence: medium (0.85) (verified)
Last verified: 2026-06-03
Generation: ai_structured

## TL;DR

MTEB helps agents compare embedding models across retrieval and non-retrieval tasks instead of relying on a single vector-search demo.

## Core Explanation

Embedding models are often chosen for RAG systems, semantic search, clustering, classification, and reranking. A model that performs well on one benchmark or language may not be the right model for another corpus.

Agents should keep the benchmark name, task subset, language, model version, pooling or instruction format, retrieval metric, and application corpus separate. For RAG decisions, MTEB-style results are evidence about representation quality, not proof that generated answers will be grounded or useful.

## Source-Mapped Facts

- The MTEB paper introduces the Massive Text Embedding Benchmark as a benchmark for evaluating text embeddings across diverse tasks. ([source](https://arxiv.org/abs/2210.07316))
- The MTEB paper evaluates embedding models across task categories including retrieval, clustering, classification, reranking, and semantic textual similarity. ([source](https://arxiv.org/abs/2210.07316))
- The MTEB repository describes MTEB as a toolbox for evaluating embeddings and retrieval systems. ([source](https://github.com/embeddings-benchmark/mteb))

## Further Reading

- [MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)
- [MTEB Repository](https://github.com/embeddings-benchmark/mteb)
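
## Illustrative Sketch

The Core Explanation's advice to keep the benchmark name, task subset, language, model version, pooling or instruction format, retrieval metric, and corpus separate can be sketched as a small record type. This is a minimal sketch using only the Python standard library, not part of the MTEB toolbox: the `EmbeddingResult` schema, the `mismatched_context` and `better_of` helpers, the model names, and the scores are all hypothetical placeholders chosen for illustration, not reported results.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class EmbeddingResult:
    """One reported score plus the context needed to interpret it (hypothetical schema)."""
    benchmark: str      # e.g. "MTEB"
    task_subset: str    # e.g. "retrieval"
    language: str       # e.g. "en"
    model_version: str  # exact model identifier and revision
    input_format: str   # pooling or instruction format used when encoding
    metric: str         # e.g. "ndcg_at_10"
    score: float
    corpus: str         # benchmark dataset or application corpus


# Fields that should match before two scores are compared directly.
# Model version and input format are recorded but expected to differ.
CONTEXT_FIELDS = ("benchmark", "task_subset", "language", "metric", "corpus")


def mismatched_context(a: EmbeddingResult, b: EmbeddingResult) -> List[str]:
    """Return the names of context fields on which the two results differ."""
    return [name for name in CONTEXT_FIELDS if getattr(a, name) != getattr(b, name)]


def better_of(a: EmbeddingResult, b: EmbeddingResult) -> Optional[EmbeddingResult]:
    """Return the higher-scoring result, or None when the contexts differ."""
    if mismatched_context(a, b):
        return None
    return a if a.score >= b.score else b


# Placeholder values only; these are not real benchmark numbers.
a = EmbeddingResult("MTEB", "retrieval", "en", "model-a@v1",
                    "mean pooling", "ndcg_at_10", 0.50, "benchmark-dataset")
b = EmbeddingResult("MTEB", "retrieval", "en", "model-b@v2",
                    "query instruction", "ndcg_at_10", 0.55, "benchmark-dataset")
c = replace(b, language="de")  # same model and score, different language

print(better_of(a, b).model_version)  # model-b@v2
print(mismatched_context(a, c))       # ['language']
print(better_of(a, c))                # None
```

Returning `None` rather than a winner when any context field differs is a deliberate choice in this sketch: it forces the caller to notice that a score on one language or corpus says little about another. Even a clean comparison here only ranks representation quality on that benchmark, and says nothing about whether downstream RAG answers are grounded.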