# Sparse Retrieval and BM25

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Sparse retrieval represents text by the tokens it contains and matches queries against those tokens. BM25 is a widely used ranking function that scores documents by how well their terms match the query keywords.

## Core Explanation

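Dense embeddings capture semantic similarity, but sparse retrieval remains important because users often search for exact names, error codes, API symbols, identifiers, dates, and rare technical terms. BM25 scores a document higher when it contains more of the query terms, weights rare terms more heavily through inverse document frequency, lets repeated occurrences of a term contribute with diminishing returns, and normalizes for document length so that long documents are not favored merely for containing more words.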
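To make the roles of term frequency and document length concrete, here is a minimal sketch of a common textbook BM25 variant. The whitespace tokenizer, the function names, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration. Production engines differ in tokenization and in details of the formula, so this is not a reproduction of any particular library's scoring.

```python
import math
from collections import Counter


def tokenize(text):
    # Illustrative tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

In this sketch, a document containing both a rare identifier and a common query term outranks one that contains only the common term, and a document with no query terms scores zero.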

In RAG systems, BM25 is often paired with vector search. Sparse retrieval provides high precision for exact terms, while dense retrieval improves recall for paraphrases. Hybrid retrieval then fuses the two candidate lists, or reranks them, before the results are passed to the language model.
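The text does not specify a fusion method. One commonly used option is reciprocal rank fusion (RRF), sketched below as an illustration rather than as the required approach. The function name, the document IDs, and the constant `k=60` are assumptions. The inputs are ranked lists of document IDs, for example one from BM25 and one from vector search.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by fused score, highest first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical candidate lists from a sparse and a dense retriever.
sparse_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused_hits = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

RRF uses only rank positions, which sidesteps the problem that BM25 scores and vector similarity scores are on different scales.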

## Source-Mapped Facts

- Elasticsearch similarity documentation lists BM25 as a built-in similarity and the default similarity for text scoring. ([source](https://www.elastic.co/docs/reference/elasticsearch/index-settings/similarity))
- Apache Lucene documentation provides a BM25Similarity class for implementing BM25-based relevance scoring. ([source](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/similarities/BM25Similarity.html))
- Weaviate keyword search documentation describes BM25 as a keyword search algorithm that ranks objects by keyword relevance. ([source](https://docs.weaviate.io/weaviate/concepts/search/keyword-search))

## Further Reading

- [Elasticsearch similarity settings](https://www.elastic.co/docs/reference/elasticsearch/index-settings/similarity)
- [Apache Lucene BM25Similarity](https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/similarities/BM25Similarity.html)
- [Weaviate keyword search](https://docs.weaviate.io/weaviate/concepts/search/keyword-search)