# RAG Contextual Compression

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

RAG contextual compression narrows retrieved context after initial retrieval so the generator sees fewer, more relevant passages.

## Core Explanation

Naive RAG often sends every top-k chunk to the model, regardless of how relevant each chunk is to the query. Contextual compression inserts a post-retrieval step that filters, transforms, or reranks the retrieved material against the current query. The goal is to preserve answer-critical evidence while reducing distraction and context-window cost.
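The post-retrieval step described above can be sketched in a few lines. This is a minimal illustration, not any library's implementation: `Chunk`, `overlap_score`, and `compress` are hypothetical names, and the token-overlap score is a deliberately crude stand-in for a real relevance model such as a cross-encoder reranker.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Chunk:
    doc_id: str
    text: str


def _tokens(text: str) -> Set[str]:
    # Lowercased alphanumeric tokens; a toy tokenizer for illustration only.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, chunk: Chunk) -> float:
    """Fraction of query tokens that also appear in the chunk (assumed scorer)."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & _tokens(chunk.text)) / len(query_tokens)


def compress(query: str, chunks: List[Chunk], top_n: int = 2,
             min_score: float = 0.2) -> List[Chunk]:
    """Filter out low-scoring chunks, then keep the top_n by score."""
    scored = [(overlap_score(query, c), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s >= min_score]   # filter
    kept.sort(key=lambda pair: pair[0], reverse=True)      # rerank
    return [c for _, c in kept[:top_n]]                    # truncate


# Pretend these four chunks came back from a first-stage retriever.
retrieved = [
    Chunk("a", "The warranty covers battery replacement for two years."),
    Chunk("b", "Our office is closed on public holidays."),
    Chunk("c", "Battery replacement requires proof of purchase."),
    Chunk("d", "Shipping takes three to five business days."),
]
query = "How long does the warranty cover battery replacement?"

compressed = compress(query, retrieved)
print([c.doc_id for c in compressed])  # -> ['a', 'c']
```

In practice the scorer is the part worth swapping out: the filter, rerank, and truncate stages stay the same, while `overlap_score` would be replaced by whatever relevance model the pipeline uses.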

Compression is not a substitute for first-stage recall. If the first-stage retriever misses the right document, compression cannot recover it, because it only operates on what was already retrieved. A good evaluation plan therefore measures both retrieval recall before compression and answer quality after compression.
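The retrieval side of that evaluation can be sketched as follows. The helper names (`recall`, `diagnose`) and the labelled document IDs are hypothetical. The sketch only compares evidence recall before and after compression; answer quality itself still needs a separate grading step that is not shown here.

```python
from typing import Dict, Iterable, Set


def recall(retrieved_ids: Iterable[str], relevant_ids: Set[str]) -> float:
    """Share of the labelled relevant documents that appear in retrieved_ids."""
    if not relevant_ids:
        return 0.0
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)


def diagnose(before_ids: Iterable[str], after_ids: Iterable[str],
             relevant_ids: Set[str]) -> Dict[str, float]:
    """Report recall before and after compression, plus the evidence lost in between."""
    before = recall(before_ids, relevant_ids)
    after = recall(after_ids, relevant_ids)
    return {"recall_before": before, "recall_after": after, "lost": before - after}


relevant = {"a", "c"}  # assumed gold labels for one query

# Case 1: the retriever never found "c", so compression cannot bring it back.
retriever_miss = diagnose(["a", "b", "d", "e"], ["a"], relevant)

# Case 2: the retriever found everything, but compression dropped "c".
compressor_loss = diagnose(["a", "c", "b"], ["a"], relevant)

print(retriever_miss)
print(compressor_loss)
```

Both cases end with the same recall after compression, but the `lost` value separates them: zero points at the retriever, a positive value points at the compression step.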

## Source-Mapped Facts

- LangChain documentation shows a `CrossEncoderReranker` used with `ContextualCompressionRetriever` to rerank retrieved documents. ([source](https://docs.langchain.com/oss/python/integrations/document_transformers/cross_encoder_reranker))
- LlamaIndex documentation says node postprocessors take a set of nodes and apply transformation, filtering, or re-ranking logic. ([source](https://developers.llamaindex.ai/python/framework/module_guides/querying/node_postprocessors/))
- Cohere reranking documentation describes using a rerank model to return the most relevant documents for a query. ([source](https://docs.cohere.com/v2/docs/reranking-with-cohere))

## Further Reading

- [LangChain Cross Encoder Reranker Integration](https://docs.langchain.com/oss/python/integrations/document_transformers/cross_encoder_reranker)
- [LlamaIndex Node Postprocessor](https://developers.llamaindex.ai/python/framework/module_guides/querying/node_postprocessors/)
- [Reranking with Cohere](https://docs.cohere.com/v2/docs/reranking-with-cohere)