Late-Interaction Retrieval and ColBERT

Status: public · Confidence: medium (0.86) · Basis: verified_sources

## TL;DR

Late-interaction retrieval keeps token-level matching signals instead of compressing each document into a single vector, so agents choosing a retriever need to weigh recall and ranking quality against index cost.

## Core Explanation

Dense single-vector retrieval compresses each document into one vector. Late-interaction approaches such as ColBERT preserve multiple contextual token embeddings and score fine-grained query-document interactions at retrieval time.
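As a rough illustration of token-level scoring, the sketch below implements a MaxSim-style score, the operator used in the ColBERT paper: for each query token, take the highest similarity against any document token, then sum over query tokens. The function names (`maxsim_score`, `rank`) and the toy two-dimensional embeddings are hypothetical. A real system would obtain embeddings from a trained encoder and would not score every document exhaustively.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best cosine match among document tokens."""
    if not doc_tokens:
        return 0.0
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank(query_tokens: Sequence[Vector],
         docs: Dict[str, Sequence[Vector]]) -> List[Tuple[str, float]]:
    """Score every document exhaustively (toy setting only) and sort by score."""
    scored = [(doc_id, maxsim_score(query_tokens, toks)) for doc_id, toks in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical token embeddings; real ones come from an encoder.
    query = [[1.0, 0.0], [0.0, 1.0]]
    docs = {
        "covers_both_terms": [[1.0, 0.0], [0.0, 1.0]],
        "covers_one_term": [[2.0, 0.0]],
    }
    print(rank(query, docs))
```

A single-vector retriever would see one similarity value per document. Here each query token contributes its own best match, which is why a document covering only one of the two query terms scores lower.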

Agents should treat late interaction as a retrieval architecture with operational tradeoffs. It can improve matching fidelity, but storing many vectors per document enlarges the index, so compression, pruning, and available hardware determine whether it is practical for a given RAG workload.
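To make the index-cost tradeoff concrete, here is a back-of-the-envelope estimate of raw vector storage for the two designs. Every number in it (corpus size, tokens per document, dimensions, bit widths) is an assumption chosen for illustration, not a measurement from the cited papers, and the estimate ignores index overhead such as document IDs and search structures.

```python
def single_vector_index_bytes(num_docs: int, dim: int, bits_per_value: int) -> int:
    """Raw storage for one embedding per document."""
    return num_docs * dim * bits_per_value // 8


def multi_vector_index_bytes(num_docs: int, avg_tokens_per_doc: int,
                             dim: int, bits_per_value: int) -> int:
    """Raw storage for one embedding per document token."""
    return num_docs * avg_tokens_per_doc * dim * bits_per_value // 8


if __name__ == "__main__":
    # Hypothetical workload: 1M passages, about 100 tokens each.
    num_docs = 1_000_000
    single = single_vector_index_bytes(num_docs, dim=768, bits_per_value=32)
    multi = multi_vector_index_bytes(num_docs, avg_tokens_per_doc=100,
                                     dim=128, bits_per_value=16)
    # Same multi-vector index if each value were squeezed to 2 bits.
    multi_2bit = multi_vector_index_bytes(num_docs, avg_tokens_per_doc=100,
                                          dim=128, bits_per_value=2)
    for label, size in [("single-vector", single),
                        ("multi-vector", multi),
                        ("multi-vector, 2-bit", multi_2bit)]:
        print(f"{label}: {size / 1e9:.1f} GB")
```

Under these assumptions the uncompressed multi-vector index is several times larger than the single-vector one, and lowering the bits per value closes most of that gap. That is the kind of arithmetic worth running against a workload's own document counts and lengths before committing to the architecture.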

## Source-Mapped Facts

- The ColBERT paper introduces a late-interaction architecture for efficient and effective passage search over BERT representations. ([source](https://arxiv.org/abs/2004.12832))
- The ColBERTv2 paper describes residual compression and denoised supervision for efficient and effective retrieval. ([source](https://aclanthology.org/2022.naacl-main.272/))
- The PLAID paper describes an engine for efficient retrieval with late-interaction models. ([source](https://doi.org/10.1145/3511808.3557325))
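The residual compression mentioned for ColBERTv2 can be pictured with a simplified sketch: store the ID of the nearest centroid plus a coarsely quantized residual, and reconstruct an approximation at search time. The uniform quantizer, the `max_abs` clipping range, and all function names below are assumptions made for illustration. They are not the paper's exact scheme, and the centroids here are hand-written rather than learned.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(v: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance to v."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sqdist(v, centroids[i]))


def compress(v: Vector, centroids: Sequence[Vector],
             bits: int = 2, max_abs: float = 0.5) -> Tuple[int, List[int]]:
    """Encode v as (centroid id, one small integer code per dimension)."""
    cid = nearest_centroid(v, centroids)
    levels = 2 ** bits
    step = 2 * max_abs / levels
    codes = []
    for x, c in zip(v, centroids[cid]):
        residual = min(max(x - c, -max_abs), max_abs)  # clip to the quantizer range
        code = int((residual + max_abs) // step)
        codes.append(min(code, levels - 1))  # keep the top edge in the last bucket
    return cid, codes


def decompress(cid: int, codes: Sequence[int], centroids: Sequence[Vector],
               bits: int = 2, max_abs: float = 0.5) -> List[float]:
    """Rebuild an approximation: centroid plus the midpoint of each residual bucket."""
    step = 2 * max_abs / (2 ** bits)
    return [c + (-max_abs + (code + 0.5) * step)
            for c, code in zip(centroids[cid], codes)]


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical, normally learned from data
    token_embedding = [0.9, 1.2]
    cid, codes = compress(token_embedding, centroids)
    print(cid, codes, decompress(cid, codes, centroids))
```

With 2 bits per dimension, each stored value costs a fraction of a float, at the price of a bounded reconstruction error (at most half a bucket width for residuals inside the clipping range).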

## Further Reading

- [ColBERT Late Interaction Retrieval](https://arxiv.org/abs/2004.12832)
- [ColBERTv2 Efficient and Effective Retrieval](https://aclanthology.org/2022.naacl-main.272/)
- [PLAID Late Interaction Retrieval Engine](https://doi.org/10.1145/3511808.3557325)