# Vector Index Quantization for Retrieval

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Vector quantization is a retrieval infrastructure technique that stores index vectors at lower numeric precision. It shrinks the index and can speed up search, usually at the cost of some precision or recall.

## Core Explanation

RAG systems that scale to millions or billions of chunks must balance vector memory, latency, and recall. Quantization stores each vector component in fewer bits (for example, 8-bit integers instead of 32-bit floats), so the index takes less memory and can be searched faster.
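
To make the memory trade-off concrete, here is a minimal sketch of 8-bit scalar quantization in plain Python. It is an illustration, not how any particular vector database implements it: the function names, the fixed value range of [-1, 1], and the 384-dimensional vector are all assumptions chosen for the example.

```python
import random

def quantize_8bit(vec, lo, hi):
    """Map each float in [lo, hi] to an integer code in [0, 255]."""
    scale = (hi - lo) / 255.0
    codes = []
    for x in vec:
        clipped = min(max(x, lo), hi)  # values outside the range are clamped
        codes.append(round((clipped - lo) / scale))
    return bytes(codes), scale

def dequantize_8bit(codes, lo, scale):
    """Approximate the original floats from the stored codes."""
    return [lo + c * scale for c in codes]

random.seed(0)
dim = 384  # assumed embedding size, for illustration only
vec = [random.uniform(-1.0, 1.0) for _ in range(dim)]

codes, scale = quantize_8bit(vec, -1.0, 1.0)
restored = dequantize_8bit(codes, -1.0, scale)

float32_bytes = dim * 4   # 4 bytes per component as 32-bit floats
code_bytes = len(codes)   # 1 byte per component after quantization
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(f"float32: {float32_bytes} bytes, 8-bit codes: {code_bytes} bytes")
print(f"max reconstruction error: {max_error:.5f} (step size {scale:.5f})")
```

Each component shrinks from four bytes to one, and the reconstruction error per component is bounded by half the step size. That rounding error is where the loss of precision comes from.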

The practical question for an agent is whether retrieval quality still meets the task's needs after quantization. Agents should compare recall, grounded answer quality, cost, and latency before and after quantization, especially when changing embedding models or index settings.
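
One way to run the recall part of that comparison is to treat exact search over the original vectors as ground truth and measure how much of its top-k result set survives quantization. The sketch below uses random vectors and a hypothetical `coarsen` helper as a stand-in for a real index's quantizer. In practice the two result lists would come from the actual index before and after the settings change, using real queries.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, vectors, k):
    """Brute-force search: ids of the k vectors with the highest dot product."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: dot(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

def coarsen(vec, levels=256):
    """Hypothetical quantizer stand-in: snap values in [-1, 1] to a fixed grid."""
    step = 2.0 / (levels - 1)
    return [round((min(max(x, -1.0), 1.0) + 1.0) / step) * step - 1.0
            for x in vec]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

random.seed(7)
dim, n, k = 32, 500, 10  # small synthetic sizes, for illustration only
corpus = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
queries = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(20)]
quantized = [coarsen(v) for v in corpus]

recalls = [recall_at_k(top_k(q, corpus, k), top_k(q, quantized, k))
           for q in queries]
mean_recall = sum(recalls) / len(recalls)
print(f"mean recall@{k}: {mean_recall:.3f}")
```

Recall on synthetic data says little about a production corpus, so the number printed here matters less than the procedure. The same before-and-after comparison should be repeated for answer quality, cost, and latency on representative queries.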

## Source-Mapped Facts

- Qdrant quantization documentation says quantization can reduce vector memory usage and improve search speed at the cost of precision. ([source](https://qdrant.tech/documentation/manage-data/quantization/))
- OpenSearch vector quantization documentation describes quantization as reducing vector precision to lower memory and storage usage. ([source](https://docs.opensearch.org/latest/vector-search/optimizing-storage/knn-vector-quantization/))
- Elasticsearch dense vector documentation includes quantized index types for dense vector fields. ([source](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html))
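
As an illustration of what a quantized index type can look like in configuration, the sketch below builds an Elasticsearch index mapping as a Python dictionary, using the `int8_hnsw` index option for a `dense_vector` field. The field name `chunk_embedding`, the 384 dimensions, and the cosine similarity are assumptions made for the example. Available options and defaults vary by Elasticsearch version, so check the linked documentation before relying on it.

```python
import json

# Hypothetical mapping body for creating an index. The field name and
# dimension count are illustrative assumptions, not values from the sources.
mapping = {
    "mappings": {
        "properties": {
            "chunk_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                # Request an HNSW index over int8-quantized vectors.
                "index_options": {"type": "int8_hnsw"},
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Qdrant and OpenSearch expose quantization through their own configuration formats, which are described in the sources above.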

## Further Reading

- [Qdrant Quantization](https://qdrant.tech/documentation/manage-data/quantization/)
- [OpenSearch Vector Quantization](https://docs.opensearch.org/latest/vector-search/optimizing-storage/knn-vector-quantization/)
- [Elasticsearch Dense Vector](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html)