Vector Databases: Approximate Nearest Neighbor Search, Embedding Storage, and Retrieval at Scale

## TL;DR
Vector databases are the storage and retrieval layer behind many modern AI applications, including RAG (Retrieval-Augmented Generation), semantic search, and recommendation. They store embeddings (numerical representations of text, images, audio) and use approximate nearest neighbor (ANN) search to find the most similar items, often within milliseconds, in collections of up to billions of vectors.

## Core Explanation
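Why vector databases: LLMs and embedding models convert unstructured data (text, images, audio) into fixed-length dense vectors (embeddings), typically with 768 to 4096 dimensions. Semantic similarity is then measured as cosine similarity or Euclidean distance between vectors. The problem: exact k-nearest-neighbor search over N vectors of dimension d compares the query against every stored vector, which costs O(Nd) operations per query and becomes prohibitively slow for interactive use once N grows past roughly 1M. ANN algorithms trade some recall for a dramatic speedup:

1. Graph-based: HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph. Search greedily follows graph edges from the sparse top layer down to the dense bottom layer and typically reaches an approximate nearest neighbor in roughly O(log N) steps. Widely used (FAISS, hnswlib).
2. Quantization-based: product quantization (PQ) compresses vectors by splitting each one into subvectors, clustering each subspace, and storing only the ID of the nearest centroid per subvector. This typically achieves 8-32x compression; for example, a 768-dimensional float32 vector (3,072 bytes) stored as 96 one-byte codes is 32x smaller.
3. Tree-based: KD-trees and randomized projection trees (the approach used by Annoy).
4. Hash-based: locality-sensitive hashing (LSH) maps similar vectors to the same bucket with high probability.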
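To make the exact-versus-approximate trade-off concrete, here is a minimal sketch in pure Python. It compares an exact O(Nd) scan with a single-table random-hyperplane LSH index, which is one LSH family suited to cosine similarity. All function names are illustrative and not taken from any library, vectors are assumed to be non-zero, and production systems would use several hash tables and vectorized math rather than Python loops.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Exact search: score all N vectors (O(N*d) work), return top-k indices."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_bucket(vector, hyperplanes):
    """Bucket key: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(x * h for x, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_lsh_index(vectors, hyperplanes):
    """Group vector indices by bucket key."""
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh_bucket(v, hyperplanes), []).append(i)
    return buckets


def lsh_knn(query, vectors, buckets, hyperplanes, k):
    """Approximate search: score only the candidates in the query's bucket."""
    candidates = buckets.get(lsh_bucket(query, hyperplanes), [])
    subset = [vectors[i] for i in candidates]
    return [candidates[j] for j in exact_knn(query, subset, k)]
```

Because `lsh_knn` scores only the vectors that share the query's bucket, it does far less work than `exact_knn`, but a true neighbor that falls on the other side of any hyperplane is missed. That is the recall-for-speed trade described above; using fewer bits or more tables raises recall at the cost of more candidates to score.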

## Detailed Analysis
Leading vector databases (DataCamp 2026): Pinecone (managed cloud, proprietary index), Weaviate (open-source, hybrid vector+keyword), Milvus (cloud-native, distributed), Qdrant (Rust-based, high-performance), Chroma (lightweight, developer-friendly), pgvector (PostgreSQL extension). FAISS (Meta, open-source) is a similarity-search library rather than a full database, and it remains a common reference point for research and custom deployments; its GPU-accelerated IVF+PQ indexes are reported to handle 1B+ vectors with <5ms latency.

Microsoft DiskANN (originally published in 2019): a breakthrough in SSD-based indexes. Traditional in-memory ANN indexes keep all vectors in RAM, which means hundreds of GB at billion scale (one billion 128-dimensional float32 vectors alone occupy about 512 GB). DiskANN places the graph index and full-precision vectors on SSD, using a data layout designed to minimize random reads, and keeps only compressed vectors in RAM. This achieves near-RAM latency at roughly 10x lower cost. DSANN (2025): distributes billion-scale indexes across hundreds of machines, reporting linear scalability and high availability.

Applications:

1. RAG: retrieve relevant documents for LLM context.
2. Semantic search: users query by meaning, not keywords.
3. Multimodal search: text-to-image and image-to-image retrieval.
4. Recommendation: find items similar to a user's interests.
5. Deduplication: find near-duplicate content across massive corpora.

Key challenges:

1. Filtered search: combining vector similarity with structured metadata filters (price range, date, category) without destroying ANN performance or recall.
2. Freshness: inserting new vectors requires either rebuilding the index in batches or inserting online, which gradually degrades index quality.
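The filtered-search challenge can be illustrated with a small sketch that contrasts two simple strategies: filtering after the vector search versus filtering before it. This is an illustration under stated assumptions, not how any particular database implements filtering. The `top_k` function is an exact scan standing in for an ANN index, and the item fields and function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, items, k):
    """Stand-in for an ANN index: exact top-k by Euclidean distance."""
    return sorted(items, key=lambda it: euclidean(query, it["vector"]))[:k]


def post_filter_search(query, items, k, predicate):
    """Search first, then drop hits that fail the metadata filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]


def pre_filter_search(query, items, k, predicate):
    """Filter first, then search only among the matching items."""
    return top_k(query, [it for it in items if predicate(it)], k)


def is_cheap(item):
    """Hypothetical metadata filter: price below 20."""
    return item["price"] < 20


# Toy catalog: the two items closest to the query both fail the filter.
items = [
    {"id": "a", "vector": [0.0, 0.0], "price": 90},
    {"id": "b", "vector": [0.1, 0.0], "price": 80},
    {"id": "c", "vector": [0.2, 0.0], "price": 15},
    {"id": "d", "vector": [0.9, 0.9], "price": 10},
]
query = [0.0, 0.0]

post = post_filter_search(query, items, k=2, predicate=is_cheap)
pre = pre_filter_search(query, items, k=2, predicate=is_cheap)
print([it["id"] for it in post])  # [] : both nearest hits were filtered out
print([it["id"] for it in pre])   # ['c', 'd']
```

Post-filtering keeps the index untouched but can return fewer than k results when the filter is selective. Pre-filtering always returns matching items, but the filtered subset generally cannot reuse an index built over the full collection, so a naive implementation falls back to a scan. Filter-aware indexes aim to avoid both problems.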

## Further Reading
- FAISS: Facebook AI Similarity Search (Meta)
- Pinecone, Weaviate, Milvus, Qdrant Vector Databases
- ANN-Benchmarks: Benchmarking ANN algorithms