Learned Database Systems: AI-Driven Query Optimization, Learned Indexes, and Cardinality Estimation

## TL;DR
Learned database systems replace or augment decades of hand-tuned database heuristics with machine learning models. Learned indexes use small models in place of B-tree traversal, and learned query optimizers can outperform expert-designed cost models on some workloads. Both challenge the assumption that general-purpose data structures and algorithms are the best choice for a specific data distribution.

## Core Explanation
The insight driving learned databases is that classical data structures (B-trees, hash tables, Bloom filters) are general-purpose: they make no assumptions about the data distribution. Real data has structure, though. Employee IDs are often sequential, timestamps increase monotonically, and some values are far more common than others. Learned components exploit this structure:

1. Learned index: instead of an O(log N) traversal of B-tree internal nodes, train a small model (a linear model or a small neural network) that maps a key to an approximate position in the sorted data. Given key k, the model predicts position f(k); the true position lies within a recorded error bound of that prediction, so a binary search over the error range finds the exact slot. The model is effectively learning the cumulative distribution function (CDF) of the keys: P(X <= k) * N approximates the position of k among N sorted keys.
2. Learned cardinality estimation: given a query such as "age > 30 AND city = Boston", estimate how many rows match. Traditional estimators combine per-column histograms under an independence assumption, which is wrong for correlated attributes. Learned estimators use models such as deep-sets networks or GNNs, trained on a query workload, to capture the correlations that the independence assumption misses.
3. Learned query optimizer: replace the hand-built cost model with a neural network that predicts query latency. Learned join ordering can beat heuristic search such as genetic algorithms.
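The learned-index lookup described above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes a single least-squares linear model stands in for the neural network, and that the keys are numeric, non-empty, and held in memory. The class name `LinearLearnedIndex` and its methods are hypothetical and not taken from any library.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model approximating key -> position."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumes a non-empty collection of numbers
        n = len(self.keys)
        # Fit position ~= slope * key + intercept by ordinary least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst prediction error seen on the indexed keys; every
        # lookup only needs to search this far around the prediction.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)  # clamp into the array

    def lookup(self, key):
        """Return the position of key, or None if it is not indexed."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Binary search restricted to the error window around the guess.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 57, 91])
print(index.lookup(23), index.lookup(4), index.max_err)
```

The closer the key distribution is to the model's shape, the smaller `max_err` becomes. For perfectly sequential keys the linear fit is exact and the search window shrinks to a single slot, which is the case where a learned index gains the most over a general-purpose B-tree.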

## Detailed Analysis
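RMI (Recursive Model Index) is a hierarchy of simple models: the stage 1 model does broad positioning and stage 2 models refine it. The bottom-level models predict an approximate position together with error bounds, and a bounded local search then finds the exact slot. In the original evaluation (Kraska et al., SIGMOD 2018), RMI was reported to be up to about 3x faster than a B-tree on read-only workloads and up to about two orders of magnitude smaller in its best-case configurations. PGM-Index uses piecewise linear models to provide provable worst-case error bounds, and it supports inserts.

For cardinality estimation, MSCN (Multi-Set Convolutional Network) represents a query as sets of tables, joins, and predicates, so the model is invariant to the order of elements within each set. Naru uses deep autoregressive models to learn the joint distribution of a single table, and NeuroCard extends the approach to multiple tables by training on samples of the join. Accuracy is usually reported as q-error, the factor by which an estimate is off: max(estimate / actual, actual / estimate). Reported q-errors for these estimators are under 3x on benchmarks where PostgreSQL's estimates can be off by more than 100x.

Among learned query optimizers, Bao (Marcus et al., SIGMOD 2021) keeps the existing optimizer and uses a Thompson sampling bandit to learn, per query, which set of PostgreSQL optimizer hints works best; the optimizer then generates the plan under those hints. Neo (Marcus et al., VLDB 2019) is a fully learned query optimizer that uses a deep neural network for cost estimation and a value-iteration-style loop for plan search.

A 2025 Springer survey highlights the "last mile" challenge: learned components work in research prototypes, but integrating them into a production DBMS (PostgreSQL, MySQL, Oracle) requires solving concurrency, crash recovery, and ACID compatibility. Recent progress includes PostgreSQL hooks for learned cardinality estimation and DuckDB's extensible optimizer, which enables ML integration.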
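To make the q-error figures above concrete, the following sketch computes q-error for an estimator that multiplies per-predicate selectivities, as a histogram-based estimator does under the independence assumption. The table is synthetic and deliberately extreme (age and city are perfectly correlated), and the helper names `q_error`, `independence_estimate`, and `true_cardinality` are hypothetical.

```python
def q_error(estimate, actual):
    """Factor by which an estimate is off; 1.0 means a perfect estimate."""
    est = max(estimate, 1.0)  # clamp to one row to avoid dividing by zero
    act = max(actual, 1.0)
    return max(est / act, act / est)


def independence_estimate(rows, predicates):
    """Multiply per-predicate selectivities, ignoring any correlation."""
    n = len(rows)
    selectivity = 1.0
    for pred in predicates:
        selectivity *= sum(1 for r in rows if pred(r)) / n
    return selectivity * n


def true_cardinality(rows, predicates):
    """Count the rows that satisfy every predicate."""
    return sum(1 for r in rows if all(p(r) for p in predicates))


# Synthetic table (an assumption for illustration): everyone over 30 lives
# in Boston, so the two predicates select exactly the same 100 rows.
rows = [{"age": 40, "city": "Boston"}] * 100 + [{"age": 25, "city": "Denver"}] * 900
predicates = [lambda r: r["age"] > 30, lambda r: r["city"] == "Boston"]

estimate = independence_estimate(rows, predicates)
actual = true_cardinality(rows, predicates)
print(f"estimate={estimate:.1f} actual={actual} q-error={q_error(estimate, actual):.1f}")
```

Each predicate alone selects 10% of the rows, so the independence estimate is about 10 rows, while 100 rows actually match: a q-error of roughly 10 from a single pair of correlated columns. Errors of this kind compound across joins, which is where the gap between classical and learned estimators tends to be largest.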
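The Thompson sampling idea behind Bao's hint selection can be illustrated with a much simpler bandit. The sketch below is a simplification rather than Bao's actual method. Bao predicts plan latency with a tree convolutional neural network and conditions on the query, whereas this version ignores the query, treats each hint set as an independent arm, and uses a Beta-Bernoulli model with a binary "query was fast" reward. The hint-set labels and the simulated probabilities are made up for illustration.

```python
import random


class HintSetBandit:
    """Beta-Bernoulli Thompson sampling over a fixed list of hint sets."""

    def __init__(self, hint_sets, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm: one pseudo-success and one pseudo-failure.
        self.fast = {h: 1 for h in hint_sets}
        self.slow = {h: 1 for h in hint_sets}

    def choose(self):
        # Sample a plausible success rate for each arm and pick the best one.
        # Arms with little data have wide posteriors and so still get explored.
        draws = {
            h: self.rng.betavariate(self.fast[h], self.slow[h])
            for h in self.fast
        }
        return max(draws, key=draws.get)

    def update(self, hint_set, was_fast):
        if was_fast:
            self.fast[hint_set] += 1
        else:
            self.slow[hint_set] += 1


def simulate(fast_prob, rounds=2000, seed=0):
    """Run the bandit against simulated feedback; return pulls per arm."""
    bandit = HintSetBandit(list(fast_prob), seed=seed)
    env = random.Random(seed + 1)  # stands in for actually running queries
    pulls = {h: 0 for h in fast_prob}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, env.random() < fast_prob[arm])
    return pulls


# Hypothetical hint sets with made-up probabilities of a "fast" execution.
pulls = simulate({
    "default": 0.5,
    "disable_nested_loop": 0.9,
    "disable_hash_join": 0.2,
})
print(pulls)
```

Because every candidate plan still comes from the existing optimizer, a bad choice costs at most one slow plan that the optimizer could have produced anyway. That property is a large part of why this design is easier to deploy than a fully learned optimizer.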