Parquet Page Indexes and Bloom Filters

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

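Parquet page indexes and Bloom filters give data agents finer-grained evidence than row-group statistics for whether a selective query can skip pages inside a row group, or skip a column chunk entirely when a Bloom filter rules the value out.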

## Core Explanation

Row-group statistics are not the only Parquet skipping surface. Page-level indexes and Bloom filters can help readers avoid touching data pages for selective predicates, especially when a scan targets a small subset of rows or high-cardinality values.
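The page-skipping idea can be illustrated with a minimal sketch. It assumes each page carries only a min value, a max value, and a first-row position; `PageStats` and `pages_to_read` are hypothetical names for illustration, not the Thrift layout of `ColumnIndex` and `OffsetIndex` or any reader's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page summary: min/max values plus the page's first row."""
    min_value: int
    max_value: int
    first_row_index: int


def pages_to_read(pages: List[PageStats], lo: int, hi: int) -> List[int]:
    """Return indexes of pages whose [min, max] range overlaps [lo, hi]."""
    keep = []
    for i, page in enumerate(pages):
        # A page can be skipped only when its value range cannot match.
        if page.max_value < lo or page.min_value > hi:
            continue
        keep.append(i)
    return keep


# Sorted column: each page covers a narrow, disjoint value range.
sorted_pages = [
    PageStats(0, 99, 0),
    PageStats(100, 199, 1000),
    PageStats(200, 299, 2000),
]
# Unsorted column: every page spans nearly the full value range.
unsorted_pages = [
    PageStats(0, 298, 0),
    PageStats(1, 299, 1000),
    PageStats(2, 297, 2000),
]

print(pages_to_read(sorted_pages, 150, 150))    # [1]
print(pages_to_read(unsorted_pages, 150, 150))  # [0, 1, 2]
```

In this toy setup the same point lookup reads one page when the column is sorted and every page when it is not, which is why sort order belongs in the evidence an agent records.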

Agents should capture whether files were written with page indexes or Bloom filters, which columns have them, the reader engine, predicate shape, sort order, and observed bytes scanned. Missing page-level metadata can turn a point lookup into a wider column scan even when row-group metadata exists.
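One way to keep these observations together is a small record type. The sketch below is an assumption about how an agent might structure its notes; `SkippingEvidence`, its field names, and the sample values are all hypothetical and not part of any Parquet library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SkippingEvidence:
    """Hypothetical record of the facts worth capturing for one query."""
    reader_engine: str
    predicate: str
    page_index_columns: List[str] = field(default_factory=list)
    bloom_filter_columns: List[str] = field(default_factory=list)
    sort_order: List[str] = field(default_factory=list)
    bytes_scanned: Optional[int] = None

    def missing_page_metadata(self, column: str) -> bool:
        """True when the column has neither a page index nor a Bloom filter."""
        return (column not in self.page_index_columns
                and column not in self.bloom_filter_columns)


# Illustrative values only; none of these come from a real workload.
evidence = SkippingEvidence(
    reader_engine="example-engine",
    predicate="user_id = 42",
    page_index_columns=["event_time"],
    sort_order=["event_time"],
    bytes_scanned=512_000_000,
)
print(evidence.missing_page_metadata("user_id"))  # True
```

A `True` result here flags the case described above: the filtered column has no page-level metadata, so a point lookup may widen into a scan of the whole column chunk.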

## Source-Mapped Facts

- Apache Parquet documentation describes a page index as optional ColumnChunk metadata containing DataPage statistics that can be used to skip pages during scans. ([source](https://parquet.apache.org/docs/file-format/pageindex/))
- Apache Parquet documentation says the page index adds ColumnIndex and OffsetIndex structures to row group metadata. ([source](https://parquet.apache.org/docs/file-format/pageindex/))
- Apache Parquet documentation says Bloom filters can enable predicate pushdown for high-cardinality columns. ([source](https://parquet.apache.org/docs/file-format/bloomfilter/))
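To make the Bloom filter behavior concrete, here is a toy membership filter. It is a sketch under stated assumptions: `ToyBloomFilter` is a hypothetical class that uses SHA-256 from the standard library for hashing, and it does not reproduce the filter layout or hash function that the Parquet format specifies. It only shows the property readers rely on: a negative answer is definitive, while a positive answer may be a false positive.

```python
import hashlib
from typing import Iterator


class ToyBloomFilter:
    """Toy Bloom filter for illustration; not Parquet's on-disk format."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive one bit position per seed by hashing "seed:value".
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


bf = ToyBloomFilter()
for user_id in ["u-1001", "u-1002", "u-1003"]:
    bf.add(user_id)

print(bf.might_contain("u-1002"))                # True: inserted values are never missed
print(ToyBloomFilter().might_contain("u-1002"))  # False: an empty filter rules everything out
```

A reader that gets `False` for an equality predicate can skip the data the filter covers without decoding it, which is what makes the structure useful for columns whose values are too numerous for min/max statistics to exclude.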

## Further Reading

- [Apache Parquet Page Index](https://parquet.apache.org/docs/file-format/pageindex/)
- [Apache Parquet Bloom Filter](https://parquet.apache.org/docs/file-format/bloomfilter/)