Data Parquet Row Groups and Statistics

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

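Row group layout and column statistics determine how much of a Parquet file a scan can skip and how well the work parallelizes. Agents can use them to explain why a query skips data, splits cleanly across workers, or unexpectedly reads too much.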

## Core Explanation

Parquet stores data in a hierarchy that query engines can exploit: files contain row groups, row groups contain column chunks, and column chunks contain pages. This lets engines read only the column chunks a query needs and, when metadata such as min/max statistics or the page index rules out a match, skip whole row groups or pages.
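The hierarchy can be sketched as a small in-memory model. This is a simplified illustration, not the Parquet on-disk format or any library's API: the class names (`Page`, `ColumnChunk`, `RowGroup`, `ParquetFileModel`) and the helpers `build_file` and `read_columns` are hypothetical, and real files also carry encodings, compression, and footer metadata that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    values: list


@dataclass
class ColumnChunk:
    column: str
    pages: list  # list of Page


@dataclass
class RowGroup:
    num_rows: int
    chunks: dict  # column name -> ColumnChunk


@dataclass
class ParquetFileModel:
    row_groups: list  # list of RowGroup


def build_file(rows, columns, rows_per_group, values_per_page):
    """Split row-oriented records into row groups, column chunks, and pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        chunks = {}
        for col in columns:
            values = [record[col] for record in batch]
            pages = [
                Page(values[i:i + values_per_page])
                for i in range(0, len(values), values_per_page)
            ]
            chunks[col] = ColumnChunk(col, pages)
        groups.append(RowGroup(len(batch), chunks))
    return ParquetFileModel(groups)


def read_columns(file, wanted):
    """Column projection: touch only the chunks of the requested columns."""
    out = {col: [] for col in wanted}
    chunks_read = 0
    for row_group in file.row_groups:
        for col in wanted:
            chunks_read += 1
            for page in row_group.chunks[col].pages:
                out[col].extend(page.values)
    return out, chunks_read


rows = [{"id": i, "city": f"c{i % 3}", "amount": i * 10} for i in range(10)]
model = build_file(rows, ["id", "city", "amount"],
                   rows_per_group=4, values_per_page=2)
data, chunks_read = read_columns(model, ["amount"])
total_chunks = sum(len(rg.chunks) for rg in model.row_groups)
print(f"read {chunks_read} of {total_chunks} column chunks")
```

In this model, a query that needs only `amount` touches one column chunk per row group and never reads the `id` or `city` chunks, which is the benefit of the columnar layout inside each row group.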

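Agents diagnosing slow lakehouse queries should inspect row group count, row group size, column statistics, page indexes, sort order, compression, and predicate pushdown. A table can use the right file format and still perform poorly: very small row groups add metadata and scheduling overhead relative to the data they hold, data that is not sorted or clustered on the filter column produces wide min/max ranges that rarely rule out a row group, and missing statistics leave the engine nothing to prune with.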
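The effect of sort order on skipping can be shown with a minimal simulation. It assumes each row group records only a minimum and maximum for the filter column and that the engine skips a row group when that range cannot overlap the predicate. The helpers `row_group_stats` and `groups_to_read` are hypothetical names for illustration, not part of any Parquet library.

```python
def row_group_stats(values, rows_per_group):
    """Return one (min, max) pair per row group, as a writer might record."""
    stats = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        stats.append((min(group), max(group)))
    return stats


def groups_to_read(stats, lo, hi):
    """Indexes of row groups whose [min, max] overlaps lo <= x <= hi."""
    return [
        index for index, (minimum, maximum) in enumerate(stats)
        if maximum >= lo and minimum <= hi
    ]


# The same 1000 values written in two different orders.
sorted_values = list(range(1000))
# Deterministic interleaving that spreads every value range over all groups.
interleaved_values = [(i % 10) * 100 + i // 10 for i in range(1000)]

sorted_stats = row_group_stats(sorted_values, rows_per_group=100)
interleaved_stats = row_group_stats(interleaved_values, rows_per_group=100)

# Predicate: 250 <= x <= 260
sorted_hits = groups_to_read(sorted_stats, 250, 260)
interleaved_hits = groups_to_read(interleaved_stats, 250, 260)

print("sorted:", len(sorted_hits), "of", len(sorted_stats), "row groups")
print("interleaved:", len(interleaved_hits), "of", len(interleaved_stats), "row groups")
```

With sorted input only one of the ten row groups overlaps the predicate. With interleaved input every row group spans almost the full value range, so none can be skipped even though the file holds the same rows and the same statistics fields.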

## Source-Mapped Facts

- Apache Parquet documentation defines a row group as a logical horizontal partitioning of data into rows. ([source](https://parquet.apache.org/docs/concepts/))
- Apache Parquet documentation says a row group consists of a column chunk for each column in the dataset. ([source](https://parquet.apache.org/docs/concepts/))
- Apache Parquet documentation says a file consists of one or more row groups, and row groups contain column chunks that contain pages. ([source](https://parquet.apache.org/docs/concepts/))
- Apache Parquet documentation says the page index contains statistics for data pages and can be used to locate pages that match a scan predicate. ([source](https://parquet.apache.org/docs/file-format/pageindex/))
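Page-level skipping within a single column chunk can be sketched the same way. This is a simplified model: `PageEntry` and `matching_row_ranges` are hypothetical names, and the sketch assumes each page has a min/max pair plus the index of its first row within the row group. Real page index structures carry more fields and are decoded by the engine's Parquet reader.

```python
from dataclasses import dataclass


@dataclass
class PageEntry:
    min_value: int
    max_value: int
    first_row_index: int  # position of the page's first row in the row group


def matching_row_ranges(pages, num_rows, lo, hi):
    """Return [start, end) row ranges of pages that may hold lo <= x <= hi."""
    ranges = []
    for i, page in enumerate(pages):
        # A page ends where the next one starts, or at the row group's end.
        end = pages[i + 1].first_row_index if i + 1 < len(pages) else num_rows
        if page.max_value >= lo and page.min_value <= hi:
            if ranges and ranges[-1][1] == page.first_row_index:
                # Merge with the previous range when the pages are adjacent.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((page.first_row_index, end))
    return ranges


pages = [
    PageEntry(min_value=0, max_value=99, first_row_index=0),
    PageEntry(min_value=100, max_value=199, first_row_index=50),
    PageEntry(min_value=200, max_value=299, first_row_index=120),
    PageEntry(min_value=300, max_value=399, first_row_index=180),
]

ranges = matching_row_ranges(pages, num_rows=250, lo=150, hi=220)
print(ranges)
```

Here only the second and third pages can contain values between 150 and 220, so the reader needs rows 50 through 179 of this column chunk and can skip the first and last pages. The resulting row ranges can then be used to select the corresponding pages in other columns of the same row group.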

## Further Reading

- [Apache Parquet Concepts](https://parquet.apache.org/docs/concepts/)
- [Apache Parquet Page Index](https://parquet.apache.org/docs/file-format/pageindex/)