Data ORC Stripes and Column Statistics

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

ORC stripe layout and column statistics determine how much of a file a query engine can skip. Data agents can use this metadata to explain why an analytical query skips or scans large parts of a file.

## Core Explanation

ORC is a columnar file format used in analytical data systems. Its metadata matters because query engines can use stripe layout, indexes, and column statistics to reduce the amount of data read for selective predicates.
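
As a rough illustration of how statistics reduce reads, the sketch below applies a min/max overlap test to per-stripe statistics for a range predicate. It is a simplified model, not an ORC reader: `StripeStats`, `may_contain`, and `plan_scan` are hypothetical names, and the stripe values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeStats:
    """Hypothetical per-stripe summary for one integer column."""
    stripe_id: int
    min_value: int
    max_value: int
    num_rows: int


def may_contain(stats, low, high):
    """True if the stripe's [min, max] range overlaps the predicate range [low, high]."""
    return stats.max_value >= low and stats.min_value <= high


def plan_scan(stripes, low, high):
    """Split stripes into those that must be read and those that can be skipped."""
    to_read = [s for s in stripes if may_contain(s, low, high)]
    skipped = [s for s in stripes if not may_contain(s, low, high)]
    return to_read, skipped


# Stripes written from data sorted by the column: value ranges do not overlap.
sorted_stripes = [
    StripeStats(0, 1, 100, 10_000),
    StripeStats(1, 101, 200, 10_000),
    StripeStats(2, 201, 300, 10_000),
]
# Stripes written from unsorted data: each stripe spans almost the full range.
unsorted_stripes = [
    StripeStats(0, 1, 298, 10_000),
    StripeStats(1, 3, 300, 10_000),
    StripeStats(2, 2, 299, 10_000),
]

# Predicate: WHERE value BETWEEN 120 AND 150
read_sorted, skipped_sorted = plan_scan(sorted_stripes, 120, 150)
read_unsorted, skipped_unsorted = plan_scan(unsorted_stripes, 120, 150)
print([s.stripe_id for s in read_sorted])    # [1]
print([s.stripe_id for s in read_unsorted])  # [0, 1, 2]
```

With the same predicate, the sorted layout lets the reader skip two of three stripes, while the unsorted layout forces a full scan even though statistics are present.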

Agents should capture the file format version, stripe count, stripe sizes, row indexes, compression, schema, column statistics, and the query engine's predicate-pushdown behavior. A slow query can be caused by missing or unhelpful statistics (for example, min/max ranges so wide that every stripe matches), oversized stripes, incompatible readers, or predicates that cannot use the available metadata.
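
One way an agent might record this metadata and turn it into findings is sketched below. Everything here is an assumption made for illustration: `OrcFileReport` and `diagnose` are hypothetical names, the field values are placeholders, and the 256 MiB stripe threshold is arbitrary rather than a documented ORC limit.

```python
from dataclasses import dataclass, field


@dataclass
class OrcFileReport:
    """Hypothetical record of the ORC metadata an agent collects for one file."""
    path: str
    format_version: str
    compression: str
    schema: str
    stripe_sizes_bytes: list
    has_row_index: bool
    columns_with_stats: set = field(default_factory=set)
    pushdown_enabled: bool = True  # as reported by the query engine


def diagnose(report, predicate_columns, max_stripe_bytes=256 * 1024 * 1024):
    """Return human-readable findings; max_stripe_bytes is an illustrative threshold."""
    findings = []
    if not report.pushdown_enabled:
        findings.append("predicate pushdown is disabled in the engine")
    missing = sorted(set(predicate_columns) - report.columns_with_stats)
    if missing:
        findings.append("no statistics for predicate columns: " + ", ".join(missing))
    if not report.has_row_index:
        findings.append("row-level index is absent")
    oversized = [i for i, size in enumerate(report.stripe_sizes_bytes)
                 if size > max_stripe_bytes]
    if oversized:
        findings.append(f"oversized stripes: {oversized}")
    return findings


report = OrcFileReport(
    path="example.orc",  # placeholder path
    format_version="0.12",
    compression="ZLIB",
    schema="struct<id:bigint,ts:timestamp>",
    stripe_sizes_bytes=[64 * 1024 * 1024, 512 * 1024 * 1024],
    has_row_index=True,
    columns_with_stats={"id"},
)
for finding in diagnose(report, predicate_columns=["ts"]):
    print(finding)
```

Keeping the collected metadata in one structured record makes each finding traceable to a specific field, which is easier to audit than free-text notes.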

## Source-Mapped Facts

- The Apache ORC specification describes ORC file content as divided into stripes. ([source](https://orc.apache.org/specification/ORCv1/))
- The Apache ORC specification says each stripe contains index data, row data, and a stripe footer. ([source](https://orc.apache.org/specification/ORCv1/))
- The Apache ORC indexes documentation describes file-level, stripe-level, and row-level indexes as statistics about column values. ([source](https://orc.apache.org/docs/indexes.html))
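
The three index levels listed above can be pictured as nested min/max checks, each narrowing what the next level has to examine. The following is a toy model under stated assumptions: the class names, value ranges, and row counts are invented for the example, and real readers evaluate richer statistics than a single integer range.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    """Min/max statistics for one column at some level of the file."""
    lo: int
    hi: int

    def overlaps(self, low, high):
        return self.hi >= low and self.lo <= high


@dataclass
class RowGroup:
    stats: Range
    num_rows: int


@dataclass
class Stripe:
    stats: Range
    row_groups: list = field(default_factory=list)


@dataclass
class OrcFileIndex:
    stats: Range
    stripes: list = field(default_factory=list)


def rows_to_read(file_index, low, high):
    """Count rows that survive file-, stripe-, and row-level pruning."""
    if not file_index.stats.overlaps(low, high):
        return 0  # file-level statistics rule out the whole file
    total = 0
    for stripe in file_index.stripes:
        if not stripe.stats.overlaps(low, high):
            continue  # stripe-level statistics rule out this stripe
        for group in stripe.row_groups:
            if group.stats.overlaps(low, high):
                total += group.num_rows  # row-level statistics keep this group
    return total


index = OrcFileIndex(
    stats=Range(1, 400),
    stripes=[
        Stripe(Range(1, 200), [RowGroup(Range(1, 100), 1000),
                               RowGroup(Range(101, 200), 1000)]),
        Stripe(Range(201, 400), [RowGroup(Range(201, 300), 1000),
                                 RowGroup(Range(301, 400), 1000)]),
    ],
)

print(rows_to_read(index, 120, 150))  # 1000: one row group overlaps
print(rows_to_read(index, 500, 600))  # 0: outside the file-level range
```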

## Further Reading

- [ORC Specification v1](https://orc.apache.org/specification/ORCv1/)
- [Apache ORC Indexes](https://orc.apache.org/docs/indexes.html)