Data Lake File Skipping and Data Skipping Indexes

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

File skipping lets a data lake query engine avoid reading files that metadata proves cannot satisfy a filter.

## Core Explanation

Open table formats and columnar files can store metadata such as min values, max values, null counts, dictionaries, and bloom filters. When a query predicate can be evaluated against that metadata (for example, an equality or range comparison on a column that has min/max statistics), the engine can prune files or row groups before scanning data.
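
The pruning decision described above can be sketched in a few lines. This is a minimal illustration, not any engine's actual implementation: `FileStats`, `may_contain_range`, and `prune_files` are hypothetical names, and the file paths and statistics are made-up sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: int
    max_value: int


def may_contain_range(stats: FileStats, lo: int, hi: int) -> bool:
    """True if the file's [min, max] interval overlaps the filter [lo, hi]."""
    return stats.max_value >= lo and stats.min_value <= hi


def prune_files(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Return only the paths that must be read for `lo <= col <= hi`."""
    return [f.path for f in files if may_contain_range(f, lo, hi)]


# Made-up sample metadata for three files.
FILES = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Equality filter col = 150 is the range [150, 150].
equality_scan = prune_files(FILES, 150, 150)
# Range filter 90 <= col <= 110 straddles two files.
range_scan = prune_files(FILES, 90, 110)
```

Here `equality_scan` keeps only `part-1.parquet`, and `range_scan` keeps `part-0.parquet` and `part-1.parquet`. Statistics can only prove that a file is irrelevant; a file that survives pruning may still contain no matching rows.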

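Agents diagnosing slow data-lake queries should inspect the filter predicates, table statistics, clustering or z-ordering, partition layout, file sizes, and whether the engine reports any files as pruned. A table can have the right data and still scan too much if the layout prevents effective skipping: when values of the filter column are spread across every file, each file's min/max range covers the filter value and no file can be ruled out.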
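
To make the layout effect concrete, the sketch below writes the same 1,000 values into ten files in two ways and counts how many files a range filter must read. The layouts, function names, and numbers are illustrative assumptions, not measurements from a real table.

```python
def file_ranges(files: list[list[int]]) -> list[tuple[int, int]]:
    """Compute the (min, max) statistic each file would record."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(ranges: list[tuple[int, int]], lo: int, hi: int) -> int:
    """Count files whose [min, max] overlaps the filter [lo, hi]."""
    return sum(1 for mn, mx in ranges if mx >= lo and mn <= hi)


values = list(range(1000))
num_files = 10

# Scattered layout: value i lands in file i % 10, so every file
# spans almost the whole value range.
scattered = [values[i::num_files] for i in range(num_files)]

# Clustered layout: values are sorted and split into contiguous chunks,
# so each file covers a narrow, non-overlapping range.
clustered = [values[i * 100:(i + 1) * 100] for i in range(num_files)]

# Filter: 420 <= col <= 429
scattered_scan = files_to_scan(file_ranges(scattered), 420, 429)
clustered_scan = files_to_scan(file_ranges(clustered), 420, 429)
```

With the scattered layout all 10 files overlap the filter, while the clustered layout needs only 1. The data is identical in both cases; only the arrangement of rows across files changes what the statistics can rule out.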

## Source-Mapped Facts

- Delta Lake file skipping documentation says Delta tables store file-level metadata that allows query engines to skip files that cannot contain data relevant to a query. ([source](https://delta-io.github.io/delta-rs/how-delta-lake-works/delta-lake-file-skipping/))
- Databricks data skipping documentation says data skipping information is collected automatically when data is written into a Delta table. ([source](https://docs.databricks.com/aws/en/delta/data-skipping))
- Apache Parquet bloom filter documentation says column statistics and dictionaries can be used for predicate pushdown. ([source](https://parquet.apache.org/docs/file-format/bloomfilter/))
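
Bloom filters, the subject of the Parquet page cited in the last bullet, answer "might this file contain value X?" and never produce a false negative. The sketch below shows the general idea with a classic bloom filter. It is not Parquet's on-disk format, and the class name, bit count, and hashing scheme are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Simplified bloom filter; illustrative only, not Parquet's layout."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Derive several bit positions by hashing the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# One filter per file, built from a high-cardinality column (sample data).
file_values = {
    "part-0.parquet": ["user-17", "user-42"],
    "part-1.parquet": ["user-99"],
}
filters = {}
for path, vals in file_values.items():
    bf = BloomFilter()
    for v in vals:
        bf.add(v)
    filters[path] = bf

# Files that cannot be skipped for the filter user_id = 'user-42'.
candidates = [p for p, bf in filters.items() if bf.might_contain("user-42")]
```

`candidates` always includes `part-0.parquet`, because a value that was added is always reported as possibly present. Other files are usually excluded, but a false positive can occasionally keep one in the list, which costs extra reading and never affects correctness.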

## Further Reading

- [Delta Lake File Skipping](https://delta-io.github.io/delta-rs/how-delta-lake-works/delta-lake-file-skipping/)
- [Databricks Data Skipping](https://docs.databricks.com/aws/en/delta/data-skipping)
- [Apache Parquet Bloom Filter](https://parquet.apache.org/docs/file-format/bloomfilter/)