Data Lake File Skipping and Data Skipping Indexes
Status: public · Confidence: medium (0.725) · Basis: verified_sources
## TL;DR

File skipping lets a data lake query engine avoid reading files that metadata proves cannot satisfy a filter.

## Core Explanation

Open table formats and columnar files can store metadata such as min values, max values, null counts, dictionaries, and bloom filters. When a query predicate is compatible with that metadata, the engine can prune files or row groups before scanning data.

Agents diagnosing slow data-lake queries should inspect the filter predicates, table statistics, clustering or z-ordering, partition layout, file sizes, and whether the engine reports files pruned. A table can have the right data and still scan too much if the layout prevents effective skipping.

## Source-Mapped Facts

- Delta Lake file skipping documentation says Delta tables store file-level metadata that allows query engines to skip files that cannot contain data relevant to a query. ([source](https://delta-io.github.io/delta-rs/how-delta-lake-works/delta-lake-file-skipping/))
- Databricks data skipping documentation says data skipping information is collected automatically when data is written into a Delta table. ([source](https://docs.databricks.com/aws/en/delta/data-skipping))
- Apache Parquet bloom filter documentation says column statistics and dictionaries can be used for predicate pushdown. ([source](https://parquet.apache.org/docs/file-format/bloomfilter/))

## Further Reading

- [Delta Lake File Skipping](https://delta-io.github.io/delta-rs/how-delta-lake-works/delta-lake-file-skipping/)
- [Databricks Data Skipping](https://docs.databricks.com/aws/en/delta/data-skipping)
- [Apache Parquet Bloom Filter](https://parquet.apache.org/docs/file-format/bloomfilter/)
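
## Illustrative Sketch

The min/max pruning described in the Core Explanation can be sketched in a few lines of Python. This is a simplified model, not the implementation used by Delta Lake, Databricks, or Parquet readers: the `FileStats` class, the `may_match_equals` and `files_to_scan` helpers, and the file names are all hypothetical, and only a single integer column with an equality predicate is considered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: Optional[int]  # None means the statistic was not recorded
    max_value: Optional[int]
    null_count: int
    row_count: int


def may_match_equals(stats: FileStats, value: int) -> bool:
    """Return False only when the metadata proves no row can equal `value`."""
    # An empty file, or one where every value is null, cannot match `col = value`.
    if stats.row_count == 0 or stats.null_count == stats.row_count:
        return False
    # Without min/max statistics nothing can be proven, so the file must be read.
    if stats.min_value is None or stats.max_value is None:
        return True
    # The file can only contain `value` if it lies inside the [min, max] range.
    return stats.min_value <= value <= stats.max_value


def files_to_scan(files: List[FileStats], value: int) -> List[str]:
    """Keep the files that survive pruning for the predicate `col = value`."""
    return [f.path for f in files if may_match_equals(f, value)]


# Same logical data, two layouts (assumed values for illustration).
clustered = [
    FileStats("part-0.parquet", 1, 100, 0, 100),
    FileStats("part-1.parquet", 101, 200, 0, 100),
    FileStats("part-2.parquet", 201, 300, 0, 100),
]
interleaved = [
    FileStats("part-0.parquet", 1, 298, 0, 100),
    FileStats("part-1.parquet", 2, 299, 0, 100),
    FileStats("part-2.parquet", 3, 300, 0, 100),
]

print(files_to_scan(clustered, 150))    # only part-1.parquet survives pruning
print(files_to_scan(interleaved, 150))  # every file's range overlaps 150
```

In the clustered layout the value ranges do not overlap, so a point lookup reads one file. In the interleaved layout every file spans nearly the full range, so the same statistics prune nothing, which is the "right data, wrong layout" case described above.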