Data Partition Pruning and Query Scanning

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Data agents should explain whether a query prunes partitions or scans unnecessary data before claiming a warehouse query is efficient.

## Core Explanation

Partitioning groups data, typically by the value of a column or a transform of it, so query engines can skip files or partitions that cannot match a query. Pruning only works when the query predicate is compatible with the table's partitioning scheme and the engine can map that predicate to a specific set of partitions.
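The idea can be illustrated with a minimal sketch in plain Python. It is a toy model, not any engine's implementation: `PARTITIONS`, `prune_by_range`, and `scan_all` are hypothetical names, and the table is assumed to have one partition per day.

```python
from datetime import date

# Hypothetical table layout: one partition per day, mapped to the files it owns.
PARTITIONS = {
    date(2024, 1, 1): ["part-000.parquet"],
    date(2024, 1, 2): ["part-001.parquet", "part-002.parquet"],
    date(2024, 1, 3): ["part-003.parquet"],
}


def prune_by_range(partitions, start, end):
    """Return only the files whose partition key falls in [start, end].

    The predicate is expressed on the partition key itself, so whole
    partitions are skipped without reading any rows.
    """
    return [
        f
        for key, files in sorted(partitions.items())
        if start <= key <= end
        for f in files
    ]


def scan_all(partitions):
    """Fallback when a predicate cannot be mapped to partition keys:
    every file has to be read and filtered row by row."""
    return [f for _, files in sorted(partitions.items()) for f in files]


pruned = prune_by_range(PARTITIONS, date(2024, 1, 2), date(2024, 1, 2))
full = scan_all(PARTITIONS)
print(len(pruned), "of", len(full), "files read")  # prints: 2 of 4 files read
```

The difference between the two functions is the point: the same logical filter costs two file reads when it is stated on the partition key and four when it is not.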

Agents generating SQL should keep date, tenant, region, or event-time filters in a form the engine can optimize. Before concluding that a slow query needs more compute, they should inspect the partition column or transform, the table metadata, the dry-run output or estimated bytes scanned, and the explain plan.
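One way an agent might turn those inspections into a verdict is sketched below. The `ScanEstimate` fields, the `assess_scan` function, and the 25% threshold are all assumptions made for illustration; in practice the numbers would come from whatever dry-run, explain-plan, or metadata facility the engine provides.

```python
from dataclasses import dataclass


@dataclass
class ScanEstimate:
    """Hypothetical container for numbers gathered from a dry run,
    explain plan, or table metadata. Field names are illustrative only."""
    bytes_scanned: int
    table_bytes: int
    has_partition_filter: bool


def assess_scan(est: ScanEstimate, max_fraction: float = 0.25) -> str:
    """Classify a query before anyone suggests adding compute.

    max_fraction is an arbitrary illustrative threshold, not an engine default.
    """
    if est.table_bytes <= 0:
        return "unknown: table size unavailable"
    fraction = est.bytes_scanned / est.table_bytes
    if not est.has_partition_filter:
        return f"no partition filter: scans {fraction:.0%} of the table"
    if fraction > max_fraction:
        return f"filter present but scans {fraction:.0%}: check that pruning applies"
    return f"pruned: scans {fraction:.0%} of the table"


print(assess_scan(ScanEstimate(5_000, 100_000, True)))
# prints: pruned: scans 5% of the table
print(assess_scan(ScanEstimate(100_000, 100_000, False)))
# prints: no partition filter: scans 100% of the table
```

The middle branch matters most: a filter that is present but still scans most of the table suggests the predicate is written in a form the engine could not use for pruning.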

## Source-Mapped Facts

- BigQuery documentation says partition pruning lets BigQuery scan matching partitions and skip the remaining partitions when a qualifying filter uses the partitioning column. ([source](https://docs.cloud.google.com/bigquery/docs/querying-partitioned-tables))
- Apache Iceberg documentation says hidden partitioning lets Iceberg produce partition values from a column value and track the relationship. ([source](https://iceberg.apache.org/docs/latest/partitioning/))
- Spark documentation says partitioned table data is commonly stored in directories with partitioning column values encoded in each partition directory path. ([source](https://spark.apache.org/docs/4.0.0/sql-data-sources-parquet.html))
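The directory-encoded layout described in the Spark fact can be sketched in a few lines of Python. This is only an illustration of the `key=value` path convention, not Spark's partition discovery code; the file listing and the helper names `partition_values` and `select_files` are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Extract key=value pairs encoded in the directory components of a path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def select_files(paths, **wanted):
    """Keep only files whose directory-encoded values match every wanted pair."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for illustration.
FILES = [
    "events/date=2024-01-01/region=us/part-0.parquet",
    "events/date=2024-01-01/region=eu/part-0.parquet",
    "events/date=2024-01-02/region=us/part-0.parquet",
]

print(partition_values(FILES[0]))
# prints: {'date': '2024-01-01', 'region': 'us'}
print(select_files(FILES, date="2024-01-01", region="us"))
# prints: ['events/date=2024-01-01/region=us/part-0.parquet']
```

Because the partition values live in the path, files can be excluded from a scan by inspecting names alone, without opening any data file.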

## Further Reading

- [BigQuery Query Partitioned Tables](https://docs.cloud.google.com/bigquery/docs/querying-partitioned-tables)
- [Apache Iceberg Partitioning](https://iceberg.apache.org/docs/latest/partitioning/)
- [Spark Parquet Partition Discovery](https://spark.apache.org/docs/4.0.0/sql-data-sources-parquet.html)