# Data Warehouse Partition Pruning and Clustering

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Partition pruning and clustering determine whether a warehouse query reads only the data its filters can match or pays to scan partitions and blocks it does not need.

## Core Explanation

Data warehouses use physical layout and metadata to reduce scan cost. Partition pruning skips whole partitions that cannot match a query's filter. Clustering and sort keys store related rows near each other, so a filter on those columns can skip blocks outright or read fewer of them.
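
As a rough illustration of partition pruning, the sketch below models a table as a list of daily partitions and keeps only those a date filter can match. The `Partition` class, the `prune` helper, and the sizes are hypothetical. Real engines decide this from their own metadata, so treat it as a toy model rather than any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    """One daily partition of a hypothetical events table."""
    day: date        # partitioning column value shared by every row
    size_bytes: int  # bytes a full read of this partition would scan


def prune(partitions, start, end):
    """Keep only partitions whose day falls inside [start, end]."""
    return [p for p in partitions if start <= p.day <= end]


def scanned_bytes(partitions):
    """Total bytes read if every listed partition is scanned in full."""
    return sum(p.size_bytes for p in partitions)


# Thirty daily partitions of 1 GB each (made-up sizes).
table = [Partition(date(2024, 1, d), 1_000_000_000) for d in range(1, 31)]

# A filter on the partitioning column lets the scan skip 27 partitions.
kept = prune(table, date(2024, 1, 10), date(2024, 1, 12))
print(len(kept), scanned_bytes(kept))  # 3 3000000000
print(scanned_bytes(table))            # 30000000000 without pruning
```

In this model, a filter on any column other than `day` gives `prune` nothing to work with, so every partition would be read.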

Agents should inspect actual query predicates and execution metadata before recommending partition or clustering changes. A good recommendation names the table, the filter pattern, the bytes scanned, the existing layout, and the maintenance cost of the change.
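
One way to keep recommendations complete is to record them in a fixed structure and flag anything left blank. The sketch below is a minimal example. The `LayoutRecommendation` class, the `missing_fields` helper, the `proposed_change` field, and all sample values are assumptions made for illustration, not part of any warehouse API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LayoutRecommendation:
    """Hypothetical record of a partitioning or clustering recommendation."""
    table: str             # table the change applies to
    filter_pattern: str    # predicate shape observed in real queries
    bytes_scanned: int     # bytes scanned, taken from execution metadata
    existing_layout: str   # current partitioning or clustering
    maintenance_cost: str  # expected cost of keeping the new layout
    proposed_change: str   # assumed extra field: the change being proposed


def missing_fields(rec):
    """Return names of fields that are blank strings or non-positive numbers."""
    missing = []
    for f in fields(rec):
        value = getattr(rec, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
        elif isinstance(value, int) and value <= 0:
            missing.append(f.name)
    return missing


rec = LayoutRecommendation(
    table="analytics.events",  # made-up table name
    filter_pattern="WHERE event_date BETWEEN ? AND ?",
    bytes_scanned=30_000_000_000,
    existing_layout="unpartitioned",
    maintenance_cost="",       # left blank on purpose
    proposed_change="partition by event_date",
)
print(missing_fields(rec))  # ['maintenance_cost']
```

A recommendation with a non-empty `missing_fields` result is not ready to act on, because at least one of the inputs named above has not been checked.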

## Source-Mapped Facts

- BigQuery documentation describes partition pruning as scanning only relevant partitions when filters use the partitioning column. ([source](https://cloud.google.com/bigquery/docs/querying-partitioned-tables))
- Snowflake documentation describes clustering keys as a way to co-locate similar rows in the same micro-partitions. ([source](https://docs.snowflake.com/en/user-guide/tables-clustering-keys))
- Amazon Redshift documentation says sort keys determine the order in which rows are stored. ([source](https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html))
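
To see why row order matters for the clustering and sort-key facts above, the sketch below splits the same values into fixed-size blocks twice, once shuffled and once sorted, and counts how many blocks a range filter must read when each block carries only its minimum and maximum. The block size, the helper names, and the per-block min/max metadata are simplifying assumptions. This is a toy model, not Snowflake's or Redshift's actual storage format.

```python
import random


def make_blocks(values, block_size):
    """Split rows into fixed-size blocks and record each block's min and max."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks


def blocks_read(blocks, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for bmin, bmax, _ in blocks if bmax >= lo and bmin <= hi)


rng = random.Random(0)             # fixed seed so runs are repeatable
customer_ids = list(range(10_000)) # hypothetical filter column
rng.shuffle(customer_ids)

unsorted_blocks = make_blocks(customer_ids, 1_000)
sorted_blocks = make_blocks(sorted(customer_ids), 1_000)

# Sorted layout: only the block covering 2000..2999 overlaps the filter.
print(blocks_read(sorted_blocks, 2_000, 2_999))    # 1
# Shuffled layout: block ranges are wide, so far more blocks overlap.
print(blocks_read(unsorted_blocks, 2_000, 2_999))
```

The same data and the same filter produce very different read counts, which is the effect a clustering or sort key is meant to achieve for the columns queries filter on most.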

## Further Reading

- [BigQuery Query Partitioned Tables](https://cloud.google.com/bigquery/docs/querying-partitioned-tables)
- [Snowflake Clustering Keys](https://docs.snowflake.com/en/user-guide/tables-clustering-keys)
- [Amazon Redshift Sort Keys](https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html)