Data Table Maintenance, Vacuum, and Retention

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Lakehouse table maintenance controls storage growth and query planning cost, but aggressive cleanup can break time travel and rollback, and can fail streaming readers that lag behind the retention window.

## Core Explanation

Open table formats keep metadata and data versions so readers can see consistent snapshots. Over time, old snapshots, unreferenced files, small files, and obsolete metadata accumulate. Maintenance jobs remove or compact these artifacts under retention rules.

Agents should inspect the table format, latest snapshot, retention threshold, streaming checkpoint lag, active jobs, orphan-file policy, and rollback requirements before recommending vacuum or cleanup.
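The inspection above can be sketched as a small pre-flight check. This is a minimal illustration, not an API from any table format: `TableState`, its fields, and `cleanup_blockers` are hypothetical names, and the sketch covers only the checks that compare against the retention threshold (checkpoint lag, active jobs, rollback requirements).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class TableState:
    """Hypothetical summary of facts gathered before recommending cleanup."""
    table_format: str               # e.g. "delta", "iceberg", "hudi"
    retention: timedelta            # proposed retention threshold
    max_checkpoint_lag: timedelta   # how far the slowest streaming reader is behind
    longest_active_job: timedelta   # age of the oldest snapshot an active job still reads
    rollback_window: timedelta      # how far back rollback or time travel must reach


def cleanup_blockers(state: TableState) -> list[str]:
    """Return reasons the proposed retention is too short; an empty list means none found."""
    blockers = []
    if state.retention < state.max_checkpoint_lag:
        blockers.append("streaming reader lags beyond retention")
    if state.retention < state.longest_active_job:
        blockers.append("active job reads a snapshot older than retention")
    if state.retention < state.rollback_window:
        blockers.append("rollback window exceeds retention")
    return blockers


# Example: a 7-day retention with a streaming reader that is 9 days behind.
state = TableState(
    table_format="delta",
    retention=timedelta(days=7),
    max_checkpoint_lag=timedelta(days=9),
    longest_active_job=timedelta(hours=2),
    rollback_window=timedelta(days=3),
)
print(cleanup_blockers(state))
```

Returning a list of reasons rather than a single boolean lets the caller report every blocker at once, which is usually more useful when deciding whether to lengthen retention or wait for a reader to catch up.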

## Source-Mapped Facts

- Delta Lake documentation says vacuum removes files no longer referenced by a Delta table and older than the retention threshold. ([source](https://docs.delta.io/delta-utility/))
- Apache Iceberg documentation says snapshots accumulate until they are expired and that regular snapshot expiration is recommended. ([source](https://iceberg.apache.org/docs/latest/maintenance/))
- Apache Hudi documentation describes cleaning as a table service used to reclaim space occupied by older versions of data. ([source](https://hudi.apache.org/docs/cleaning/))
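The Delta Lake fact above combines two conditions: a file must be unreferenced and older than the retention threshold. The toy model below illustrates that rule only; it is not Delta Lake's implementation, and `DataFile`, `vacuum_candidates`, and the use of a modification timestamp as the file's age are assumptions made for the sketch. Iceberg snapshot expiration and Hudi cleaning follow their own rules and are not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataFile:
    """Hypothetical record for a file found under the table's storage path."""
    path: str
    modified_at: datetime


def vacuum_candidates(files, referenced_paths, retention, now):
    """Return paths that are both unreferenced and older than the retention threshold."""
    cutoff = now - retention
    return [
        f.path
        for f in files
        if f.path not in referenced_paths and f.modified_at < cutoff
    ]


now = datetime(2024, 1, 15)
files = [
    DataFile("part-0.parquet", datetime(2024, 1, 1)),   # old and unreferenced
    DataFile("part-1.parquet", datetime(2024, 1, 1)),   # old but still referenced
    DataFile("part-2.parquet", datetime(2024, 1, 14)),  # unreferenced but recent
]
print(vacuum_candidates(files, {"part-1.parquet"}, timedelta(days=7), now))
```

Only `part-0.parquet` satisfies both conditions. The recent unreferenced file is kept because a reader or in-flight writer may still depend on it, which is the reason a retention threshold exists at all.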

## Further Reading

- [Delta Lake Table Utility Commands](https://docs.delta.io/delta-utility/)
- [Apache Iceberg Maintenance](https://iceberg.apache.org/docs/latest/maintenance/)
- [Apache Hudi Cleaning](https://hudi.apache.org/docs/cleaning/)