# Data Lake Compaction and Small Files
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured


## TL;DR

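Data lake compaction reduces small-file overhead by rewriting many small files into fewer large ones, but agents must use the table format's own compaction procedure and account for concurrent writers before recommending it.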

## Core Explanation

Streaming ingestion and frequent small writes can produce many small files. Each file adds an entry to the table metadata and a separate open and read during a scan, so query planning slows down and scans become inefficient. Compaction rewrites smaller files into larger files so query engines and table metadata have less work to do.
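To make the rewrite step concrete, here is a minimal sketch of bin-packing small files into groups that each fit a target size. It is a plain first-fit-decreasing heuristic written for illustration, not the algorithm any particular table format uses. The names `DataFile` and `plan_compaction` and both size parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(
    files: List[DataFile],
    target_bytes: int,
    small_threshold_bytes: int,
) -> List[List[DataFile]]:
    """Group small files so each group's total size is at most target_bytes.

    Illustrative first-fit-decreasing heuristic; real engines apply their own rules.
    """
    # Only files below the threshold are candidates for a rewrite.
    small = sorted(
        (f for f in files if f.size_bytes < small_threshold_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each bin is [total_size, [files]]
    for f in small:
        for b in bins:
            if b[0] + f.size_bytes <= target_bytes:
                b[0] += f.size_bytes
                b[1].append(f)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([f.size_bytes, [f]])
    # A group with a single file gains nothing from being rewritten.
    return [group for _, group in bins if len(group) > 1]


files = [
    DataFile("a.parquet", 60),
    DataFile("b.parquet", 50),
    DataFile("c.parquet", 40),
    DataFile("d.parquet", 30),
    DataFile("e.parquet", 200),  # already large, left alone
]
groups = plan_compaction(files, target_bytes=100, small_threshold_bytes=100)
```

Each returned group would become one output file. Sizes are in arbitrary units here to keep the example readable.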

Agents should not run compaction just because file counts are high. Before recommending it, they need to know the table format, partition layout, file size distribution, retention policy, and whether writers are currently active. A safe recommendation names the table, the partition range, the procedure to run, the expected benefit, and a rollback or restore path.
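The checklist above can be sketched as a gate that reports what is still missing before a recommendation is made. Everything here is hypothetical: the `TableState` fields, the `blockers` function, and the `min_small_fraction` threshold are illustrative names, and treating any active writer as a blocker is a deliberately conservative assumption rather than a requirement of the table formats.

```python
from dataclasses import dataclass
from typing import List, Optional

KNOWN_FORMATS = {"delta", "iceberg", "hudi"}


@dataclass
class TableState:
    table: str
    table_format: Optional[str]
    partition_range: Optional[str]
    small_file_fraction: Optional[float]  # share of files below the size target
    retention_policy_known: bool
    active_writers: Optional[int]         # None means the status was not checked
    restore_path: Optional[str]


def blockers(state: TableState, min_small_fraction: float = 0.5) -> List[str]:
    """Return reasons not to recommend compaction yet; an empty list means proceed."""
    reasons = []
    if state.table_format not in KNOWN_FORMATS:
        reasons.append("unknown table format")
    if not state.partition_range:
        reasons.append("no partition range selected")
    if state.small_file_fraction is None:
        reasons.append("file size distribution not measured")
    elif state.small_file_fraction < min_small_fraction:
        reasons.append("too few small files to justify a rewrite")
    if not state.retention_policy_known:
        reasons.append("retention policy unknown")
    if state.active_writers is None:
        reasons.append("active writer status unknown")
    elif state.active_writers > 0:
        # Conservative assumption: wait for writers to finish.
        reasons.append("writers are active on the table")
    if not state.restore_path:
        reasons.append("no rollback or restore path")
    return reasons


ready = TableState("events", "iceberg", "date=2024-01-01..2024-01-07",
                   0.8, True, 0, "snapshot before rewrite")
busy = TableState("events", "iceberg", "date=2024-01-01..2024-01-07",
                  0.8, True, 2, None)
```

Returning reasons instead of a boolean lets the agent explain why it declined, which matters as much as the recommendation itself.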

## Source-Mapped Facts

- Delta Lake documentation describes bin-packing optimization as compacting small files into larger files. ([source](https://docs.delta.io/latest/optimizations-oss.html))
- Apache Iceberg documentation describes rewrite_data_files as a procedure for rewriting data files. ([source](https://iceberg.apache.org/docs/1.7.1/spark-procedures/#rewrite_data_files))
- Apache Hudi documentation describes file sizing as a mechanism for managing small files. ([source](https://hudi.apache.org/docs/file_sizing))
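
As a rough illustration of how these three mechanisms are usually invoked, the sketch below builds the statements as strings without executing them. The helper names, the catalog and table names, and the Hudi values are hypothetical. The statement shapes follow the linked documentation (`OPTIMIZE ... WHERE` for Delta Lake, `CALL <catalog>.system.rewrite_data_files(...)` for Iceberg, and the `hoodie.parquet.*` file sizing options for Hudi), and they should be checked against the version in use.

```python
from typing import Optional


def delta_optimize_sql(table: str, where: Optional[str] = None) -> str:
    """Build a Delta Lake OPTIMIZE statement; `where` should be a partition predicate."""
    sql = f"OPTIMIZE {table}"
    if where:
        sql += f" WHERE {where}"
    return sql


def iceberg_rewrite_sql(catalog: str, table: str, where: Optional[str] = None) -> str:
    """Build an Iceberg rewrite_data_files call.

    Assumes `where` contains no single quotes, since it is embedded in a
    single-quoted SQL string; string literals inside it use double quotes.
    """
    args = [f"table => '{table}'"]
    if where:
        args.append(f"where => '{where}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hudi sizes files at write time through writer options.
# The byte values below are illustrative, not tuning advice.
HUDI_FILE_SIZING_OPTIONS = {
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

delta_sql = delta_optimize_sql("events", "date >= '2024-01-01'")
iceberg_sql = iceberg_rewrite_sql("my_catalog", "db.events", 'date >= "2024-01-01"')
```

With a Spark session configured for the relevant format, the two SQL strings could be passed to `spark.sql(...)`, and the Hudi options passed to the writer. Neither step is shown here because it depends on the deployment.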

## Further Reading

- [Delta Lake Optimizations](https://docs.delta.io/latest/optimizations-oss.html)
- [Apache Iceberg Spark Procedures](https://iceberg.apache.org/docs/1.7.1/spark-procedures/#rewrite_data_files)
- [Apache Hudi File Sizing](https://hudi.apache.org/docs/file_sizing)