Data Catalogs and Metadata Lineage

Status: public · Confidence: medium (0.865) · Basis: verified_sources

## TL;DR

Data catalogs help people and agents discover data assets, while metadata lineage records how datasets are produced, transformed, and consumed.

## Core Explanation

A useful catalog combines business metadata, technical metadata, ownership, quality signals, and lineage. Lineage supplies the graph structure: jobs consume input datasets and produce output datasets, and each job run emits metadata about schemas, ownership, and versions.
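To make that structure concrete, here is a minimal sketch of a catalog entry and a job run held as plain Python dataclasses. The class names (`Dataset`, `JobRun`), their fields, and the sample asset names are hypothetical illustrations, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A catalog entry: business and technical metadata for one data asset."""
    name: str
    owner: str
    description: str = ""
    schema: tuple = ()  # (column, type) pairs


@dataclass
class JobRun:
    """One observed run of a job: what it read, what it wrote, which code ran."""
    job: str
    run_id: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    code_version: str = ""


def lineage_edges(runs):
    """Derive (input dataset, job, output dataset) edges from observed runs."""
    edges = set()
    for run in runs:
        for src in run.inputs:
            for dst in run.outputs:
                edges.add((src, run.job, dst))
    return sorted(edges)


# Hypothetical sample data for illustration only.
catalog = {
    "raw.orders": Dataset("raw.orders", owner="ingest-team"),
    "analytics.daily_revenue": Dataset(
        "analytics.daily_revenue",
        owner="finance-data",
        schema=(("day", "date"), ("revenue", "decimal")),
    ),
}
runs = [
    JobRun(
        job="build_daily_revenue",
        run_id="run-001",
        inputs=["raw.orders"],
        outputs=["analytics.daily_revenue"],
        code_version="abc123",
    )
]
edges = lineage_edges(runs)
```

Keeping catalog entries and runs separate mirrors the split in the text: the catalog describes what an asset is, while the runs record how it came to exist.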

For AI and analytics systems, lineage matters because it answers impact questions before a change ships: which downstream tables, dashboards, models, or agents depend on this dataset?
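The impact question above amounts to a reachability query on the lineage graph. The sketch below answers it with a breadth-first traversal; the `DOWNSTREAM` adjacency map and every asset name in it are made up for illustration, and a real catalog would load these edges from its lineage store.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["analytics.daily_revenue", "ml.churn_features"],
    "analytics.daily_revenue": ["dashboard.revenue_overview"],
    "ml.churn_features": ["model.churn_v2"],
}


def downstream_impact(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = {start}  # guards against cycles and repeated visits
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact(DOWNSTREAM, "staging.orders_clean")
print(impacted)
# ['analytics.daily_revenue', 'ml.churn_features',
#  'dashboard.revenue_overview', 'model.churn_v2']
```

Breadth-first order lists direct dependents first, which is usually the order in which owners need to be notified about a change.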

## Source-Mapped Facts

- The OpenLineage documentation says its object model contains Jobs and Datasets and is designed to observe datasets as they move through complex pipelines. ([source](https://openlineage.io/docs/spec/object-model/))
- The OpenLineage documentation says a lineage graph can be created by weaving together observations of jobs across multiple platforms. ([source](https://openlineage.io/docs/spec/object-model/))
- The DataHub lineage documentation defines data lineage as a map showing how data flows through an organization, including where data originates, travels, and ends up. ([source](https://docs.datahub.com/docs/features/feature-guides/lineage))
- The W3C PROV overview defines provenance as information about entities, activities, and people involved in producing a piece of data or thing. ([source](https://www.w3.org/TR/prov-overview/))
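
The idea of weaving job observations into one graph can be sketched as follows. The observation dictionaries are simplified and loosely shaped like lineage events, with a job plus input and output datasets identified by namespace and name. They are assumptions made for illustration and are not validated against the OpenLineage schema. The namespaces (`scheduler`, `batch`, `warehouse`) and the `weave` helper are hypothetical.

```python
from collections import defaultdict

# Simplified observations reported by two hypothetical platforms.
observations = [
    {
        "job": {"namespace": "scheduler", "name": "load_orders"},
        "inputs": [],
        "outputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    },
    {
        "job": {"namespace": "batch", "name": "clean_orders"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "staging.orders_clean"}],
    },
]


def key(entity):
    """Build a platform-independent identifier from namespace and name."""
    return f'{entity["namespace"]}/{entity["name"]}'


def weave(observations):
    """Link jobs wherever one job writes a dataset that another job reads."""
    producers = defaultdict(set)  # dataset -> jobs that write it
    consumers = defaultdict(set)  # dataset -> jobs that read it
    for obs in observations:
        job = key(obs["job"])
        for ds in obs.get("inputs", []):
            consumers[key(ds)].add(job)
        for ds in obs.get("outputs", []):
            producers[key(ds)].add(job)

    edges = set()
    for ds, writers in producers.items():
        for writer in writers:
            for reader in consumers.get(ds, ()):
                edges.add((writer, ds, reader))
    return sorted(edges)


graph_edges = weave(observations)
print(graph_edges)
# [('scheduler/load_orders', 'warehouse/raw.orders', 'batch/clean_orders')]
```

The join works only because both platforms name the shared dataset identically, which is why a consistent naming scheme matters when lineage comes from more than one tool.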

## Further Reading

- [OpenLineage Object Model](https://openlineage.io/docs/spec/object-model/)
- [DataHub Lineage Documentation](https://docs.datahub.com/docs/features/feature-guides/lineage)
- [W3C PROV Overview](https://www.w3.org/TR/prov-overview/)