# Data-Centric AI: The Systematic Engineering of Training Data

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR
Data-centric AI treats training data as an engineered system rather than a fixed input. It focuses on label quality, coverage, data selection, data debugging, and evaluation design.

## Core Explanation
Model-centric work changes architectures and hyperparameters while holding the data mostly fixed. Data-centric work holds the model mostly fixed and improves the process that produces the data: defining labels precisely, finding noisy examples, balancing coverage across classes and conditions, selecting which examples to label next, and checking whether a dataset actually supports the intended task.
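One of the steps above, selecting examples to label, can be illustrated with a minimal sketch. It assumes a model has already produced class probabilities for a pool of unlabeled examples and ranks them by prediction entropy (uncertainty sampling), which is one common heuristic and not the only selection strategy. The function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pred_probs, budget):
    """Return indices of the `budget` most uncertain examples, most uncertain first."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs for four unlabeled examples (three classes).
unlabeled_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, little to gain from a label
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
]

picked = select_for_labeling(unlabeled_probs, budget=2)
print(picked)  # [3, 1]
```

Entropy ranking tends to favor ambiguous examples near decision boundaries. In practice it is often combined with a diversity criterion so that the labeling budget is not spent on near-duplicates.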

## Detailed Analysis
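The approach is practical because many deployed models fail because of data issues: ambiguous labels, missing edge cases, distribution shift, leakage between training and evaluation sets, duplicates, or biased sampling. Benchmarks such as DataPerf, which score data-centric tasks rather than model architectures, and methods such as confident learning, which estimates likely label errors from a model's out-of-sample predicted probabilities, make data quality more measurable and repeatable.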
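The thresholding idea behind confident learning can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it assumes `pred_probs` are out-of-sample probabilities (for example, from cross-validation), that every class has at least one labeled example, and it skips the joint-estimation and pruning steps of the full method. The function names and toy data are hypothetical.

```python
def class_thresholds(labels, pred_probs, n_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    thresholds = []
    for k in range(n_classes):
        scores = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # Assumption: every class appears at least once in `labels`.
        thresholds.append(sum(scores) / len(scores))
    return thresholds

def find_label_issues(labels, pred_probs):
    """Flag examples that are confidently predicted as a class
    other than their given label."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            issues.append(i)
    return issues

# Toy data: example 2 is labeled 0, but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

issues = find_label_issues(labels, pred_probs)
print(issues)  # [2]
```

Using per-class thresholds instead of a single global cutoff lets the check adapt to classes the model is systematically more or less confident about. Flagged examples are candidates for human review rather than automatic relabeling.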

## Further Reading
- A Survey of Data-Centric AI
- DataPerf
- Confident Learning

## Related Articles

- [AI for Data Curation: Web-Scale Filtering, Deduplication, and Quality Scoring for LLM Training](../ai-for-data-curation.md)
- [AI Training Data Curation: Quality at Scale](../ai-training-data-curation.md)
- [Large Language Model Training: Scaling Laws, Data Curation, and Compute](../large-language-model-training-scaling-laws-data-curation-and-compute.md)