Data-Centric AI: The Systematic Engineering of Training Data
Status: public · Confidence: medium (0.82) · Basis: verified_sources
## TL;DR

Data-centric AI treats training data as an engineered system rather than a fixed input. It focuses on label quality, coverage, data selection, data debugging, and evaluation design.

## Core Explanation

Model-centric work changes architectures and hyperparameters while holding data mostly fixed. Data-centric work improves the data process: defining labels, finding noisy examples, balancing coverage, selecting examples to label, and evaluating whether a dataset supports the intended task.

## Detailed Analysis

The approach is practical because many deployed models fail from data issues: ambiguous labels, missing edge cases, distribution shift, leakage, duplicates, or biased sampling. Benchmarks such as DataPerf and methods such as confident learning make data quality more measurable and repeatable.

## Further Reading

- A Survey of Data-Centric AI
- DataPerf
- Confident Learning

## Related Articles

- [AI for Data Curation: Web-Scale Filtering, Deduplication, and Quality Scoring for LLM Training](../ai-for-data-curation.md)
- [AI Training Data Curation: Quality at Scale](../ai-training-data-curation.md)
- [Large Language Model Training: Scaling Laws, Data Curation, and Compute](../large-language-model-training-scaling-laws-data-curation-and-compute.md)
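
## Example: Flagging Likely Label Errors

The Detailed Analysis mentions confident learning as a way to make label quality measurable. The sketch below is a simplified illustration of that idea, not the reference implementation or any library's API. It assumes you already have out-of-sample predicted probabilities for each example (for instance from cross-validation). The function names `class_thresholds` and `flag_label_issues` and the toy data are hypothetical. Each class gets a threshold equal to the mean predicted probability of that class over the examples labeled with it. An example is flagged when the class it confidently belongs to differs from its given label.

```python
from collections import defaultdict


def class_thresholds(labels, probs):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, p in zip(labels, probs):
        sums[y] += p[y]
        counts[y] += 1
    return {k: sums[k] / counts[k] for k in counts}


def flag_label_issues(labels, probs):
    """Return indices of examples whose most confident class
    (among classes that clear their threshold) is not the given label."""
    thresholds = class_thresholds(labels, probs)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Classes whose predicted probability clears the class threshold.
        confident = [k for k, t in thresholds.items() if p[k] >= t]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            flagged.append(i)
    return flagged


# Toy data (hypothetical): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident in class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

print(flag_label_issues(labels, probs))  # prints [2]
```

Flagged indices are candidates for human review rather than confirmed errors. A model that is systematically wrong on a class will also produce flags, so the output is best treated as a prioritized relabeling queue.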