Synthetic Data in AI Training

## TL;DR
Synthetic data, meaning training examples generated by other AI models, has emerged as both a powerful scaling technique and a source of new risks. Models such as Phi-4 reach results competitive with much larger models while training largely on synthetic data, but recursively training models on model-generated output risks model collapse.

## Core Explanation
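Synthetic data generation typically combines three techniques: (1) teacher generation, where a stronger model produces diverse examples; (2) curriculum learning, where synthetic problems get progressively harder; and (3) self-play, where a model generates and then solves its own problems. Generated data is then screened with three quality checks: diversity (no duplicates), accuracy (answers verified where possible), and domain coverage (no topic left underrepresented).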
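The generate-then-filter loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not any lab's actual pipeline: `toy_teacher` is a hypothetical stand-in for a real teacher model (it emits arithmetic problems and is deliberately wrong about 20% of the time), and `is_accurate` plays the role of a trusted verifier. Real pipelines would replace both with model calls and domain-specific checkers.

```python
import random
from collections import Counter

def toy_teacher(rng, difficulty):
    """Hypothetical stand-in for a teacher model: emits one arithmetic problem.
    It returns a wrong answer ~20% of the time to mimic hallucinated labels."""
    hi = 10 ** difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    if rng.random() < 0.2:  # simulated teacher error
        answer += 1
    return {"question": f"{a} {op} {b}", "answer": answer,
            "domain": op, "difficulty": difficulty}

def is_accurate(example):
    """Accuracy check: recompute the answer independently of the teacher."""
    a, op, b = example["question"].split()
    a, b = int(a), int(b)
    truth = {"+": a + b, "-": a - b, "*": a * b}[op]
    return truth == example["answer"]

def build_dataset(n_per_level=200, levels=(1, 2, 3), seed=0):
    rng = random.Random(seed)
    seen, kept = set(), []
    for difficulty in levels:  # curriculum: easy problems first, then harder
        for _ in range(n_per_level):
            example = toy_teacher(rng, difficulty)
            if not is_accurate(example):     # accuracy filter
                continue
            if example["question"] in seen:  # diversity filter: exact duplicates
                continue
            seen.add(example["question"])
            kept.append(example)
    return kept

dataset = build_dataset()
# Domain coverage: how many kept examples each operator contributes.
coverage = Counter(example["domain"] for example in dataset)
print(len(dataset), dict(coverage))
```

Exact-match deduplication is the simplest possible diversity check; near-duplicate detection (for example, embedding similarity) is the usual next step and is omitted here to keep the sketch self-contained.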

## Detailed Analysis
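Phi-4 demonstrated that synthetic data can partly compensate for smaller model size: at 14B parameters, it matches models around 70B on several reasoning-focused benchmarks. DeepSeek-R1 used synthetic reasoning traces for distillation, fine-tuning smaller models on traces generated by the larger one. The main risks are model collapse (each round of training on model-generated output loses more of the original distribution, starting with rare cases), hallucination contamination (teacher errors become training labels), and benchmark leakage (synthetic examples that resemble test sets inflate scores).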
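Model collapse is easiest to see in a deliberately tiny setting. The sketch below is an assumption-laden toy, not a reproduction of any published experiment: the "model" is just the empirical token distribution of its training corpus, and each generation trains only on the previous generation's samples. Because a token that is never sampled can never reappear, the number of distinct tokens can only stay flat or shrink, and rare tokens vanish first.

```python
import random

def simulate_collapse(vocab_size=100, corpus_size=200, generations=20, seed=0):
    """Track vocabulary size when each generation trains only on the last one."""
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    # Zipf-like "real" distribution: a few common tokens and a long rare tail.
    weights = [1.0 / (rank + 1) for rank in vocab]
    corpus = rng.choices(vocab, weights=weights, k=corpus_size)
    distinct = [len(set(corpus))]
    for _ in range(generations):
        # "Training" is implicit: the model is the corpus's empirical distribution.
        # "Generation" draws a fresh corpus from it, replacing the old data entirely.
        corpus = rng.choices(corpus, k=corpus_size)
        distinct.append(len(set(corpus)))
    return distinct

history = simulate_collapse()
print("distinct tokens per generation:", history)
```

Mixing fresh real data into each generation, instead of replacing the corpus outright, is the obvious variation to try: in this toy, it lets lost tokens re-enter the distribution.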

## Further Reading
- Microsoft Research: Phi Series
- Nature: AI Model Collapse
- NeurIPS: Synthetic Data Workshop