## TL;DR
AI document understanding extracts structured data from unstructured documents such as invoices, contracts, medical records, and tax forms. The field has moved from traditional OCR pipelines, which only recognize characters, to end-to-end vision-language models that interpret meaning. Current LLMs can interpret page layout, extract key fields, and reason across tables and text.
## Core Explanation
Document understanding pipeline (traditional): a document image or PDF passes through six stages.

1. Layout analysis: detect text blocks, tables, figures, and headers, using object detection models such as DETR or layout-aware backbones such as LayoutLMv3.
2. OCR: recognize the text within each block (Tesseract, Google Cloud Vision OCR, TrOCR).
3. Reading order: reconstruct the logical text flow from the 2D layout.
4. Information extraction: apply NER (named entity recognition) to pull out specific fields such as invoice number, date, and total amount.
5. Table extraction: detect table structure (rows, columns, merged cells) and extract cell contents.
6. Validation: apply business rules to check that the extracted data is consistent.

Modern approaches: end-to-end document understanding with vision-language models. The model takes the document image as input and directly outputs structured data (JSON or a dictionary), learning the pipeline stages jointly rather than as separate components.
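
Stages 3, 4, and 6 of the traditional pipeline can be illustrated with a minimal sketch. It assumes OCR has already produced text blocks with top-left coordinates, and it substitutes regular expressions for a trained NER model to stay self-contained. The `Block` type, the field patterns, the single-column layout assumption, and the validation rules are all hypothetical choices for illustration, not part of any library.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation


@dataclass
class Block:
    """A recognized text block (hypothetical OCR output)."""
    text: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels


def reading_order(blocks, line_tolerance=5.0):
    """Order blocks top-to-bottom, then left-to-right within a line.

    Assumes a single-column page. A block joins the current line when its
    y differs from the line's first block by at most line_tolerance.
    """
    lines = []
    for block in sorted(blocks, key=lambda b: b.y):
        if lines and abs(block.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [b for line in lines for b in sorted(line, key=lambda b: b.x)]


# Regex stand-ins for an NER model (illustrative patterns only).
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"\bInvoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"\bTotal\s*:?\s*\$?([\d,]+\.\d{2})", re.I),
}


def extract_fields(text):
    """Return the first match for each field, or None when absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def validate(fields):
    """Apply simple business rules and return a list of error messages."""
    errors = [f"missing field: {n}" for n in FIELD_PATTERNS if not fields.get(n)]
    if fields.get("date"):
        try:
            datetime.strptime(fields["date"], "%Y-%m-%d")
        except ValueError:
            errors.append("date is not a valid YYYY-MM-DD date")
    if fields.get("total"):
        try:
            if Decimal(fields["total"].replace(",", "")) <= 0:
                errors.append("total must be positive")
        except InvalidOperation:
            errors.append("total is not a number")
    return errors


if __name__ == "__main__":
    page = [
        Block("Total: $100.50", 10, 120),
        Block("Date: 2024-03-15", 300, 12),
        Block("Invoice No: INV-1042", 10, 10),
    ]
    text = "\n".join(b.text for b in reading_order(page))
    fields = extract_fields(text)
    print(fields, validate(fields))
```

Keeping each stage as a separate function mirrors the main appeal of the modular pipeline: when a field comes out wrong, the intermediate output of each stage can be inspected on its own. Multi-column pages would need a more careful reading-order heuristic than the one sketched here.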
## Detailed Analysis
Models: LayoutLMv3 (Microsoft, 2022) combines text, layout (2D position embeddings), and visual features in a unified transformer architecture. Donut (NAVER, 2022) introduced OCR-free document understanding: it maps document images directly to structured output without explicit text recognition, using a Swin Transformer encoder and a BART decoder. Pix2Struct (Google, 2023) extends this approach to diverse inputs, including screenshots and figures.

Trade-offs: a 2024 arXiv survey identifies the key trade-off. Modular pipelines are more interpretable and debuggable but require extensive engineering; end-to-end VLMs are simpler to deploy but harder to fix when they make systematic errors.

Cloud platforms: Google Document AI offers specialized processors for invoices, receipts, W-2s, bank statements, and identity documents, plus a Custom Extractor for novel document types. AWS intelligent document processing (IDP) combines Textract (OCR and table extraction) with Comprehend (entity recognition) and Augmented AI (human review).

Key applications: accounts payable automation (invoice to ERP), insurance claims processing, mortgage underwriting, medical record digitization, and contract analysis.

LLM-augmented pipelines (2025-2026): a model such as GPT-4 or Gemini serves as the final reasoning layer. OCR output is combined with a prompt ("Extract the following fields from this invoice: vendor name, date, total, line items"), and the model returns structured JSON. The 2026 OCR-to-VLM shift is the next step: models such as Gemini 3 Flash and Qwen3-VL process documents natively as images, without a separate OCR stage.
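
The LLM-augmented step can be sketched as prompt construction plus defensive parsing of the model's reply. This example deliberately avoids any vendor SDK: `call_model` is a hypothetical callable that takes a prompt string and returns the model's text, and the field names and helper functions are assumptions made for illustration.

```python
import json
from typing import Callable

# Field names are illustrative; adapt them to the target schema.
FIELDS = ["vendor_name", "date", "total", "line_items"]

FENCE = "`" * 3  # markdown code fence that models often wrap JSON in


def build_prompt(ocr_text: str, fields: list[str]) -> str:
    """Combine the extraction instruction with the OCR output."""
    return (
        f"Extract the following fields from this invoice: {', '.join(fields)}.\n"
        "Respond with a single JSON object using exactly those keys; "
        "use null for any field that is not present.\n\n"
        f"Invoice text:\n{ocr_text}"
    )


def parse_response(raw: str, fields: list[str]) -> dict:
    """Parse the reply into a dict holding exactly the requested keys."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Missing keys become None; unexpected keys are discarded.
    return {name: data.get(name) for name in fields}


def extract_invoice(ocr_text: str, call_model: Callable[[str], str]) -> dict:
    """Run OCR text through the model and return structured fields."""
    return parse_response(call_model(build_prompt(ocr_text, FIELDS)), FIELDS)


if __name__ == "__main__":
    # Stand-in for a real model call, used only to exercise the code path.
    def fake_model(prompt: str) -> str:
        return '{"vendor_name": "ACME Corp", "total": "100.50"}'

    print(extract_invoice("ACME Corp\nTotal: 100.50", fake_model))
```

Passing the model as a callable keeps the parsing logic testable without network access and makes it straightforward to swap providers. In practice the parsed result would still go through the same kind of business-rule validation used in a traditional pipeline, since a model can return well-formed JSON with wrong values.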
## Further Reading
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft)
- Donut: OCR-free Document Understanding Transformer (NAVER)
- Hugging Face Document AI Models