RAG Document Layout and Table Extraction

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

RAG over PDFs and scanned documents needs layout and table evidence (reading order, table cells, headings, page coordinates), not just plain-text chunks.

## Core Explanation

Document RAG often fails when extraction loses reading order, table cells, section headers, captions, or page coordinates. A chunk can be semantically close to the query but still omit the row, column, or heading that makes the answer correct.
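As a minimal sketch of that failure mode, the snippet below serializes the same small table two ways. The table contents and the helper names `flatten` and `rows_with_headers` are hypothetical and exist only for illustration; they are not part of any extraction service's API.

```python
# Hypothetical table, shaped the way a layout-aware extractor might return it.
header = ["Region", "Q1 revenue", "Q2 revenue"]
rows = [
    ["EMEA", "1.2", "1.4"],
    ["APAC", "0.9", "1.1"],
]


def flatten(header, rows):
    """Plain-text extraction: cells joined in reading order, structure lost."""
    cells = header + [cell for row in rows for cell in row]
    return " ".join(cells)


def rows_with_headers(header, rows):
    """One chunk per row, with every value paired with its column header."""
    return ["; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in rows]


flat_chunk = flatten(header, rows)
row_chunks = rows_with_headers(header, rows)

print(flat_chunk)     # Region Q1 revenue Q2 revenue EMEA 1.2 1.4 APAC 0.9 1.1
print(row_chunks[1])  # Region: APAC; Q1 revenue: 0.9; Q2 revenue: 1.1
```

In the flattened chunk nothing ties `1.1` to APAC or to Q2, so a retriever can return it for a query about APAC's Q2 revenue without the answer being recoverable. The per-row chunks keep that binding at the cost of repeating the headers.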

Agents should store the page number, bounding regions, detected tables, cell coordinates (row and column indices), merged-cell spans, headings, and extraction model version alongside each chunk. That evidence lets a citation point back to a specific document region instead of a flattened text fragment.
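One way to carry that evidence through a pipeline is a small record attached to every chunk. The sketch below is an assumption about what such a record could look like: `CellEvidence`, `format_citation`, the field names, the normalized bounding-box convention, and the example values are all hypothetical, not a schema defined by any of the cited services.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CellEvidence:
    """Hypothetical per-chunk evidence record (field names are assumptions)."""
    doc_id: str
    page: int
    # Assumed convention: (left, top, width, height), normalized to 0..1.
    bbox: Tuple[float, float, float, float]
    text: str
    heading: Optional[str]
    table_id: Optional[str]
    row: Optional[int]
    col: Optional[int]
    row_span: int = 1
    col_span: int = 1
    model_version: str = "unknown"


def format_citation(ev: CellEvidence) -> str:
    """Build a human-readable pointer to the region the text came from."""
    parts = [ev.doc_id, f"p.{ev.page}"]
    if ev.heading:
        parts.append(f'section "{ev.heading}"')
    if ev.table_id is not None and ev.row is not None and ev.col is not None:
        parts.append(f"table {ev.table_id} r{ev.row}c{ev.col}")
    left, top, width, height = ev.bbox
    parts.append(f"bbox=({left:.2f},{top:.2f},{width:.2f},{height:.2f})")
    return ", ".join(parts)


ev = CellEvidence(
    doc_id="report.pdf",
    page=4,
    bbox=(0.12, 0.40, 0.20, 0.03),
    text="1.1",
    heading="Revenue by region",
    table_id="t1",
    row=3,
    col=3,
    model_version="layout-example-v1",  # placeholder, not a real model name
)
citation = format_citation(ev)
print(citation)
```

Keeping the record immutable (`frozen=True`) is a design choice in this sketch: evidence that backs a citation should not change after indexing, and a re-extraction with a different model version would produce a new record instead.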

## Source-Mapped Facts

- Microsoft Document Intelligence documentation says the Layout model extracts text, tables, selection marks, and structure information from documents. ([source](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0))
- Amazon Textract documentation says table extraction returns information about tables detected in a document. ([source](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html))
- Amazon Textract documentation distinguishes table cell blocks from merged cell blocks in table extraction output. ([source](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html))
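The cell versus merged-cell distinction can be illustrated with a short sketch that rebuilds a grid from block-style output. The `blocks` list below is hand-written sample data, not real service output, and the field names (`BlockType`, `RowIndex`, `ColumnIndex`, `RowSpan`, `ColumnSpan`, `Relationships`) are used here on the assumption that they match the Textract block schema; check them against the linked documentation before relying on them. `build_grid` is a hypothetical helper.

```python
def build_grid(blocks):
    """Return ({(row, col): text}, [merged-cell spans]) from block dicts."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Collect ids from CHILD relationships; blocks may have none.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell):
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    grid, merged = {}, []
    for b in blocks:
        if b["BlockType"] == "CELL":
            grid[(b["RowIndex"], b["ColumnIndex"])] = cell_text(b)
        elif b["BlockType"] == "MERGED_CELL":
            # A merged cell points at the underlying cells it covers.
            texts = [cell_text(by_id[i]) for i in child_ids(b)]
            merged.append({
                "row": b["RowIndex"],
                "col": b["ColumnIndex"],
                "row_span": b["RowSpan"],
                "col_span": b["ColumnSpan"],
                "text": " ".join(t for t in texts if t),
            })
    return grid, merged


# Hand-written sample: a header cell spanning two columns above "Q1" and "Q2".
blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Revenue"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Q1"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Q2"},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 1,
     "RowSpan": 1, "ColumnSpan": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12"]}]},
]

grid, merged = build_grid(blocks)
print(grid)
print(merged)
```

Without the merged-cell record, position (1, 2) looks like an empty header and "Q2" loses its "Revenue" parent; with it, a retrieval chunk for that column can carry both labels.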

## Further Reading

- [Microsoft Document Intelligence Layout Model](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0)
- [Amazon Textract Tables](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html)