RAG Document Content Hashes and Reindex Triggers
Status: public · Confidence: medium (0.685) · Basis: verified_sources
## TL;DR

Document hashes and reindex triggers let RAG agents explain whether a stale answer comes from unchanged source text, missed ingestion, or an invalidated embedding pipeline.

## Core Explanation

RAG systems can silently drift when source documents change but the derived chunks, embeddings, or vector records do not. A content hash provides a stable signal for deciding whether a document should be skipped, reprocessed, or upserted.

Agents should track document IDs, content hashes, chunk hashes, parser version, transformation pipeline version, embedding model, vector namespace, deletion policy, and last successful ingestion run. Without this evidence, a suggested reindex may be too broad, too narrow, or unsafe for tenant-isolated corpora.

## Source-Mapped Facts

- LlamaIndex documentation says each node and transformation combination in an `IngestionPipeline` is hashed and cached. ([source](https://developers.llamaindex.ai/python/framework/module_guides/loading/ingestion_pipeline/))
- LlamaIndex documentation says attaching a docstore enables document management for an ingestion pipeline. ([source](https://developers.llamaindex.ai/python/framework/module_guides/loading/ingestion_pipeline/))
- LlamaIndex documentation says document management stores a map from `doc_id` to `document_hash`, and that a document whose `doc_id` has already been seen is reprocessed when its hash has changed. ([source](https://developers.llamaindex.ai/python/framework/module_guides/loading/ingestion_pipeline/))
- LangChain API documentation describes `RecordManager` as an abstraction that keeps track of record writing and document indexing. ([source](https://api.python.langchain.com/en/latest/core/indexing/langchain_core.indexing.base.RecordManager.html))

## Further Reading

- [LlamaIndex Ingestion Pipeline](https://developers.llamaindex.ai/python/framework/module_guides/loading/ingestion_pipeline/)
- [LangChain RecordManager API Reference](https://api.python.langchain.com/en/latest/core/indexing/langchain_core.indexing.base.RecordManager.html)
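
## Illustrative Sketch

The skip / reprocess decision described under Core Explanation can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the LlamaIndex or LangChain implementation: `IngestRecord`, `content_hash`, and `reindex_decision` are hypothetical names, SHA-256 is an assumed hash choice, and only a subset of the tracked fields (content hash, parser version, pipeline version, embedding model) is compared.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record of what was stored at the last successful ingestion."""
    doc_id: str
    content_hash: str
    parser_version: str
    pipeline_version: str
    embedding_model: str


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text (assumed hash choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reindex_decision(previous: Optional[IngestRecord], current: IngestRecord) -> str:
    """Compare the stored record with the freshly computed one and name the action."""
    if previous is None:
        # The document has never been ingested under this doc_id.
        return "ingest"
    if previous.content_hash != current.content_hash:
        # The source text changed, so chunks and embeddings are stale.
        return "reprocess:content_changed"
    old_pipeline = (previous.parser_version, previous.pipeline_version, previous.embedding_model)
    new_pipeline = (current.parser_version, current.pipeline_version, current.embedding_model)
    if old_pipeline != new_pipeline:
        # The text is unchanged but the derived artifacts were built differently.
        return "reprocess:pipeline_changed"
    return "skip"
```

Returning a reason alongside the action lets an agent distinguish "the source text changed" from "the pipeline was invalidated" when explaining a stale answer. A fuller version would presumably also account for vector namespace and deletion policy before proposing a reindex on tenant-isolated corpora.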