# Data Deduplication and Entity Resolution
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured


## TL;DR

Data deduplication and entity resolution help agents identify when multiple records represent the same person, organization, product, or event.

## Core Explanation

Agents preparing retrieval corpora, customer data, or analytics tables need to check whether duplicate or near-duplicate records are distorting results. Entity resolution compares identifiers, names, addresses, and dates, and applies similarity rules to link records that refer to the same real-world entity.
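
As a rough illustration of combining an identifier check with similarity rules, the sketch below scores a pair of records using only the Python standard library. The field names (`email`, `name`, `address`) and the weights are assumptions made for this example, not part of any tool cited here, and `difflib.SequenceMatcher` stands in for whatever similarity measure a real pipeline would use.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join((text or "").lower().split())


def field_similarity(a, b):
    """Return a 0.0-1.0 similarity ratio; a missing value never counts as a match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a, rec_b):
    """Score a record pair. Field names and weights are illustrative assumptions."""
    email_a = normalize(rec_a.get("email"))
    email_b = normalize(rec_b.get("email"))
    # A shared, non-empty identifier is treated as decisive in this sketch.
    if email_a and email_a == email_b:
        return 1.0
    weights = {"name": 0.6, "address": 0.4}
    return sum(
        weight * field_similarity(rec_a.get(field), rec_b.get(field))
        for field, weight in weights.items()
    )
```

Treating a shared identifier as decisive is a simplification: shared inboxes or recycled identifiers can make it wrong, so a production rule set would likely weigh it alongside the other fields.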

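Agents should report a confidence level for each match and flag ambiguous cases they could not resolve. Both error types are costly: an incorrect merge erases a real distinction (for example, two different customers who share a name), while a missed duplicate pollutes retrieval, metrics, and downstream decisions.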
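
One way to keep confidence and ambiguity visible is to sort candidate pairs into three outcomes instead of two. The sketch below assumes a similarity score between 0.0 and 1.0 is already available for each pair, however it was computed. The thresholds (`merge_at`, `review_at`) and the status labels are hypothetical and would need tuning against labeled data.

```python
from dataclasses import dataclass


@dataclass
class MatchDecision:
    left_id: str
    right_id: str
    score: float
    status: str  # "match", "needs_review", or "distinct"


def decide(left_id, right_id, score, merge_at=0.9, review_at=0.6):
    """Classify one candidate pair; threshold defaults are illustrative only."""
    if score >= merge_at:
        status = "match"
    elif score >= review_at:
        # Ambiguous pairs are surfaced for review rather than merged silently.
        status = "needs_review"
    else:
        status = "distinct"
    return MatchDecision(left_id, right_id, round(score, 3), status)


def summarize(decisions):
    """Count decisions per status so unresolved ambiguity can be reported."""
    counts = {"match": 0, "needs_review": 0, "distinct": 0}
    for decision in decisions:
        counts[decision.status] += 1
    return counts
```

Keeping the score on each decision lets a downstream reader see how close a pair was to a threshold, rather than only the final label.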

## Source-Mapped Facts

- AWS Entity Resolution documentation describes matching and linking related records stored across multiple applications, channels, or data stores. ([source](https://docs.aws.amazon.com/entityresolution/latest/userguide/what-is-service.html))
- Splink documentation describes Splink as a Python package for probabilistic record linkage and deduplication. ([source](https://moj-analytical-services.github.io/splink/index.html))
- OpenRefine documentation describes cluster and edit as grouping similar cell values and editing them together. ([source](https://openrefine.org/docs/manual/cellediting#cluster-and-edit))
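
To make "grouping similar cell values" concrete, the sketch below groups values that reduce to the same normalized key. It is a simplified key-collision idea written for this page, not OpenRefine's implementation. The function names are hypothetical, and the normalization steps (lowercasing, stripping ASCII punctuation, sorting unique tokens) are assumptions chosen for brevity.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a normalized key: lowercase, no ASCII punctuation, sorted unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_values(values):
    """Group distinct values that share a fingerprint; singletons are omitted."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, `cluster_values(["Acme, Inc.", "acme inc", "Inc Acme", "Globex"])` puts the three Acme variants in one cluster and leaves `"Globex"` out. A person or a downstream rule would still decide which spelling to keep.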

## Further Reading

- [AWS Entity Resolution](https://docs.aws.amazon.com/entityresolution/latest/userguide/what-is-service.html)
- [Splink Documentation](https://moj-analytical-services.github.io/splink/index.html)
- [OpenRefine Cluster and Edit](https://openrefine.org/docs/manual/cellediting#cluster-and-edit)