Abstract

In recent years, database systems have proliferated in all types of organizations. In many cases, these databases are developed in different departments and maintained autonomously. Much is to be gained, however, if databases across departments, divisions, or even organizations can be related to one another. A main problem in relating data stored in different databases is that they represent real-world entities differently, for example by using different identifiers or primary keys. We present a decision-theoretic model for matching entities across different databases. The decision to match two entities from two different databases inherently involves some uncertainty, since an exact match may not be found because of errors in data collection, data entry, and data representation. We model this uncertainty using probability theory and propose an integer programming formulation that minimizes the total cost associated with the entity matching decision. The model has been implemented and validated on real-world data.
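
To make the idea of a cost-minimizing matching decision concrete, here is a minimal Python sketch. It is not the formulation from the paper, which the abstract does not spell out. It assumes that a match probability p[i][j] is already available for every pair of records, that a false match and a missed match carry fixed costs, and that each record may be matched at most once. The names best_matching, c_false_match, and c_false_nonmatch are hypothetical, and exhaustive search stands in for an integer programming solver, so it is only practical for tiny inputs.

```python
def best_matching(p, c_false_match=1.0, c_false_nonmatch=1.0):
    """Return (expected_cost, pairs) for the cheapest one-to-one matching.

    p[i][j] is the assumed probability that record i of database A and
    record j of database B denote the same real-world entity.
    Declaring a match costs (1 - p) * c_false_match in expectation;
    declaring a non-match costs p * c_false_nonmatch.
    """
    n_rows = len(p)
    n_cols = len(p[0]) if p else 0

    # Expected cost when every pair is declared a non-match.
    base = sum(p[i][j] * c_false_nonmatch
               for i in range(n_rows) for j in range(n_cols))
    best = (base, [])

    def search(i, used, cost, pairs):
        nonlocal best
        if i == n_rows:
            if cost < best[0]:
                best = (cost, list(pairs))
            return
        # Option 1: leave record i of database A unmatched.
        search(i + 1, used, cost, pairs)
        # Option 2: match record i to a still-free record j of database B.
        for j in range(n_cols):
            if j in used:
                continue
            # Change in expected cost from flipping (i, j) to "match".
            delta = (1 - p[i][j]) * c_false_match - p[i][j] * c_false_nonmatch
            if delta < 0:  # only matches that lower the expected cost
                used.add(j)
                pairs.append((i, j))
                search(i + 1, used, cost + delta, pairs)
                pairs.pop()
                used.remove(j)

    search(0, set(), base, [])
    return best


# Hypothetical probabilities: the most likely single pair is (0, 0),
# but matching (0, 1) and (1, 0) together gives a lower total expected cost.
probabilities = [[0.90, 0.80],
                 [0.85, 0.10]]
cost, pairs = best_matching(probabilities)
print(pairs)  # [(0, 1), (1, 0)]
```

The example shows why the decision is posed as a joint optimization rather than pair by pair: greedily accepting the single most probable pair would block a combination of two matches whose total expected cost is lower.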

Keywords

Database, Computer science, Matching (statistics), Identifier, Probabilistic logic, Representation (politics), Uncertain data, Data mining, Database theory, Integer programming, Database design, Artificial intelligence, Mathematics, Algorithm, Statistics

Publication Info

Year: 1998
Type: article
Volume: 44
Issue: 10
Pages: 1379-1395
Citations: 63
Access: Closed

Cite This

Debabrata Dey, Sumit Sarkar, Prabuddha De (1998). A Probabilistic Decision Model for Entity Matching in Heterogeneous Databases. Management Science, 44(10), 1379-1395. https://doi.org/10.1287/mnsc.44.10.1379

Identifiers

DOI: 10.1287/mnsc.44.10.1379