An Algebraic Approach to Data Quality Metrics for Entity Resolution over Large Datasets
John Talburt (University of Arkansas at Little Rock, USA), Richard Wang (Massachusetts Institute of Technology, USA), Kimberly Hess (CASA 20th Judicial District, USA) and Emily Kuo (Massachusetts Institute of Technology, USA)
Copyright: © 2008
This chapter introduces abstract algebra as a means of understanding and creating data quality metrics for entity resolution, the process in which records determined to represent the same real-world entity are successively located and merged. Entity resolution is a particular form of data mining that is foundational to a number of applications in both industry and government. Examples include commercial customer recognition systems and information sharing on “persons of interest” across federal intelligence agencies. Despite the importance of these applications, most of the data quality literature focuses on measuring the intrinsic quality of individual records than the quality of record grouping or integration. In this chapter, the authors describe current research into the creation and validation of quality metrics for entity resolution, primarily in the context of customer recognition systems. The approach is based on an algebraic view of the system as creating a partition of a set of entity records based on the indicative information for the entities in question. In this view, the relative quality of entity identification between two systems can be measured in terms of the similarity between the partitions they produce. The authors discuss the difficulty of applying statistical cluster analysis to this problem when the datasets are large and propose an alternative index suitable for these situations. They also report some preliminary experimental results, and outlines areas and approaches to further research in this area.