A Grid and Cloud Based System for Data Grouping Computation and Online Service

Wing-Ning Li (University of Arkansas, USA), Donald Hayes (University of Arkansas, USA), Jonathan Baran (University of Arkansas, USA), Cameron Porter (Acxiom Corporation, USA) and Tom Schweiger (Acxiom Corporation, USA)
DOI: 10.4018/978-1-4666-2065-0.ch021

Abstract

Record linkage deals with finding records that identify the same real-world entity, such as an individual or a business, in a given file or set of files. The record linkage problem is also referred to as the entity resolution or record recognition problem. To locate the records that identify the same real-world entity, pairwise record analyses must, in principle, be performed among all records. Analytical operations between two records range from comparing corresponding fields to enhancing records through large knowledge bases and querying large databases. Hence, these operations are complex and time-consuming. To reduce the number of pairwise record comparisons, blocking techniques partition the records into blocks, after which only the records within each block are analyzed against one another. One effective blocking method is the closure approach, where a “related” equivalence relation is used to partition the records into equivalence classes. This paper introduces the closure problem and describes the design and implementation of a parallel and distributed closure prototype system running in an enterprise grid.
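
As a rough illustration of the closure idea, the sketch below groups record ids into equivalence classes from a list of “related” pairs and compares the number of pairwise analyses with and without blocking. It is a minimal single-machine sketch, not the parallel and distributed system the chapter describes. The function names, the use of a union-find structure, and the sample pairs are all assumptions made for this example.

```python
from collections import defaultdict


def closure_blocks(num_records, related_pairs):
    """Partition record ids 0..num_records-1 into equivalence classes.

    related_pairs is a hypothetical input: pairs of record ids that some
    earlier step has declared "related". The closure is computed with a
    union-find (disjoint-set) structure, which is an assumption of this
    sketch, not necessarily the technique used in the chapter.
    """
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    blocks = defaultdict(list)
    for record_id in range(num_records):
        blocks[find(record_id)].append(record_id)
    return sorted(blocks.values())


def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# Six records; 0-1, 1-2 and 4-5 are related, record 3 is related to nothing.
blocks = closure_blocks(6, [(0, 1), (1, 2), (4, 5)])
all_pairs = pair_count(6)
blocked_pairs = sum(pair_count(len(block)) for block in blocks)
print(blocks, all_pairs, blocked_pairs)
```

In this toy input, records 0 and 2 land in the same block even though they are not directly related, because the relation is closed transitively through record 1. Only the pairs inside each block would then go on to the expensive pairwise analysis.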

1. Introduction

A record may be viewed conceptually as consisting of a set of fields. When unique identifiers are unavailable or do not exist in records, determining which records represent the same real-world entity is an important and challenging problem with many applications. For instance, it addresses data quality issues such as “data accuracy, redundancy, consistency, currency and completeness” (Li, Zhang, & Bheemavaram, 2006). Ensuring data quality is becoming a critical issue that impacts organizational performance (Ballou, Wang, & Pazer, 1998; Ballou, 1999; Delone & Mclean, 1992; Redman, 1998). This problem is also referred to in the literature as the record linkage problem (Fellegi & Sunter, 1969; Newcombe, 1988), the data cleaning problem (Do & Rahm, 2002), the object identification problem (Tejada, Knoblock, & Minton, 2001; Tejada, Knoblock, & Minton, 2002), or the entity resolution problem (Benjelloun, Garcia-Molina, Su, & Widom, 2005). All these research efforts deal with the fundamental question of how to effectively identify record “duplicates” when unique identifiers are unavailable or do not exist in records. The main idea is to rely on matching other fields in the records, such as name and address. It is not uncommon for a record in a real data file to have over a hundred fields. Therefore, only a relatively small subset of fields is used to carry out the matching. The set of fields selected is application dependent, and these fields are often referred to as keys.
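
To make the role of keys concrete, the following sketch selects a small subset of fields as keys and lists the record pairs that agree on at least one of them. It is a minimal sketch under stated assumptions: the field names, the sample records, and the matching rule (exact agreement on a key field after trimming whitespace and lowercasing) are hypothetical and are not taken from the chapter.

```python
from collections import defaultdict

# Hypothetical choice of keys; in practice the selection is application dependent.
KEY_FIELDS = ("name", "address")


def key_values(record):
    """Return the normalized (field, value) pairs for the non-empty key fields."""
    return {
        (field, record[field].strip().lower())
        for field in KEY_FIELDS
        if record.get(field)
    }


def candidate_pairs(records):
    """Return sorted pairs of record indices that share at least one key value."""
    index = defaultdict(list)
    for i, record in enumerate(records):
        for key_value in key_values(record):
            index[key_value].append(i)

    pairs = set()
    for ids in index.values():
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                pairs.add((ids[x], ids[y]))
    return sorted(pairs)


# Made-up sample records; "phone" is present but is not used as a key.
records = [
    {"name": "Ann Lee", "address": "12 Oak St", "phone": "555-0101"},
    {"name": "ann lee ", "address": "98 Elm Ave", "phone": ""},
    {"name": "A. Lee", "address": "98 elm ave", "phone": "555-0199"},
    {"name": "Bob Ray", "address": "7 Pine Rd", "phone": "555-0142"},
]
pairs = candidate_pairs(records)
print(pairs)
```

Here records 0 and 1 agree on the name key and records 1 and 2 agree on the address key, while the non-key phone field plays no part in the matching. A real system would likely use richer comparisons than exact string equality.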
