Optimizing the Accuracy of Entity-Based Data Integration of Multiple Data Sources Using Genetic Programming Methods


Yinle Zhou (University of Arkansas at Little Rock, USA), Ali Kooshesh (Sonoma State University, USA) and John Talburt (University of Arkansas at Little Rock, USA)
Copyright: © 2012 | Pages: 11
DOI: 10.4018/jbir.2012010105

Abstract

Entity-based data integration (EBDI) is a form of data integration in which information related to the same real-world entity is collected and merged from different sources. Often the sources do not all agree on a single value for a common attribute. These cases are typically resolved by invoking a rule that selects one of the non-null values presented by the sources. One of the most commonly used selection rules is the naïve selection operator, which chooses the non-null value provided by the source with the highest overall accuracy for the attribute in question. However, the naïve selection operator does not always produce the most accurate result. This paper describes a method for automatically generating a selection operator using methods from genetic programming. It also presents the results of a series of experiments on synthetic data indicating that this method yields a more accurate selection operator than either the naïve or the naïve-voting selection operator.

Background

Entity-based data integration (EBDI) is the process of integrating and rationalizing the collective information associated with the same real-world entities. Each record related to a particular entity may provide only a small portion of the information about that entity, but when combined with the information from other records a more comprehensive picture can emerge. Having multiple values for the same attribute can have both positive and negative implications. When the attribute values agree, confidence that the values are accurate tends to increase. On the other hand, when the values conflict, it raises the question of which value, if any, is correct. The process of resolving these conflicts and deciding which values to keep or discard is sometimes called knowledgebase arbitration (Doerr, 2003; Liberatore, 1995; Revesz, 1993).

The formal description of EBDI extends the algebraic model of entity resolution (ER) proposed by Talburt, Wang, Hess, and Kuo (2007), in which an ER process is defined in terms of an equivalence relation on a set of entity references (Talburt & Hashemi, 2008; Holland & Talburt, 2009; Talburt, 2011). The description begins with the concept of an integration context, which provides an explicit mechanism for describing both entity equivalence (the ER part) and attribute equivalence (the integration part) across a collection of information sources. Both entity and attribute equivalence must be considered when dealing with entity-based integration.

The evaluation of selection operators is best illustrated by example. Table 1 shows an integration context of three sources S1, S2, and S3 for which the entity equivalence relation X creates 10 integration entities. The columns labeled S1, S2, and S3 contain the values contributed by each of these sources for a particular integration attribute that can take on any one of the values “A”, “B”, “C”, “D”, or null. The column labeled True shows the correct value of this attribute for each of the 10 integration entities.

Table 1. Accuracy of sources and selection operators

Entity    True  S1    S2    S3    Naive  Best  Worst
1         B     A     -     -     A      A     A
2         C     -     C     A     C      C     A
3         A     A     A     D     A      A     D
4         B     B     B     C     B      B     C
5         D     C     D     D     C      D     C
6         C     C     B     -     C      C     B
7         D     -     -     D     D      D     D
8         B     -     D     B     D      B     D
9         A     A     -     B     A      A     B
10        B     B     A     C     B      B     A
Accuracy  100%  50%   40%   30%   70%    90%   10%
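
The accuracy figures in Table 1 can be reproduced with a short sketch. It is not the authors' implementation: the function names are hypothetical, `None` stands for the null entries shown as "-", and a null counts as a miss when scoring a source. The naïve operator follows the definition in the abstract (take the non-null value from the most accurate source). The Best and Worst columns are not defined in this excerpt, so `bound_select` is an inference from the table: it prefers a correct (or incorrect) value whenever one is offered and otherwise falls back to the first non-null value.

```python
# Values transcribed from Table 1; None stands for a null ("-") entry.
TRUE = ["B", "C", "A", "B", "D", "C", "D", "B", "A", "B"]
SOURCES = {
    "S1": ["A", None, "A", "B", "C", "C", None, None, "A", "B"],
    "S2": [None, "C", "A", "B", "D", "B", None, "D", None, "A"],
    "S3": [None, "A", "D", "C", "D", None, "D", "B", "B", "C"],
}


def accuracy(values, truth):
    """Fraction of entities whose value equals the true value (nulls are misses)."""
    return sum(v == t for v, t in zip(values, truth)) / len(truth)


def naive_select(sources, truth):
    """Take the non-null value from the source with the highest overall accuracy."""
    ranked = sorted(sources, key=lambda name: accuracy(sources[name], truth),
                    reverse=True)
    result = []
    for i in range(len(truth)):
        candidates = [sources[name][i] for name in ranked
                      if sources[name][i] is not None]
        result.append(candidates[0] if candidates else None)
    return result


def bound_select(sources, truth, want_correct):
    """Assumed reading of the Best (want_correct=True) and Worst (False) columns."""
    result = []
    for i, t in enumerate(truth):
        offered = [vals[i] for vals in sources.values() if vals[i] is not None]
        preferred = [v for v in offered if (v == t) == want_correct]
        pool = preferred or offered  # fall back when no preferred value exists
        result.append(pool[0] if pool else None)
    return result


if __name__ == "__main__":
    for name, values in SOURCES.items():
        print(name, accuracy(values, TRUE))
    print("Naive", accuracy(naive_select(SOURCES, TRUE), TRUE))
    print("Best", accuracy(bound_select(SOURCES, TRUE, True), TRUE))
    print("Worst", accuracy(bound_select(SOURCES, TRUE, False), TRUE))
```

Under this reading, Best and Worst bracket what any selection operator can achieve on these ten entities. The naïve operator, at 70%, falls short of the 90% upper bound, which is the gap a generated operator would try to close. In practice the true values would be known only for a sample, so the source ranking here is an idealization.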
