Data Cleaning Based on Entity Resolution

Copyright: © 2014 | Pages: 22
DOI: 10.4018/978-1-4666-5198-2.ch012

Abstract

Data quality is one of the most prevalent problems in data management. A traditional data management application typically concerns the creation, maintenance, and use of large amounts of data, and focuses only on clean datasets. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or out of date. As a result, the problem of identifying the true facts among large amounts of conflicting information, provided by various Web sites or by different data sources to be integrated, is receiving increasing attention. False data can produce misleading or biased analytical results and decisions, and can lead to loss of revenue, credibility, and customers. Building on the results of entity resolution, truth discovery plays an important role in modern data management applications. In this chapter, the authors review approaches to truth discovery in relation to the central aspects of data quality (i.e., data consistency, data deduplication, data accuracy, data currency, and information completeness).

Introduction

Many data management applications require integrating data from multiple sources (or views), each of which provides a set of values as “facts”. In the following, we do not distinguish between the terms “data sources” and “views”. Different data sources often provide conflicting values, some true and some false. Consider the example in Table 1(a): five data sources provide information on the affiliations of five researchers, and only one of them provides entirely correct data. All five sources provide a set of “facts” about affiliations, but a “fact” is not necessarily the truth, so these sources may give true values as well as false ones. When integrating data from these sources, we have to find the truth (the trustworthy affiliations) for the five researchers.

The World Wide Web has become the most important information source for most people. Unfortunately, there is no guarantee that information on the Web is correct. Moreover, different Web sites often provide conflicting information on the same subject, such as different release dates for an upcoming product. Table 1(b) lists the authors of the book “Rapid Contextual Design” (ISBN: 0123540518) as returned by different online bookstores. Comparing against the image of the book cover, we found that A1 Books provides the most accurate information. In comparison, the information from Powell’s Books is incomplete, and that from Lakeside Books is incorrect. Likewise, we have to extract the truth from the many conflicting “facts” provided by different Web sites.

Table 1(a).
The motivating example: information on the affiliations of researchers.
Researcher    S1      S2        S3     S4     S5
Stonebraker   MIT     Berkeley  MIT    MIT    MS
Dewitt        MSR     MSR       UWisc  UWisc  UWisc
Bernstein     MSR     MSR       MSR    MSR    MSR
Carey         UCI     AT&T      BEA    BEA    BEA
Halevy        Google  Google    UW     UW     UW
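
As an illustration only, the conflict in Table 1(a) can be explored with a naive majority vote over the five sources. This is a minimal sketch and not a method the chapter prescribes: the names `claims`, `majority_vote`, `voted`, and `agree_all` are hypothetical, and the vote assumes that all sources are equally reliable and independent of one another.

```python
from collections import Counter

# Claims transcribed from Table 1(a): researcher -> affiliations
# reported by sources S1..S5, in that order.
sources = ["S1", "S2", "S3", "S4", "S5"]
claims = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def majority_vote(values):
    """Return (value, votes) for the most frequently claimed value."""
    value, votes = Counter(values).most_common(1)[0]
    return value, votes


# Naive baseline (an assumption of this sketch): every source gets one vote.
voted = {name: majority_vote(vals) for name, vals in claims.items()}

# Sources whose claims match the voted value for every researcher.
agree_all = [
    src
    for i, src in enumerate(sources)
    if all(vals[i] == voted[name][0] for name, vals in claims.items())
]

for name, (value, votes) in voted.items():
    print(f"{name}: {value} ({votes}/{len(sources)} sources)")
print("Sources agreeing with every voted value:", agree_all)
```

In this sketch, S3 and S4 report identical values and agree with the vote on every researcher. Since the text states that only one source provides entirely correct data, the majority answer must be wrong for at least one researcher, which suggests why simple voting is not enough when sources differ in reliability or are not independent.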
