Matching Attributes across Overlapping Heterogeneous Data Sources Using Mutual Information

Huimin Zhao
Copyright: © 2010 | Pages: 20
DOI: 10.4018/jdm.2010100105

Abstract

Identifying matching attributes across heterogeneous data sources is a critical and time-consuming step in integrating the data sources. In this paper, the author proposes a method for matching the most frequently encountered types of attributes across overlapping heterogeneous data sources. The author uses mutual information as a unified measure of dependence on various types of attributes. An example is used to demonstrate the utility of the proposed method, which is useful in developing practical attribute matching tools.

Online Bookstore Example

We use an example of heterogeneous databases for illustration: two book catalogs extracted from the Web sites of two leading online bookstores. The catalogs have several corresponding attributes, but most of the attribute names are not displayed on the Web sites. We manually extracted 737 and 722 records from the Web sites of the two stores, respectively. Tables 1 and 2 show some sample entries of the two catalogs. The attribute names were manually assigned to facilitate discussion; they are not used by the attribute matching method proposed in this paper and have no effect on its result. The two catalogs share a common key, the ISBN, and 702 records in the two samples match on it. Note that the attribute referred to as “Author” may contain multiple authors for a book. The catalogs reflect what we can directly observe on the Web sites of the bookstores. We do not attempt to speculate upon the actual schemas of the backend databases hidden in the “deep Web” (He & Chang, 2006; Su et al., 2006; Wang et al., 2004); we analyze the fields displayed on the Web sites as they are.
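To make the setting concrete, the following is a minimal sketch, not the procedure proposed in the paper, of how records from two catalogs could be paired on a shared key and how the empirical mutual information between attribute columns could then be computed. It assumes every attribute is treated as a discrete value; the function names (`mutual_information`, `pair_by_key`, `score_attribute_pairs`), the field names, and the toy records are all hypothetical and do not come from the bookstore data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, of two aligned lists of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))        # counts of co-occurring value pairs
    px, py = Counter(xs), Counter(ys)   # marginal counts
    # I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def pair_by_key(catalog_a, catalog_b, key):
    """Pair up records from the two catalogs that share the same key value."""
    index_b = {rec[key]: rec for rec in catalog_b}
    return [(rec, index_b[rec[key]]) for rec in catalog_a if rec[key] in index_b]


def score_attribute_pairs(pairs, attrs_a, attrs_b):
    """Mutual information for every (attribute of A, attribute of B) combination."""
    return {
        (a, b): mutual_information([ra[a] for ra, _ in pairs],
                                   [rb[b] for _, rb in pairs])
        for a in attrs_a
        for b in attrs_b
    }


# Made-up toy records (assumption: not taken from the catalogs in the paper).
store_a = [
    {"isbn": "isbn-1", "binding": "H", "year": "2001"},
    {"isbn": "isbn-2", "binding": "P", "year": "2001"},
    {"isbn": "isbn-3", "binding": "H", "year": "2002"},
    {"isbn": "isbn-4", "binding": "P", "year": "2002"},
    {"isbn": "isbn-5", "binding": "H", "year": "2003"},  # no counterpart in store_b
]
store_b = [
    {"isbn": "isbn-1", "cover": "hardback"},
    {"isbn": "isbn-2", "cover": "paperback"},
    {"isbn": "isbn-3", "cover": "hardback"},
    {"isbn": "isbn-4", "cover": "paperback"},
    {"isbn": "isbn-6", "cover": "hardback"},  # no counterpart in store_a
]

pairs = pair_by_key(store_a, store_b, "isbn")
scores = score_attribute_pairs(pairs, ["binding", "year"], ["cover"])
```

In this toy data, `binding` and `cover` encode the same property with different labels, so their mutual information over the four paired records is 1 bit, while `year` and `cover` are independent and score 0. Because mutual information depends only on how values co-occur, not on the labels themselves, a score of this kind does not require the two sources to use the same attribute names or value encodings.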
