Matching Attributes across Overlapping Heterogeneous Data Sources Using Mutual Information

Huimin Zhao
DOI: 10.4018/978-1-61350-471-0.ch017


Identifying matching attributes across heterogeneous data sources is a critical and time-consuming step in integrating them. In this paper, the author proposes a method for matching the most frequently encountered types of attributes across overlapping heterogeneous data sources. The author uses mutual information as a unified measure of dependence between attributes of various types. An example demonstrates the proposed method, which is useful in developing practical attribute matching tools.

Online Bookstore Example

We use an example of heterogeneous databases for illustration: two book catalogs extracted from the Web sites of two leading online bookstores. The catalogs have several corresponding attributes, but most of the attribute names are not displayed on the Web sites. We manually extracted 737 and 722 records from the Web sites of the two stores, respectively. Tables 1 and 2 show some sample entries of the two catalogs. The attribute names were manually assigned to facilitate discussion; they are not used by the attribute matching method proposed in this paper and have no effect on its result. The two catalogs share a common key, the ISBN, and 702 records in the two extracted samples match on it. Note that the attribute referred to as “Author” may contain multiple authors for a book. This is what we can directly observe at the Web sites of the bookstores. We do not attempt to speculate upon the actual schemas of the backend databases hidden in the “deep Web” (He & Chang, 2006; Su et al., 2006; Wang et al., 2004); we analyze the fields displayed at the Web sites as they are.
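
To make the setup concrete, the following is a minimal sketch of how overlapping records might be paired on the ISBN and how mutual information could then score candidate attribute pairs. It is an illustration under stated assumptions, not the method of this chapter: the records, the field names (`binding`, `pages_band`, `f1`, `f2`), and the helpers `join_on_key` and `mutual_information` are all hypothetical, and the plug-in estimator shown handles only discrete values, whereas the chapter addresses several attribute types.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def join_on_key(cat_a, cat_b, key):
    """Pair up records from two catalogs that share the same key value."""
    index_b = {rec[key]: rec for rec in cat_b}
    return [(rec, index_b[rec[key]]) for rec in cat_a if rec[key] in index_b]


# Hypothetical toy records; the values are invented for illustration only.
catalog_a = [
    {"isbn": "111", "binding": "Hardcover", "pages_band": "long"},
    {"isbn": "222", "binding": "Paperback", "pages_band": "short"},
    {"isbn": "333", "binding": "Hardcover", "pages_band": "short"},
    {"isbn": "444", "binding": "Paperback", "pages_band": "long"},
    {"isbn": "555", "binding": "Hardcover", "pages_band": "long"},
]
# The second catalog has no meaningful attribute names, mirroring the
# situation where names are not displayed on the Web site.
catalog_b = [
    {"isbn": "111", "f1": "HC", "f2": "in stock"},
    {"isbn": "222", "f1": "PB", "f2": "in stock"},
    {"isbn": "333", "f1": "HC", "f2": "backorder"},
    {"isbn": "444", "f1": "PB", "f2": "backorder"},
    {"isbn": "666", "f1": "HC", "f2": "in stock"},
]

# Only records present in both catalogs contribute to the estimate.
pairs = join_on_key(catalog_a, catalog_b, "isbn")

# Score every cross-catalog attribute pair by its estimated mutual information.
scores = {}
for a_attr in ("binding", "pages_band"):
    for b_attr in ("f1", "f2"):
        xs = [a[a_attr] for a, _ in pairs]
        ys = [b[b_attr] for _, b in pairs]
        scores[(a_attr, b_attr)] = mutual_information(xs, ys)

for attr_pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(attr_pair, round(score, 3))
```

In this toy data, `binding` and `f1` encode the same information under different labels, so their pair receives the highest score even though the names and value formats differ. This is the sense in which a dependence measure can suggest matches without relying on attribute names.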
