A Scalable Algorithm for One-to-One, Onto, and Partial Schema Matching with Uninterpreted Column Names and Column Values

A Scalable Algorithm for One-to-One, Onto, and Partial Schema Matching with Uninterpreted Column Names and Column Values

Boris Rabinovich (Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel) and Mark Last (Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel)
Copyright: © 2014 |Pages: 16
DOI: 10.4018/JDM.2014100101
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

In this paper, the authors propose a five-step approach to the problem of identifying semantic correspondences between attributes of two database schemas. It is one of the key challenges in many database applications such as data integration and data warehousing. The authors' research is focused on uninterpreted schema matching, where the column names and column values are uninterpreted or unreliable. The approach implements Bayesian networks, Pearson's correlation and mutual information to identify inter-attribute dependencies. Additionally, the authors propose an extension to their algorithm that allows the user to manually enter the known mappings to improve the automated matching results. The five-step approach also allows data privacy preservation. The authors' evaluation experiments show that the proposed approach enhances the current set of schema matching techniques.
Article Preview

Introduction

Manual solutions to schema matching involve examining all tables in the integrated databases and identifying semantically corresponding attributes in each pair of a source table and a target table. These manual processes are prone to errors, limited in their scalability and often too slow to meet the requirement of fast information integration. The goal of automatic and semi-automatic schema matching techniques is to cope with these difficulties in such database applications as schema integration, data warehousing, etc.

There are two main approaches to automatic schema matching, namely the schema-based and the instance-based approach. The schema-based matching considers the schema information like attribute names, data types, relationship types, integrity constraints, and data descriptions. When the column names and values are uninterpreted, the instance-based schema matching approach (Liang, 2008) can identify the semantic correspondence of schema attributes by analyzing the instance data without any schema-level information. A deep insight into the content of schema elements such as data range, data domain and statistics is very helpful for uncovering semantic correspondences. To identify matching columns, which use different encodings of logically similar domains, one can also utilize the inter-attribute dependencies in each table.

The following three types of cardinality constraints should be considered in schema matching:

  • One-to-One Mapping ([1,1] – [1,1], in UML Notation): Each attribute in the source table A has a unique match in the target table B and vice versa. This applies to mapping between tables having the same number of attributes. Consequently, the problem simply involves finding a correspondence between the attributes;

  • Onto Mapping ([0,1] – [1,1]): Each attribute in A has a unique match in B while each attribute in B either has a unique match in A or remains unmatched. This corresponds to cases where we know that table A’s attributes are a subset of table B’s. Hence, we have to discover this subset and then to map its attributes;

  • Partial Mapping ([0,1] – [0,1]): Each attribute in A either has a unique match in B or remains unmatched and vice versa. This corresponds to the most general and difficult cases, where we do not know which attributes of A map to B or even how many attributes of A map to B. In this case, we need to find the best subset of attributes of A to be mapped to B as well as the best mapping for this subset.

Zhao (2010) uses information-theoretic metrics to identify matching attributes of the most common types across two databases. The method application scope is limited to one-to-one mapping between data sources that contain some amount of overlapping records.

Kang and Naughton (2003, 2008) proposed a two-step schema-matching algorithm for the uninterpreted matching domain using inter-attribute dependencies. In the first step, they measure dependencies by calculating mutual information between all attributes. In the second step, they find matching node pairs in the dependency graphs by running a graph-matching algorithm. In their first article, Kang & Naughton (2003) execute graph matching using a naïve exhaustive search, which is impractical, in terms of time complexity, for schemas with a large number of attributes. In their second article, Kang & Naughton (2008) proposed several heuristic algorithms, but the sub-problem they discuss is schema matching between tables with the same number of attributes, where each attribute in Table A has a unique match in Table B and vice versa (one-to-one mapping). A faster schema-matching algorithm using dependencies between discrete and continuous attributes is applied in our previous work (Rabinovich and Last, 2011) to the one-to-one and onto mapping sub-problems.

Complete Article List

Search this Journal:
Reset
Open Access Articles
Volume 28: 4 Issues (2017): 3 Released, 1 Forthcoming
Volume 27: 4 Issues (2016)
Volume 26: 4 Issues (2015)
Volume 25: 4 Issues (2014)
Volume 24: 4 Issues (2013)
Volume 23: 4 Issues (2012)
Volume 22: 4 Issues (2011)
Volume 21: 4 Issues (2010)
Volume 20: 4 Issues (2009)
Volume 19: 4 Issues (2008)
Volume 18: 4 Issues (2007)
Volume 17: 4 Issues (2006)
Volume 16: 4 Issues (2005)
Volume 15: 4 Issues (2004)
Volume 14: 4 Issues (2003)
Volume 13: 4 Issues (2002)
Volume 12: 4 Issues (2001)
Volume 11: 4 Issues (2000)
Volume 10: 4 Issues (1999)
Volume 9: 4 Issues (1998)
Volume 8: 4 Issues (1997)
Volume 7: 4 Issues (1996)
Volume 6: 4 Issues (1995)
Volume 5: 4 Issues (1994)
Volume 4: 4 Issues (1993)
Volume 3: 4 Issues (1992)
Volume 2: 4 Issues (1991)
Volume 1: 2 Issues (1990)
View Complete Journal Contents Listing