In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of the matching. Based on this measure, the basic idea in this chapter is to estimate a range for the similarity of two records and to determine whether they are duplicates according to that estimate. When data integration is performed on XML data, the flexibility of XML raises many additional problems; one current approach is to use Data Exchange to carry out the above operations. Beyond data integrity and reliability, this chapter also proposes the concept of quality assurance mechanisms.
Introduction
Information integration systems have been widely used. When integrating data from heterogeneous and autonomous data sources, multiple records representing the same entity are often obtained. Such records are called duplicate records. In an information integration system, duplicate records not only cause data redundancy, wasting network bandwidth and storage space, but also present users with many useless duplicate results. To improve the efficiency and usability of information integration systems, it is necessary to eliminate duplicate records. Duplicate record detection, which partitions the record set into clusters such that each cluster represents a single entity, is the first step of duplicate record elimination.
Because of its wide range of applications, many techniques have been proposed for duplicate record detection. (Elmagarmid et al. 2007) surveys the early work in this area. Some recent work (Bilenko et al. 2003, Chandel et al. 2007, Chaudhuri et al. 2003, Cohen et al. 2000) adopts textual similarity functions such as edit distance or cosine similarity. These methods assume that records corresponding to the same entity are textually similar, and they are indeed useful for catching most kinds of typographic errors. The major drawback of textual similarity is that it can be misleading in the presence of heterogeneous expressions (e.g., J. Smith is short for both John Smith and Jane Smith). Segmentation-based methods (Borkar et al. 2001, Sarawagi et al. 2004, Viola et al. 2005) have also been proposed; segmentation partitions a string into its constituent components, which mitigates the problem of heterogeneous expressions to some degree. Other similarity descriptions have been proposed as well: (Cohen et al. 2004) uses a similarity function to address the problem in this setting, and transformation-based and grammar-based methods are proposed in (Arasu et al. 2008) and (Arasu et al. 2009), respectively. However, none of these methods considers that records may have different schemas or attribute orders, so they are impractical for duplicate detection on records from heterogeneous data sources.
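As a concrete illustration of the heterogeneous-expression problem, the following Python sketch computes the classic Levenshtein edit distance; the edit_distance helper and the name strings are illustrative choices of ours, not taken from the cited work. A purely textual measure scores J. Smith equally close to two different entities, so no similarity threshold can separate them.

    def edit_distance(a: str, b: str) -> int:
        # Classic Levenshtein distance by dynamic programming over two rows.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    # "J. Smith" is equidistant from two distinct entities:
    print(edit_distance("J. Smith", "John Smith"))   # 3
    print(edit_distance("J. Smith", "Jane Smith"))   # 3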
To represent the similarity of records with heterogeneous schemas, optimal bipartite graph matching has been adopted (the naïve Optimal Bipartite Matching Based method, OBMAB) (Mohan et al. 2009). A pair of records is modeled as a bipartite graph: each node is an attribute, and each edge connects a pair of attributes, one from each record, weighted by the similarity between that pair of attributes. Under this model, the similarity of two records is defined as the weight of the optimal matching in the corresponding bipartite graph. This definition applies to any pair of records, even with different schemas; since the compared records need not have the same number or order of attributes, the method is well suited to detecting duplicate records from heterogeneous data sources. However, each record must be compared with a large number of records before it is assigned to the appropriate cluster, so the method does not scale to large data sets. Moreover, a record is not placed in a cluster if even a single record in that cluster fails to meet the similarity requirement with it. This makes the method overly sensitive and results in low recall, which is the ratio of the detected duplicate records to all the duplicate records.
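The matching-based similarity itself is straightforward to render in code. The sketch below is a minimal illustration using SciPy's Hungarian-algorithm implementation (linear_sum_assignment) to find a maximum-weight matching; the attr_sim function, the example records, and the absence of any normalization are assumptions of ours, not the exact definition in (Mohan et al. 2009).

    from difflib import SequenceMatcher

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def attr_sim(a: str, b: str) -> float:
        # Placeholder attribute similarity in [0, 1]; any measure can be plugged in.
        return SequenceMatcher(None, a, b).ratio()

    def record_similarity(r1: list[str], r2: list[str]) -> float:
        # Build the weight matrix of the bipartite graph whose node sets are the
        # attributes of r1 and r2, then take the weight of the optimal matching.
        w = np.array([[attr_sim(a, b) for b in r2] for a in r1])
        rows, cols = linear_sum_assignment(w, maximize=True)
        return float(w[rows, cols].sum())

    # The two records describe the same entity under different schemas and
    # attribute orders; the optimal matching pairs up corresponding attributes.
    r1 = ["John Smith", "Harvard University", "Boston"]
    r2 = ["Boston", "J. Smith", "Harvard Univ."]
    print(record_similarity(r1, r2))

Because linear_sum_assignment handles rectangular weight matrices, the two records need not have the same number of attributes. Take the following case as an example.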