The growth of the Internet has simplified data access, which has been accompanied by an increase in the creation of new data sources. Despite this increase, most of these large data repositories are still accessed manually. The problem is aggravated by the heterogeneous nature and extreme volatility of the information on the Web. This heterogeneity is of three types: intentional (differences in the contents), semantic (differences in the interpretation), and schematic (differences in data types, labeling, structures, etc.). The growing volume of available information, and the complexity of handling it, have prompted a considerable amount of research into heterogeneous data integration. The database community, one of the most important groups dealing with data heterogeneity and dispersion, has provided a wide range of solutions to this problem. The information retrieval and knowledge representation communities have also addressed the issue and offered solutions, making this area a connection point between the three communities.
Traditional approaches to heterogeneous data integration try to resolve semantic and schematic heterogeneity using solutions based on rich data models. These data models represent the relationships between distributed and heterogeneous data sources. Although most traditional systems deal with a small number of structured data sources, more recent approaches deal with a larger number of data sources (both structured and unstructured).
Data integration systems are formally defined as a triple <G,S,M>, where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that relates queries over the source schemas to queries over the global schema. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users send queries to the data integration system, they pose those queries over G, and the mapping then asserts connections between the elements in the global schema and the source schemas.
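The triple <G,S,M> can be illustrated with a minimal sketch. It assumes one particular style of mapping, in which each global relation is defined as a view over the sources; the text above does not prescribe this choice. All names here (IntegrationSystem, person_view, the employees and customers sources, and their attributes) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Sources = Dict[str, List[Row]]


@dataclass
class IntegrationSystem:
    """Illustrative <G,S,M> triple (hypothetical names, not a standard API)."""
    global_schema: Dict[str, Tuple[str, ...]]            # G: relation -> attributes
    source_schemas: Dict[str, Tuple[str, ...]]           # S: source -> attributes
    mapping: Dict[str, Callable[[Sources], List[Row]]]   # M: global relation -> view over S

    def query(self, relation: str, sources: Sources,
              predicate: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        # The user asks over G; the mapping rewrites the request over the sources.
        rows = self.mapping[relation](sources)
        return [row for row in rows if predicate(row)]


def person_view(sources: Sources) -> List[Row]:
    # Assumed mapping assertion: person(name, city) is the union of two sources
    # whose labels differ (schematic heterogeneity).
    rows = [{"name": r["name"], "city": r["city"]} for r in sources["employees"]]
    rows += [{"name": r["full_name"], "city": r["town"]} for r in sources["customers"]]
    return rows


system = IntegrationSystem(
    global_schema={"person": ("name", "city")},
    source_schemas={
        "employees": ("name", "dept", "city"),
        "customers": ("full_name", "town"),
    },
    mapping={"person": person_view},
)

sources = {
    "employees": [{"name": "Ana", "dept": "IT", "city": "Madrid"}],
    "customers": [
        {"full_name": "Luis", "town": "Madrid"},
        {"full_name": "Eva", "town": "Lisbon"},
    ],
}

in_madrid = system.query("person", sources, lambda r: r["city"] == "Madrid")
print(in_madrid)
# [{'name': 'Ana', 'city': 'Madrid'}, {'name': 'Luis', 'city': 'Madrid'}]
```

The user only ever names the global relation person; which sources hold the data, and under which labels, is hidden inside the mapping.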
Key Terms in this Chapter
Mediator: A system that filters information from one or more data sources, which are usually accessed through wrappers. Its main goal is to allow users to pose complex queries over heterogeneous sources as if they were a single source, using an integration schema. Mediators offer user interfaces for querying the system based on the integration schema. They transform user queries into a set of subqueries that are solved by other software components (the wrappers), which encapsulate the data sources’ capabilities.
Wrapper: An interface to a data source that translates data into a common data model used by the mediator. The user accesses the data sources through one or several mediator systems that present high-level abstractions (views) of combinations of source data. The user does not know where the data come from but is able to retrieve the data by using a common mediator query language.
Ontology: A logical theory accounting for the intended meaning of a formal vocabulary (i.e., its ontological commitment to a particular conceptualization of the world). The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models.
Data Integration: The problem of combining data from multiple heterogeneous data sources and providing a unified view of these sources to the user. This unified view is structured according to a global schema. Issues addressed by a data integration system include specifying the mapping between the global schema and the sources and processing queries expressed on the global schema.
Ontology Mapping: Given two ontologies, A and B, mapping one ontology to the other means that, for each concept (node) in ontology A, we try to find a corresponding concept (node) in ontology B that has the same or similar semantics, and vice versa.
Semantic Web: An extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Berners-Lee et al. (2001) stated that, in the context of the Semantic Web, the word semantic meant “machine-processable.” They explicitly ruled out the sense of natural language semantics. For data, the semantics convey what a machine can do with those data.
Semantic Integration: In semantic integration, sources export not only their logical schema but also their conceptual model to the mediator, thus exposing their concepts, roles, classification hierarchies, and other high-level semantic constructs to the mediator. Semantic integration allows information sources to export their schema at an appropriate level of abstraction to the mediator.
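The interplay between the Mediator and Wrapper terms defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference architecture: the class names (TupleWrapper, DocumentWrapper, Mediator), the fetch method, and the sample data are all hypothetical, and the common data model is assumed to be a flat record with name and city fields.

```python
from typing import Dict, Iterable, List, Protocol, Tuple

Record = Dict[str, str]  # assumed common data model used by the mediator


class Wrapper(Protocol):
    """Hypothetical wrapper interface: answer a simple field = value subquery."""
    def fetch(self, field: str, value: str) -> List[Record]: ...


class TupleWrapper:
    """Wraps a source that stores (name, city) tuples."""
    def __init__(self, rows: Iterable[Tuple[str, str]]) -> None:
        self._rows = list(rows)

    def fetch(self, field: str, value: str) -> List[Record]:
        # Translate source tuples into the common model, then filter.
        records = [{"name": n, "city": c} for n, c in self._rows]
        return [r for r in records if r[field] == value]


class DocumentWrapper:
    """Wraps a source whose documents use different labels."""
    RENAME = {"full_name": "name", "town": "city"}

    def __init__(self, docs: Iterable[Dict[str, str]]) -> None:
        self._docs = list(docs)

    def fetch(self, field: str, value: str) -> List[Record]:
        # Rename source-specific labels and drop fields outside the common model.
        records = [
            {self.RENAME[k]: v for k, v in doc.items() if k in self.RENAME}
            for doc in self._docs
        ]
        return [r for r in records if r.get(field) == value]


class Mediator:
    """Sends the same subquery to every wrapper and merges the answers."""
    def __init__(self, wrappers: Iterable[Wrapper]) -> None:
        self._wrappers = list(wrappers)

    def query(self, field: str, value: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self._wrappers:
            results.extend(wrapper.fetch(field, value))
        return results


mediator = Mediator([
    TupleWrapper([("Ana", "Madrid"), ("Eva", "Lisbon")]),
    DocumentWrapper([{"full_name": "Luis", "town": "Madrid", "vip": "yes"}]),
])

answers = mediator.query("city", "Madrid")
print(answers)
# [{'name': 'Ana', 'city': 'Madrid'}, {'name': 'Luis', 'city': 'Madrid'}]
```

The caller never sees which source produced each record, which matches the definition of a wrapper: the user retrieves data through the mediator's common query interface without knowing where the data come from.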