Unified Data Model for Large-Scale Multi-Schema Integration (ULMI)

Michael Dietrich (SAP Research Karlsruhe, Germany) and Jens Lemcke (SAP Research Karlsruhe, Germany)
DOI: 10.4018/978-1-4666-0146-8.ch020

Abstract

Current approaches to schema mapping and matching focus on pair-wise comparison of schemas. This chapter gives an overview of how n-way comparison of schemas via a unified data model for large-scale multi-schema integration (ULMI) can benefit schema matching and mapping processes. The approach integrates a set of input schemas into one comprehensive representation, the unified data model, which represents the closure of all integrated schemas. Because the unified data model is too complex and too large, it is never revealed to the user. Instead, the authors derive a canonical data model which represents the most common structure of all schemas. The advantages of the canonical data model are demonstrated in a use case. Finally, challenges for further research are derived. This work is based on excerpts from realistic input schemas, and it provides a concrete, ideal canonical data model as a reference for further research.

Introduction

Software integration is a major challenge. About 40% of all IT budgets is spent on integration (Kastner, 2006). The main reason is a lack of knowledge about the connections between the message schemas that make up the interfaces. The growing number of applications communicating via the Internet amplifies the integration challenge. In this chapter, we present the ULMI approach. ULMI stands for “Unified data model for Large-scale Multi-schema Integration”. With our approach, we operate in the domain of enterprise information integration (EII). However, ULMI extends EII by also addressing inter-company information integration. In particular, ULMI combines the strengths of the following two traditional approaches that, each on their own, solve the integration problem only partially:

For inter-company communication, e-business standards define common message structures. The properties of the approach are summarized in the first column of Table 1. Examples of e-business standards are RosettaNet (RosettaNet, 2011) and CIDX (OAGi, 2008). An e-business standard is defined for a concrete domain, such as RosettaNet for the high-tech industry and CIDX for the chemical industry. Inside the domain, every company adapts the e-business standard to fit its individual objectives. To be adaptable, an e-business standard is under-specified and consists of many optional fields to cover all potentially relevant aspects. Concrete mappings never involve the standard itself. Instead, mappings always connect two companies’ interpretations of the standard. Since e-business standards are domain-specific and under-specified, a multitude of different standards and interpretations exists. Therefore, reusing mapping knowledge for future integration projects is difficult.

Table 1. Approaches with central data model

| | e-Business standard | Canonical data model | Unified data model |
|---|---|---|---|
| Scope | Whole business domain | Single company | Multiple business domains |
| Completeness | Covers all aspects | Restricted to aspects relevant for communication | Covers all aspects |
| Level of detail | Under-specified | Maximum detail | Maximum detail |
| Mappings between… | Schema and schema | Schema and CDM | Schema and schema |

Key Terms in this Chapter

(Leaf, Intermediary) Correspondence: Is an equivalence relation on (leaf, non-leaf) nodes of the schemas. In contrast to a mapping element, a correspondence has no direction. The correspondences define equivalence classes of schema nodes. The equivalence classes are an important ingredient for the unified and the canonical data models described in this chapter.
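To illustrate how undirected correspondences induce equivalence classes, the following is a minimal sketch using a union-find structure. The node names (such as `A/Order/Id`) and the function `equivalence_classes` are hypothetical and chosen only for illustration; the chapter preview does not prescribe a particular algorithm.

```python
from collections import defaultdict


def equivalence_classes(nodes, correspondences):
    """Group schema nodes into equivalence classes with union-find.

    `nodes` is an iterable of node identifiers; `correspondences` is an
    iterable of undirected pairs (a, b) of corresponding nodes.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the class representative, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in correspondences:
        # Merge the two classes; direction is irrelevant for the result.
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for n in nodes:
        classes[find(n)].add(n)
    # Sort for a deterministic, readable result.
    return sorted(sorted(c) for c in classes.values())


# Hypothetical leaves from three schemas A, B, and C.
nodes = ["A/Order/Id", "B/PO/Number", "C/Purchase/Ref", "A/Order/Date"]
correspondences = [
    ("A/Order/Id", "B/PO/Number"),
    ("B/PO/Number", "C/Purchase/Ref"),
]
classes = equivalence_classes(nodes, correspondences)
```

Because correspondences are treated as an equivalence relation, `A/Order/Id` and `C/Purchase/Ref` end up in the same class even though no correspondence relates them directly.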

Canonical Data Model (CDM): Is a subgraph of the conflict-free UDM graph. The canonical data model is a tree and can be understood as a new schema. The canonical data model contains the leaves of a set of selected schemas. The canonical data model follows the most common structure among the selected schemas.
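One naive reading of "most common structure" is a majority vote over parent nodes: for each equivalence class, pick the parent class under which it is nested most often across the selected schemas. The sketch below illustrates only this reading. It is an assumption made for illustration, not the authors' derivation procedure, and the names `most_common_parents`, `schemas`, and `label_of` are hypothetical.

```python
from collections import Counter


def most_common_parents(schemas, label_of):
    """For each class label, return the parent label seen most often.

    `schemas` is a list of schemas, each given as a list of
    (parent_node, child_node) edges; `label_of` maps every node to the
    label of its equivalence class.
    """
    votes = {}
    for edges in schemas:
        for parent, child in edges:
            counter = votes.setdefault(label_of[child], Counter())
            counter[label_of[parent]] += 1
    return {
        child: counter.most_common(1)[0][0]
        for child, counter in votes.items()
    }


# Hypothetical example: two schemas nest City under Address, one under Customer.
schemas = [
    [("A.Address", "A.City")],
    [("B.Address", "B.City")],
    [("C.Customer", "C.City")],
]
label_of = {
    "A.Address": "Address", "A.City": "City",
    "B.Address": "Address", "B.City": "City",
    "C.Customer": "Customer", "C.City": "City",
}
preferred = most_common_parents(schemas, label_of)
```

In this example `City` is placed under `Address`, the nesting used by two of the three schemas. A real derivation would additionally have to guarantee that the result is a tree and would need a rule for ties, neither of which this sketch attempts.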

Mapping: Relates leaves of one schema to the leaves of another schema. In particular, a script that transforms a message conforming to one schema to a message conforming to the other schema comprises a mapping. This chapter takes mappings as a given and expects a mapping to be correct but likely incomplete.

UDM Graph: Is a graph that results from merging the corresponding nodes of the schemas according to the UDM. The UDM graph contains cycles if corresponding nodes are conflictingly nested in the schemas. A cycle allows for paths through the UDM graph on which properties are nested in a way that cannot be observed in any schema.
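The way conflicting nesting produces a cycle can be sketched as follows. Schemas are represented as lists of parent-child edges, corresponding nodes are merged by replacing each node with its class label, and a depth-first search checks the merged graph for cycles. The representation and the names `build_udm_graph`, `has_cycle`, and `label_of` are assumptions for this illustration.

```python
def build_udm_graph(schemas, label_of):
    """Merge corresponding nodes by replacing each node with its class label."""
    graph = {}
    for edges in schemas:
        for parent, child in edges:
            p, c = label_of[parent], label_of[child]
            graph.setdefault(p, set()).add(c)
            graph.setdefault(c, set())
    return graph


def has_cycle(graph):
    """Detect a directed cycle with a three-colour depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {n: WHITE for n in graph}

    def visit(n):
        state[n] = GREY
        for m in graph[n]:
            if state[m] == GREY:
                return True  # back edge: m is still on the current path
            if state[m] == WHITE and visit(m):
                return True
        state[n] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in graph)


# Hypothetical schemas: A nests Address in Customer, B nests Customer in Address.
schema_a = [("A.Customer", "A.Address")]
schema_b = [("B.Address", "B.Customer")]
label_of = {
    "A.Customer": "Customer", "A.Address": "Address",
    "B.Customer": "Customer", "B.Address": "Address",
}

conflicting = has_cycle(build_udm_graph([schema_a, schema_b], label_of))
single = has_cycle(build_udm_graph([schema_a], label_of))
```

Merging both schemas yields the edges Customer → Address and Address → Customer, so a path such as Customer/Address/Customer exists in the graph although no schema nests properties that way.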

Unified Data Model (UDM): Consists of a set of schemas, the correspondences of the schemas’ nodes, and a unique label for each equivalence class.

Interface: For the sake of this chapter, an interface defines how a software system can be communicated with electronically. An interface consists of at least one schema.

Schema: Describes the structure of data for communication. In this chapter, a schema is represented as a tree. Each leaf of a schema represents semantically unique, atomic data to be communicated. The non-leaves structure the data.
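As a small illustration of this tree view, a schema can be modelled as nested dictionaries in which empty dictionaries mark leaves. The `order` schema and the helper `leaf_paths` below are hypothetical examples, not taken from the chapter.

```python
def leaf_paths(schema, prefix=()):
    """Return the slash-separated paths of all leaves of a schema tree."""
    paths = []
    for name, children in schema.items():
        path = prefix + (name,)
        if children:
            # Non-leaf node: it only structures the data, so descend further.
            paths.extend(leaf_paths(children, path))
        else:
            # Leaf node: atomic data to be communicated.
            paths.append("/".join(path))
    return paths


# Hypothetical purchase order schema; empty dicts are leaves.
order = {
    "Order": {
        "Id": {},
        "Buyer": {"Name": {}, "City": {}},
    }
}
leaves = leaf_paths(order)
```

Here `Order` and `Buyer` are non-leaves, while `Order/Id`, `Order/Buyer/Name`, and `Order/Buyer/City` are the leaves that a mapping would relate to the leaves of another schema.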

Conflict-Free UDM Graph: Is a UDM graph where no conflictingly nested elements are merged. The conflict-free UDM graph is a directed acyclic graph. Every nesting of properties that occurs in the conflict-free UDM graph can be found in at least one schema.
