Reverse Engineering from an XML Document into an Extended DTD Graph

Herbert Shiu (City University of Hong Kong, Hong Kong) and Joseph Fong (City University of Hong Kong, Hong Kong)
DOI: 10.4018/978-1-60960-521-6.ch006


Extensible Markup Language (XML) has become a standard for persistent storage and data interchange via the Internet due to its openness, self-descriptiveness and flexibility. This paper proposes a systematic approach to reverse engineer arbitrary XML documents into their conceptual schema, an Extended DTD Graph, which is a DTD Graph with data semantics. The proposed approach not only determines the structure of the XML document, but also derives candidate data semantics from the XML element instances by treating each XML element instance as a record in a table of a relational database. One application of the determined data semantics is to verify the linkages among elements. Implicit referential linkages among XML elements are modeled by the parent-children structure, and explicit ones by ID/IDREF(S). As a result, an arbitrary XML document can be reverse engineered into its conceptual schema in an Extended DTD Graph format.
Chapter Preview


As Extensible Markup Language (XML) (Bray, 2004) has become the standard document format, users increasingly have to deal with XML documents of different structures. If the schema of the XML documents is given as a Document Type Definition (DTD) (Bosak, 1998), or can be derived from the XML documents directly (Kay, 1999; Moh, 2000), it is easier to study their contents. However, these schema formats are hard to read and not user-friendly.

XML has become the common format for storing and transferring data between software applications and even business parties, as most software applications can generate or handle XML documents. For example, a common scenario is that XML documents are generated from the data stored in a relational database, and there have been various approaches for doing so (Thiran, 2004; Fernandez, 2001). XML documents generated this way can be very large. Most probably, these documents are kept in persistent storage for backup purposes, as XML is an ideal format that any software application can process in the future.

To handle the above scenario, XML element instances in an XML document can be treated as individual entities, and the relationships among the different XML element types can be determined by reverse engineering them into conceptual models, such as Extended DTD Graphs with data semantics. Users can then better understand the contents of the XML document, and further operations on it become possible, such as storing and querying (Florescu, 1999; Deutsch, 1999; Kanne, 2000).
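The idea of treating each XML element instance as an individual record can be illustrated with a small sketch. This is not the chapter's algorithm; it is a minimal illustration using Python's standard `xml.etree.ElementTree`, and the function name `element_tables`, the bookkeeping fields `_key`, `_parent` and `_text`, and the sample document are all assumptions made for the example. Each element type becomes one "table", each instance one row, and the implicit parent-children linkage is recorded as a reference to the parent's row key.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_tables(xml_text):
    """Group element instances by tag name, one record (dict) per instance.

    Hypothetical helper for illustration only: '_key' is a surrogate key
    assigned in document order, '_parent' is the key of the enclosing
    element (None for the root).
    """
    root = ET.fromstring(xml_text)
    tables = defaultdict(list)
    counter = 0

    def visit(elem, parent_key):
        nonlocal counter
        counter += 1
        key = counter
        record = {"_key": key, "_parent": parent_key}
        record.update(elem.attrib)           # attributes become columns
        text = (elem.text or "").strip()
        if text:
            record["_text"] = text           # leading character data, if any
        tables[elem.tag].append(record)
        for child in elem:                   # children point back to this row
            visit(child, key)

    visit(root, None)
    return dict(tables)


# Assumed sample document, not taken from the chapter.
SAMPLE = """<orders>
  <order id="o1"><item sku="a"/><item sku="b"/></order>
  <order id="o2"><item sku="c"/></order>
</orders>"""

tables = element_tables(SAMPLE)
for tag, rows in tables.items():
    print(tag, rows)
```

With the records laid out this way, candidate data semantics such as the cardinality between `order` and `item` could be examined by counting how many `item` rows share each `_parent` value, much as one would with a foreign key in a relational table.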

This paper proposes several algorithms that analyze XML documents to derive their conceptual schema. Two main categories of XML documents exist: data-centric and narrative. As the contents of narrative XML documents, such as DocBook (Bob Stayton, 2008) documents, are mainly unstructured and their vocabulary is basically static, the need to handle them as structured contents and reverse engineer them into conceptual models is far less than for data-centric ones. Therefore, this paper concentrates on data-centric XML documents.
