With the growing importance of XML in data exchange, much research has been done in providing flexible query mechanisms to extract data from XML documents. A core operation for XML query processing is to find all occurrences of a twig pattern Q (or small tree) in a document T. Prior work has typically decomposed Q into binary structural relationships, such as parent-child and ancestor-descendant relations, or root-to-leaf paths. The twig matching is achieved by: (a) matching the binary relationships or paths against XML databases, and (b) using the join algorithms to stitch together all the matching binary relationships or paths. In the worst case, the time for doing joins can be exponential (in the number of query nodes or decomposed paths). In this chapter, we discuss a new algorithm for this task with no path joins involved. The time and space complexities of the algorithm are bounded by O (|T|·Qleaf) and O (Tleaf·Qleaf), respectively, where Tleaf stands for the number of the leaf nodes in T and Qleaf for the number of the leaf nodes in Q. Our experiments show that our method is efficient in supporting twig pattern queries.
In an XML databases, a set of XML documents is maintained. Abstractly, each document can be considered as a tree structure with each node labeled with an element name from a finite alphabet ∑ or a string value, and an edge for the elment-subelement relationship. For instance, the XML document shown in Figure. 1(a) can be represented a tree structure as shown in Figure 1(b).
A sample XML file and its tree representation
Queries in XML query languages, such as XPath (Florescu and Kossman, 1999; World Wide Web Consortium, 2007), XQuery (World Wide Web Consortium, 2001), XML-QL (Dutch, Fernandez, Florescu, 1999), and Quilt (Chamberlin, et al., 1999; Chamberlin, et al., 2000), typically specify patterns of selection predicates on multiple elements that also have some specified tree structured relations. As an example, consider the query tree shown in Figure 3.
This query asks for any node labeled with location (node 2) that is a child of some node labeled with hotel-room-reservation (node 1). In addition, location node (node 2) is the parent of some city node (node 4) and an ancestor of the text ‘Portage Ave.’ (node 5). location (node 3) can also be the parent of some address node (node 7).
The query corresponds to the following XPath expression:
hotel-room[location[city and //‘500 Portage Ave.’]]/location[city and address//‘500 Portage Ave.’].
Key Terms in this Chapter
Node Subsumption: A node labeled with (p’, q’) is said to be subsumed by (p, q),), if p’ > p and q’ < q.
XML Database: A database designed for managing and manipulating XML documents or even more generic SGML documents.
XML Document: A document consisting of an (optional) XML declaration, followed by either an (optional) DTD or XML schema, and then followed by a document element.
Tree Pattern Query: Queries that are based on the path expressions including element tags, attributes, and key words, which are often represented as a tree structure.
XML Schema: An alternative to DTDs. It is a schema language that assesses the validity of a well-formed element and attribute information items within an XML document. There are two major schema models: W3C XML Schema and Microsoft Schema.
Tree Embedding: An embedding f of a tree pattern Q into an XML document T satisfying the following conditions: (i) Preserve node label: For each u ? Q, u and f(u) are of the same label (i.e., label(u) = label(f(u)). (ii) Preserve c/d-child relationships: If u ? v in Q, then f(v) is a child of f(u) in T; if u ? v in Q, then f(v) is a descendant of f(u) in T.
XPath Expression: An expression contains tree node names separated by ‘parent/child’ separators ‘/’ or ‘ancestor/descendant’ separators ‘//’. A node name may also associated with a predicate.
Tree Encoding: A way labeling the nodes of a tree, which can be used to identify the relationships between the nodes, such as parent-child, and ancestor-descendant, as well as left-to-right relationships.