A New Algorithm for Minimizing Tree Pattern Queries

A New Algorithm for Minimizing Tree Pattern Queries

Yangjun Chen (University of Winnipeg, Canada)
Copyright: © 2009 |Pages: 9
DOI: 10.4018/978-1-59904-845-1.ch079
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

XML employs a tree-structured model for representing data. Queries in XML query languages, for example, XPath (World Wide Web Consortium, 1999), XQuery (World Wide Web Consortium, 2001), XML-QL (Deutch, Fernandex, Florescu, Levy, & Suciu, 1999), and Quilt (Chamberlin, Clark, Florescu, & Stefanescu 1999; Chamberlin, Robie, & Florescu, 2000), typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. For instance, the following XPath expression: a[b[c and //d]]/b[c and e//d] asks for any node of type b that is a child of some node of type a. In addition, the b-node is the parent of some c-node and some e-node, as well as an ancestor of some d-node. In general, such an expression can be represented by a tree structure as shown in Figure 1(a). In such a tree pattern, the nodes are types from S ? {*} (* is a wildcard, matching any node type), and edges are parent-child or ancestor-descendant relationships. Among all the nodes of a query Q, one is designated as the output node, denoted by output(Q), corresponding to the output of the query.
Chapter Preview
Top

Introduction

XML employs a tree-structured model for representing data. Queries in XML query languages, for example, XPath (World Wide Web Consortium, 1999), XQuery (World Wide Web Consortium, 2001), XML-QL (Deutch, Fernandex, Florescu, Levy, & Suciu, 1999), and Quilt (Chamberlin, Clark, Florescu, & Stefanescu 1999; Chamberlin, Robie, & Florescu, 2000), typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. For instance, the following XPath expression:

a[b[c and //d]]/b[c and e//d]

asks for any node of type b that is a child of some node of type a. In addition, the b-node is the parent of some c-node and some e-node, as well as an ancestor of some d-node. In general, such an expression can be represented by a tree structure as shown in Figure 1(a).

Figure 1.

A query tree

In such a tree pattern, the nodes are types from ∑ ∪ {*} (* is a wildcard, matching any node type), and edges are parent-child or ancestor-descendant relationships. Among all the nodes of a query Q, one is designated as the output node, denoted by output(Q), corresponding to the output of the query.

In the following discussion, we use τ(v) to denote the type of node v. A parent-child edge is referred to as a c-edge and a c-edge from node v to node u is denoted by vu in the text. Also, u is called a c-child of v. An ancestor-descendant edge is referred to as a d-edge and a d-edge is denoted by vu in the text. u is called a d-child of v. The output node is indicated by “-” in the figures.

In any DAG (directed acyclic graph), a node u is said to be a descendant of a node v if there exists a path (sequence of edges) from v to u. In the case of a TPQ, this path could consist of any sequence of c-edges and/or d-edges.

In terms of Ramanen (1999), an embedding of a tree pattern query (TPQ) Q into an XML document T is a mapping f: QT, from the nodes of Q to the nodes of T, which satisfies the following conditions:

  • i.

    Preserve node type: For each vQ, v and f(v) are of the same type.

  • ii.

    Preserve c/d-child relationships: If vu in Q, then f(u) is a child of f(v) in T; if vu in Q, then f(u) is a descendant of f(v) in T.

Any document T, in which Q can be embedded, is said to contain Q and considered to be an answer. An embedding of Q in T with root(Q) = root(T) is called a root preserving embedding. According to the above definition, more than one node (of the same type) in Q could be mapped to the same node in T. In general, the efficiency of finding the result of a query on a given input database depends on the size of the query. Therefore, it is important to minimize the query before attempting to compute the result of the query. In this article, we propose a new algorithm for this task, which needs O(|Q|2) time and O(|Q|⋅leafQ) space, where leafQ represents the number of the leaf nodes of Q.

Key Terms in this Chapter

XML Document: A document consisting of an (optional) XML declaration, followed by either an (optional) DTD or XML schema, and then followed by a document element.

Tree Pattern Minimization: A process to recognize similar parts in a tree pattern query. All the similar parts will be removed to reduce the size of the query.

Integrity Constraints: A set of rules that the data in an XML document database satisfy, such as required child, required descendent, and type-subtype relationship.

Tree Embedding: A mapping that maps a tree pattern query into an XML document by which the node types, and parent-child and ancestor-descendent relationships are preserved.

Tree Labeling: A method to assign the nodes of a tree a number or a bit string, which reflects some relationshiops among the nodes and can be used to facilitate computation.

Tree Pattern Query: Queries represented as a tree structure, where the nodes are types from ? ? {*} (* is a wildcard, matching any node type), and edges are parent-child or ancestor-descendant relationships. Among all the nodes of a query Q, one is designated as the output node, denoted by output(Q), corresponding to the output of the query.

Document Database: A database designed for managing and manipulating XML documents or even more generic SGML documents.

Complete Chapter List

Search this Book:
Reset