Efficient Querying Distributed Big-XML Data using MapReduce

Song Kunfang (Huazhong University of Science and Technology, Wuhan, China) and Hongwei Lu (Huazhong University of Science and Technology, Wuhan, China)
Copyright: © 2016 | Pages: 10
DOI: 10.4018/IJGHPC.2016070105


MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. This paper proposes an approach that exploits data parallelism in XML processing using MapReduce in Hadoop. The authors' solution integrates data storage, labeling, indexing, and parallel queries to process massive amounts of XML data. Specifically, the authors introduce an SDN labeling algorithm and a distributed hierarchical index using DHTs. More importantly, an advanced two-phase MapReduce solution is designed to efficiently address labeling, indexing, and query processing on big XML data. The experimental results show the efficiency and effectiveness of the proposed parallel XML data approach using Hadoop.
Article Preview


XML processing has been extensively studied in the literature. XML processing operations typically include labeling, indexing, and keyword search, among which labeling and indexing are two important components. Since query semantics are defined using the notion of the lowest common ancestor (LCA), Dewey labeling lies at the heart of existing query algorithms (Xu, Ling, Wu & Bao, 2009). The Dewey label of a node v is the concatenation of the local labels of all its ancestor nodes on the path from the document root to v. Much attention has been paid to keyword search on XML files. Designing efficient query processing methods for keyword search on XML data is challenging, because XML applications require fast query performance to meet the needs of a large number of users. To improve XML processing speed in the MapReduce framework, we design a sequence depth number (SDN) labeling scheme and a flexible indexing model based on distributed hash tables (DHTs).
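To make the role of Dewey labels concrete, the following minimal Python sketch labels a small XML tree and computes the LCA of two nodes as the longest common prefix of their labels. The function names, the sample document, and the convention that the root receives label (1,) are assumptions made for illustration; they are not taken from the paper.

```python
import xml.etree.ElementTree as ET

def dewey_labels(root):
    """Map each element to a Dewey label (a tuple of local positions).

    Assumption: the root is labeled (1,) and the i-th child of a node
    appends i (1-based) to its parent's label.
    """
    labels = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node, start=1):
            labels[child] = labels[node] + (i,)
            stack.append(child)
    return labels

def lca(a, b):
    """The LCA of two nodes is the longest common prefix of their labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# Hypothetical sample document.
doc = ET.fromstring(
    "<lib><book><title>XML</title><author>Lu</author></book>"
    "<book><title>Hadoop</title></book></lib>"
)
labels = dewey_labels(doc)
by_text = {el.text: labels[el] for el in doc.iter() if el.text}

print(by_text["XML"], by_text["Lu"], lca(by_text["XML"], by_text["Lu"]))
```

Because ancestry is encoded in the label itself, the LCA can be found without touching the tree, which is what makes prefix-based labels attractive for query processing.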

This study focuses on XML files that adopt the standard XML format, where each file is modeled as an ordered, rooted, and labeled tree (Quan & Moon, 2001). Each edge represents an element-element relationship or an element-value relationship. Each element is delimited by a pair of start and end tags; elements may have attributes with values. If a keyword k appears at least once in the node name, an attribute name, or the text value of a node v, we say that v directly contains k.
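The notion of direct containment can be sketched as a small predicate. The sketch below assumes case-insensitive matching and whitespace tokenization of text values; neither choice is specified in this excerpt, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def directly_contains(node, keyword):
    """True if the keyword matches the node name, an attribute name,
    or a word of the node's text value.

    Assumptions: matching is case-insensitive and text is split on
    whitespace. Attribute values are not checked, since the definition
    above lists only node names, attribute names, and text values.
    """
    k = keyword.lower()
    words = (node.text or "").lower().split()
    return (
        k == node.tag.lower()
        or any(k == name.lower() for name in node.attrib)
        or k in words
    )

# Hypothetical sample element.
book = ET.fromstring('<book year="2016"><title>Big XML data</title></book>')
title = book[0]

print(directly_contains(book, "year"), directly_contains(title, "xml"))
```

Note that `book` does not directly contain "xml" here: the keyword occurs only in its child `title`, so containment at `book` is indirect.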

To speed up query processing, each node v is usually assigned a label that uniquely identifies it; the labels can be used to compute positional relationships between nodes. Most existing labeling methods use Dewey encoding. In our solution, we use a parallel processing technique to assign each node a sequence depth number (SDN) that is compatible with the XML document order. All labeled nodes are stored in DHTs on the Hadoop distributed file system (HDFS); the tag name is the key, and the text value prefixed with the label is the value.
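The key/value layout described above can be illustrated with a single-machine sketch in which a Python dictionary stands in for the DHT. The exact SDN encoding is not given in this excerpt, so the sketch assumes, purely for illustration, a label of the form "sequence.depth", where the sequence is the node's position in document order and the depth is its level in the tree. The function name and the ":" separator are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(root):
    """Index nodes by tag name; each value is 'label:text'.

    Assumption: the label is '<sequence>.<depth>', an illustrative
    stand-in for an SDN label, not the paper's actual encoding.
    A plain dict of lists stands in for the distributed hash table.
    """
    index = defaultdict(list)
    seq = 0

    def visit(node, depth):
        nonlocal seq
        seq += 1  # pre-order numbering follows XML document order
        text = (node.text or "").strip()
        index[node.tag].append(f"{seq}.{depth}:{text}")
        for child in node:
            visit(child, depth + 1)

    visit(root, 1)
    return index

# Hypothetical sample document.
doc = ET.fromstring(
    "<lib><book><title>XML</title></book>"
    "<book><title>Hadoop</title></book></lib>"
)
index = build_index(doc)
print(index["title"])
```

Keying on the tag name means a lookup for one tag returns every occurrence together with enough label information to order the results by document position.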

More concretely, the contributions of this paper can be summarized as follows:

  • We develop an SDN labeling technique for XML elements stored in the Hadoop distributed file system and construct a flexible indexing model based on DHTs, thereby improving query performance on XML datasets stored in HDFS;

  • We design an efficient query process in the form of two MapReduce jobs and develop the B-SLCA keyword search approach, which uses SDN labels stored in DHTs and retrieves SLCA nodes in a bottom-up manner.
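To illustrate what a bottom-up SLCA search computes, the following single-machine Python sketch finds the smallest lowest common ancestor (SLCA) nodes for a keyword set with one post-order traversal. It is not the authors' B-SLCA algorithm and uses neither MapReduce nor SDN labels; it reports results as Dewey-style tuples with the root labeled (1,), and it assumes case-insensitive, whitespace-tokenized keyword matching. All names are hypothetical.

```python
import xml.etree.ElementTree as ET

def slca(root, keywords):
    """Return labels of nodes whose subtree contains every keyword
    while no child subtree does (the SLCA nodes), in post-order.

    Assumptions: text that follows a child element (mixed content)
    is ignored, and keywords is non-empty.
    """
    keywords = [k.lower() for k in keywords]
    full = (1 << len(keywords)) - 1  # bitmask with one bit per keyword
    results = []

    def visit(node, label):
        # Words this node directly contains: tag, attribute names, text.
        words = {node.tag.lower(), *(a.lower() for a in node.attrib)}
        words.update((node.text or "").lower().split())
        mask = 0
        for i, k in enumerate(keywords):
            if k in words:
                mask |= 1 << i
        child_full = False
        for i, child in enumerate(node, start=1):
            child_mask = visit(child, label + (i,))
            child_full = child_full or child_mask == full
            mask |= child_mask  # propagate matches up to the parent
        if mask == full and not child_full:
            results.append(label)
        return mask

    visit(root, (1,))
    return results

# Hypothetical sample document.
doc = ET.fromstring(
    "<lib><book><title>XML</title><author>Lu</author></book>"
    "<book><title>Hadoop</title><author>Lu</author></book></lib>"
)
print(slca(doc, ["xml", "lu"]))
```

In this sample, the query for "xml" and "lu" returns only the first book rather than the root: the root also covers both keywords, but a smaller subtree already does, which is exactly the distinction between an LCA and an SLCA.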
