State of the Art Technology in Compressing and Querying XML Documents

State of the Art Technology in Compressing and Querying XML Documents

Badya Al-Hamadani (University of Huddersfield, UK) and Joan Lu (University of Huddersfield, UK)
DOI: 10.4018/978-1-4666-1975-3.ch009


Since this research consists of two main parts, the XML compressor and the vague query processor, this chapter discusses the main XML compression techniques in its first part. It will highlight the advantages and disadvantages of these techniques and discusses the differences between them. The second part of this chapter will focus on the vague query processors used to retrieve information from XML documents.
Chapter Preview

1.1 Xml Compression Techniques

Recently, large numbers of XML compression techniques have been proposed. Each of which has different characteristics. This section discusses the differences between these compressors and their main features.

XML compressors can be classified into two classes either to be XML-blind or XML-conscious compressors. XML-blind or general purpose compressors deal with the XML document as a traditional text document ignoring its structure and apply the general purpose text compression techniques to compress them. These techniques can be classified into two main classes (Salomon, 2007), either to be statistical or dictionary based compressors (Augeri et al., 2007; Augeri, 2008). The statistical or the arithmetic compressors represent each string of characters using a fixed number of bits per character. PPM, CACM3, and PAQ are examples of this kind of compressors (Cleary and Witten, 1984; Moffat., 1990; Alistair et al., 1998). On the other hand, dictionary compression techniques substitute each string in the input by its reference in a dictionary maintained by the encoder. WinZip, GZIP, and BZIP2 are examples of this compression class (WinZip, 1990; GZip, 1992; BZip2, 1996).

On the other hand, XML-conscious compressors try to utilize the structural behaviour of XML documents in order to achieve better compression ratio and less time in comparative with the XML-blind type. Table 1 sets the main differences between the two aforementioned compressors types.

Table 1.
The main differences between XML-conscious and XML-blind compressors
XML-conscious compressorsXML-blind compressors
Information about XML documents is usually available in schema which can be optimized by XML-conscious compressors to get better compression.Cannot take advantage of the schema to get useful information about the file.
They utilize the structure of XML document and the type of the data inside.They do not take in consideration the entire file structure or data types.
Some of them abridge the original XML tree in a summary or compact tree for better ratio.They cannot exploit redundancies in the XML tree structure.
Most of them are powerful in compressing small or large files.They do not efficiently compress small files that can be used in transactions for e-business. (Hung, 2009)

The main theory of data compression, which described in (Shannon, 1948), is the formulation of the entropy rate (H) which indicates the limit to lossless data compression. The value of (H) depends on the probability of each symbol in the information source. The most popular entropy value is (Shannon, 1948):

(1) where, is the probability of the symbol

Complete Chapter List

Search this Book: