Re-Ordered FEGC and Block Based FEGC for Inverted File Compression

V. Glory (National Institute of Technology, Tiruchirappalli, Tamil Nadu, India) and S. Domnic (Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India)
Copyright: © 2013 |Pages: 18
DOI: 10.4018/ijirr.2013010105

Abstract

Data compression has been widely used in many Information Retrieval based applications, such as web search engines and digital libraries, to make data retrieval faster. In these applications, universal codes (Elias codes (EC), Fibonacci code (FC), Rice code (RC), Extended Golomb code (EGC), Fast Extended Golomb code (FEGC), etc.) have been preferred over statistical codes (Huffman codes, Arithmetic codes, etc.), since universal codes are easier to construct and decode. In this paper, the authors propose two methods for constructing universal codes based on the ideas used in Rice code and Fast Extended Golomb Code. The first method, Re-ordered FEGC (RFEGC), is suitable for representing small, middle and large range integers, whereas Rice code works well only for small and middle range integers. It also competes with FC, EGC and FEGC in representing small, middle and large range integers, while decoding faster than FC, EGC and FEGC. The second coder, Block based RFEGC, uses a local divisor rather than a global divisor to improve both the compression and decompression performance of RFEGC. To evaluate their coders, the authors applied the methods to compress the integer values of inverted files constructed from the TREC, Wikipedia and FIRE collections. Experimental results show that the proposed coders achieve better compression and decompression performance on files that contain a significant distribution of middle and large range integers.
Article Preview

1. Introduction

Data compression plays a vital role in data transmission and data storage applications owing to its capability of minimizing the amount of data and the transmission time. Compression can be classified into two categories: lossy compression and lossless compression. In lossless compression, the information can be reproduced exactly as the original data, whereas in lossy compression some information loss can occur. Many real-time applications that carry vital information use lossless compression, employing either statistical codes or universal codes.

Codes constructed by statistical methods are called statistical codes, and their construction is based on the probability values of the symbols to be coded. The Huffman (Huffman, 1952) and Shannon-Fano (Salomon, 2000) algorithms are examples of statistical methods. Elias Gamma Code (EC) (Elias, 1975), Elias Delta Code (DC) (Elias, 1975), Golomb-Rice Code (RC) (Golomb, 1966; Rice, 1979), Fibonacci Code (FC) (Fenwick, 2002), Variable Byte Code (VBC) (Salomon, 2007), Nibble Code (NC) (Salomon, 2007), Extended Golomb Code (EGC) (Somasundaram & Domnic, 2007) and Fast Extended Golomb Code (FEGC) (Domnic & Glory, 2012) are universal codes, which do not require the probabilities of the symbols for their construction. Statistical methods normally take more time to encode and decode the symbols of a file than universal codes do. Applications that require fast encoding and decoding of data therefore use universal codes rather than statistical codes. One such application is the Information Retrieval System.
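To make the contrast concrete, the following sketch encodes a positive integer with two of the universal codes named above, Elias gamma and Rice code. These are the standard textbook constructions, not the authors' RFEGC variant; the function names and bit-string representation are illustrative choices.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code: a unary run of zeros giving the length,
    followed by the binary representation of n (n >= 1)."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]                     # binary form of n, no '0b' prefix
    return "0" * (len(binary) - 1) + binary

def rice_encode(n: int, k: int) -> str:
    """Rice code with divisor 2**k: unary quotient, '0' terminator,
    then the remainder in exactly k bits (n >= 0)."""
    q, r = n >> k, n & ((1 << k) - 1)       # quotient and remainder by 2**k
    return "1" * q + "0" + format(r, f"0{k}b")

print(elias_gamma(5))      # -> 00101
print(rice_encode(9, 2))   # -> 11001  (quotient 2, remainder 1)
```

Both codes are decodable without any symbol statistics, which is why universal codes suit index compression: the decoder only needs the code definition, not a model of the data. Rice code's fixed k-bit remainder is what limits it to small and middle range integers, the gap the paper's RFEGC targets.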

An Information Retrieval System (IRS) is an information system used to store items of information that need to be processed, searched and retrieved in response to users' queries (Salton & McGill, 1983). IRS is widely used in many applications such as digital libraries, search engines, e-commerce, electronic news, genomic sequence analysis, etc. (Kobayashi & Takeda, 2000; Williams & Zobel, 2002). Indexing is one of the efficient techniques used to locate data for fast retrieval in an IRS. The most commonly used indexing structure is the inverted index (Zobel, Moffat & Ramamohanarao, 1998), which offers faster query evaluation than signature files (Faloutsos, 1985), PAT trees (Morrison, 1968) and Bitmaps (Chan & Ioannidis, 1998).

An inverted file contains two components:

1. Lexicon file (dictionary) that
   a. Stores the distinct terms (ti) in the collection, arranged in ascending lexicographical order,
   b. Has the document frequency (dft), which indicates the total number of documents in which the term t appears.
2. Inverted list (posting list) that
   a. Stores the identifiers of the documents (idi) in which the term t has occurred,
   b. Has the term frequency (ft), which indicates the number of occurrences of the term t in the document (idi),
   c. Has the associated set of positions (pi) of the term t in the document (idi).
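The two-component structure above can be sketched as a toy index builder. This is a minimal illustration of the data layout (term -> postings of document id, term frequency, and positions), not the paper's implementation; the function name and tokenization by whitespace are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int, list[int]]]]:
    """Build a toy inverted index.

    Returns: term -> posting list of (doc_id, term_frequency, positions),
    with postings ordered by ascending document id.
    """
    index: dict[str, list[tuple[int, int, list[int]]]] = defaultdict(list)
    for doc_id in sorted(docs):                       # ascending doc ids
        positions: dict[str, list[int]] = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)               # word offsets of term
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return dict(index)

docs = {1: "data compression speeds retrieval",
        2: "compression of inverted lists"}
index = build_inverted_index(docs)
print(index["compression"])   # -> [(1, 1, [1]), (2, 1, [0])]
```

Because the document identifiers and positions in each posting list are stored in ascending order, they are typically replaced by the gaps between consecutive values, yielding the small and middle range integers that the universal codes discussed in this paper are designed to compress.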
