Information retrieval is the computational discipline that deals with the efficient representation, organization, and access to information objects that represent natural language texts (Baeza-Yates, & Ribeiro-Neto, 1999; Salton & McGill, 1983; Witten, Moûat, & Bell, 1999). A crucial subproblem in the information retrieval area is the design and implementation of efficient data structures and algorithms for indexing and searching information objects that are vaguely described. In this article, we are going to present the latest developments in the indexing area by giving special emphasis to: data structures and algorithmic techniques for string manipulation, space efficient implementations, and compression techniques for efficient storage of information objects. The aforementioned problems appear in a series of applications as digital libraries, molecular sequence databases (DNA sequences, protein databases [Gusûeld, 1997)], implementation of Web search engines, web mining and information filtering.
Dictionary Data Structures
The dictionary data structure stores a set S of n elements in order to support the operations of insertion, deletion, and the test of membership. A basic criterion for categorizing dictionary data structures is whether only comparisons are used, or the representation of elements for guiding the search is also employed. Typical representatives of the former group are search trees and of the latter tries and hashing. Search trees need O(logn) update/search time and O(n) space and the most prominent examples of them are: AVL-trees, red-black trees, (α,b)-trees, BB[α]-trees and Weight Balanced B-trees (Arge, & Vitter, 1996; Cormen, Leiserson, & Rivest, 1990; Mehlhorn, 1984). On the other hand, tries and hashing structures (Cormen, Leiserson, & Rivest, 1990; Czegh, Havas, & Majewski, 1997; Pagh, 2002) try to use the representation (for example, the value of the element written as a string of digits or the value itself), to compute directly the element’s position in system’s memory. The time and space complexities of these structures generally vary; however, it should be mentioned that a lately developed structure (Anderson & Thorup, 2001) answers both search and update operations in O () time. This structure is also able to retrieve the largest element in the stored set smaller than a query element (predecessor query).
Key Terms in this Chapter
String: A sequence of symbols drawn from an alphabet; differently, it is a series of alphanumeric characters or a series of keywords used to characterize an information object.
Secondary Memory Algorithms: Algorithms for handling information on secondary media, like hard disks, CD-ROMs, etc., and try to minimize the number of page accesses in them.
Compression: The process of encoding information Maintain using fewer information units than a more obvious representation would use, by employing specific encoding schemes that try to exploit the inherent entropy and/or redundancy of the input.
Bioinformatics: A scientific field that stands at the crossroads of Biology and Informatics, and uses techniques from informatics, computer science, applied mathematics, and statistics to solve biological problems. A synonym term is computational biology, but whereas computational biology deals mainly with the scientific filled bioinformatics is usually used to indicate the infrastructure part.
Data Structures: A collection of methods for storing and organizing sets of data in order to facilitate access to them. More formally, data structures are concise implementations of abstract data types, where an abstract data type is a set of objects together with a collection of operations on the elements of the set.
Information Retrieval: The scientific discipline that deals with the representation, organization, storage, and maintenance of information objects and in particular textual objects. The representation and organization of the information items should provide the user with easy access to the relevant information and satisfy the user’s various information needs.
World Wide Web: A distributed hypertext-based information system that operates over the Internet and permits the easy access to available information by employing a special software called a “Web browser”.
Database: A collection of interrelated persistent information stored and organized as a unit in order to serve a specific purpose and satisfy the demands of a set of users. A database can be considered to be an electronic filling system stored on mass-storage systems such as magnetic tape or disk. A database is one component of a database management system.