Indexing and Compressing Text

Indexing and Compressing Text

Ioannis Kouris, Christos Makris, Evangelos Theodoridis, Athanasios Tsakalidis
DOI: 10.4018/978-1-4666-5888-2.ch173
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Chapter Preview

Top

Background

Dictionary Data Structures

The dictionary data structure stores a set S of n elements in order to support the operations of insertion, deletion and the test of membership. A basic criterion for categorizing dictionary data structures is whether only comparisons are used, or the representation of elements for guiding the search is also employed. Typical representatives of the former group are search trees and of the latter tries and hashing. Search trees need O(logn) update/search time and O(n) space and the most prominent examples of them are: AVL-trees, red-black trees, (α,b)-trees, BB[α]-trees and Weight Balanced B-trees (Mehlhorn, 1984; Cormen, Leiserson, & Rivest, 1990; Arge & Vitter, 1996). On the other hand, tries and hashing structures (Cormen, Leiserson, & Rivest, 1990; Pagh, 2002; Czegh, Havas, & Majewski, 1997) try to use the representation (for example the value of the element written as a string of digits or the value itself), to compute directly the element’s position in system’s memory. The time and space complexities of these structures generally vary; however it should be mentioned that a lately developed structure (Andersson & Thorup, 2001) answers both search and update operations in O(978-1-4666-5888-2.ch173.m01) time. This structure is also able to retrieve the largest element in the stored set smaller than a query element (predecessor query).

Key Terms in this Chapter

String: A sequence of symbols drawn from an alphabet, differently it is a series of alphanumeric characters or a series of keywords used to characterize an information object.

Data Structures: A collection of methods for storing and organizing sets of data in order to facilitate access to them. More formally data structures are concise implementations of abstract data types, where an abstract data type is a set of objects together with a collection of operations on the elements of the set.

Information Retrieval: The scientific discipline that deals with the representation, organization, storage and maintenance of information objects and in particular textual objects. The representation and organization of the information items should provide the user with easy access to the relevant information and satisfy the user’s various information needs.

World Wide Web: A distributed hypertext-based information system that operates over the Internet and permits the easy access to available information by employing a special software called a “web browser.”

Database: A collection of interrelated persistent information stored and organized as a unit in order to serve a specific purpose and satisfy the demands of a set of users. A database can be considered to be an electronic filling system stored on mass-storage systems such as magnetic tape or disk. A database is one component of a database management system.

Secondary Memory Algorithms: Algorithms for handling information on secondary media, like hard disks, CD-ROMs etc., and try to minimize the number of page accesses in them.

Compression: The process of encoding information using fewer information units than a more obvious representation would use, by employing specific encoding schemes that try to exploit the inherent entropy and/or redundancy of the input.

Bioinformatics: A scientific field that stands at the crossroads of Biology and Informatics, and uses techniques from informatics, computer science, applied mathematics, and statistics to solve biological problems. A synonym term is Computational Biology, but whereas Computational Biology deals mainly with the scientific filed Bioinformatics is usually used to indicate the infrastructure part.

Complete Chapter List

Search this Book:
Reset