Techniques for Specialized Data Compression

Jakub Swacha (University of Szczecin, Poland)
DOI: 10.4018/978-1-4666-5888-2.ch351

With the emergence of large volumes of data in various domains of research and industry came the need for their compact representation, motivated by both technological and economic reasons. The available general-purpose compression methods often attain sub-optimal compression ratios. Much better ratios can be obtained with specialized data compression techniques, as they can exploit those types of redundancy that general-purpose methods do not. In this article, selected techniques for specialized data compression are reviewed.
Chapter Preview


The base technique of data compression, included in most contemporary compression methods, is statistical coding. It uses statistics of the occurrence of the respective symbols in the input stream to minimize the size of the output stream. The most widely used technique of this kind is Huffman coding, which assigns short codewords to frequent input symbols and long codewords to rare ones (Huffman, 1952). Huffman coding is optimal in the sense that no other assignment of whole-bit codewords could produce a shorter output stream. Further improvement is still possible by assigning value ranges, instead of individual codewords, to input symbols. Such an approach is taken by arithmetic coding, where the entire input stream is encoded as a binary fraction representing its cumulative probability (Rissanen, 1976).
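The idea described above can be sketched in a few lines of Python. This is a minimal illustration of Huffman coding, not an implementation from the chapter; the function and variable names are ours. It repeatedly merges the two least frequent subtrees, so frequent symbols end up near the root with short codewords:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman codeword table for the symbols occurring in `data`."""
    freq = Counter(data)
    # Each heap entry: (subtree frequency, unique tie-breaker, {symbol: codeword})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate input with a single distinct symbol
        (_, _, table), = heap
        return {sym: "0" for sym in table}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every codeword inside them
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(data, table):
    return "".join(table[sym] for sym in data)

def decode(bits, table):
    """Huffman codes are prefix-free, so greedy left-to-right matching suffices."""
    inverse = {code: sym for sym, code in table.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

For example, in `"abracadabra"` the symbol `a` occurs five times and receives a one-bit codeword, while the rare symbols `c` and `d` receive three-bit codewords, so the eleven symbols encode into 23 bits.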

Most real-world data types exhibit some form of correlation between symbols, which cannot be exploited by merely counting the occurrences of individual symbols in the input stream. There are three popular types of solutions to this problem. The first (also historically) exploits the idea of grouping input symbols into phrases, which are then treated much like the other input symbols. One approach of this kind (represented, e.g., by Deflate) uses the already processed part of the input stream as a dictionary (Ziv & Lempel, 1977), whereas another (represented, e.g., by LZW) maintains an explicit phrase dictionary (Ziv & Lempel, 1978). Although solutions of this type do not achieve superior compression ratios, thanks to their modest resource usage they became the standard of general-purpose compression, used commonly in compression utilities such as PKZip, gzip, 7-Zip, or WinRAR, in proprietary container formats such as CDR, CHM, JAR, or SWF, and in various data transmission protocols.
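The second, explicit-dictionary approach (LZW) can be sketched as follows. This is an illustrative Python sketch, not code from the chapter: the dictionary starts with all single characters, and every phrase that is seen but not yet known is registered, so repeated phrases are later replaced by a single code:

```python
def lzw_compress(data):
    """LZW: grow an explicit phrase dictionary while scanning the input."""
    dictionary = {chr(i): i for i in range(256)}  # all single characters
    next_code = 256
    phrase, out = "", []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending the current phrase
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = next_code  # register the new phrase
            next_code += 1
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decompress(codes):
    """Rebuild the same dictionary on the fly; no dictionary is transmitted."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:  # special case: the code was created by the immediately preceding step
            entry = prev + prev[0]
        out.append(entry)
        dictionary[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)
```

On a repetitive input such as `"TOBEORNOTTOBEORTOBEORNOT"`, the 24 input symbols compress to 16 output codes, because phrases like `TO`, `BE`, and `TOB` recur and are emitted as single codes on their later occurrences.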

Key Terms in this Chapter

Lossless Data Compression: Data compression where the decompressed data match exactly the data before compression.

Lossy Data Compression: Data compression where the decompressed data do not exactly match the data before compression.

Specialized Data Compression Technique: Data compression technique designed to handle a specific kind of redundancy found in one or more types of data.

Specialized Data Compression Method: Data compression method designed and useful only for a specific type of data. Specialized data compression methods usually use a combination of specialized data compression techniques and a back-end general-purpose data compression method.

General-Purpose Data Compression Method: Data compression method designed and useful for a wide range of data types.

Compression Ratio: The ratio between the size of data before compression and after compression.

Data Compression: Encoding data in a way that reduces their size.
