This chapter develops a linguistically robust encryption system, LunabeL, which converts a message into syntactically and semantically innocuous text. Drawing upon linguistic criteria, LunabeL uses word replacement, with substitution classes based on traditional linguistic features (syntactic categories and subcategories), as well as features under-exploited in earlier works: semantic criteria, graphotactic structure, and inflectional class. The original message is further hidden through the use of cover texts—within these, LunabeL retains all function words and targets specific classes of content words for replacement, creating text which preserves the syntactic structure and semantic context of the original cover text. LunabeL takes advantage of cover text styles which are not expected to be necessarily comprehensible to the general public, making any semantic anomalies more opaque. This line of work has the promise of creating encrypted texts which are less detectable to human readers than earlier steganographic efforts.
In this chapter we develop Lunabel, a technique for text-based steganography. We refer to our approach more specifically as linguistic steganography, because it takes into account linguistic criteria that past approaches to text-based steganography have not addressed (Bergmair, 2007, and references therein). This allows us to hide information more effectively. In particular, our encrypted messages resemble natural text more closely than was possible in past approaches, which lacked the linguistic sophistication necessary to achieve satisfactory results.
Section 1 introduces the concept of steganography and discusses desiderata for a successful technique. In Section 2, we develop Lunabel and discuss specific choices made in its implementation. Section 3 examines two particularly important choices in detail: the choice of cover text in which to hide information, and the compilation of word substitution classes. In Section 4, we compare Lunabel to past approaches to lexical steganography. Section 5 concludes the chapter.
1.1 What is Steganography?
“Steganography” refers to encryption by means of information hiding. Information can be hidden in any form of data, such as image, audio, or video files. Our interest in this chapter is text-based steganography: hiding a message in what looks like an ordinary piece of text.
1.2 Linguistic Steganography
Ways of hiding information in text have been used since antiquity. One simple method is the acrostic, in which the initial letters of successive lines of poetry spell a word or words. This method is used more for artistic purposes than for secret information exchange; nonetheless, it provides a useful illustration. Consider the following Edgar Allan Poe poem, in which the first letters of successive lines spell the word Elizabeth:
While this form of steganography may be sufficient for poetic use, a practical system has additional requirements. We would want information hiding to be more effective—the hidden information should not be readily visible to an outside observer. Equally important, the system needs to be algorithmic rather than creative; it should be possible to hide any given message in any desired text. Finally, decryption too needs to be algorithmic: given a text containing a hidden message, the hidden message should be reliably recoverable by a recipient in possession of the required decryption information (the acrostic poem presented satisfies this last requirement, but none of the others).
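The one requirement the acrostic does meet, algorithmic recovery, can be illustrated with a minimal Python sketch. The function name `read_acrostic` and the four sample lines are hypothetical and chosen only for illustration; they are not taken from the poem discussed above.

```python
def read_acrostic(text: str) -> str:
    """Recover an acrostic by joining the first letter of each non-empty line."""
    return "".join(
        line.lstrip()[0]              # first visible character of the line
        for line in text.splitlines()
        if line.strip()               # skip blank lines
    )


# Hypothetical four-line text whose initial letters spell a word.
sample = """Here the river bends
In the quiet of dusk
Down by the willow
Evening settles in"""

hidden = read_acrostic(sample)
print(hidden)
```

Recovery here is purely mechanical, but writing the cover text is not: each line had to be composed by hand to begin with the right letter, which is exactly the creative step a practical system needs to eliminate.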
Key Terms in this Chapter
Density of Encryption: The ratio of words that are replaced to those that are left intact in the course of information hiding.
Sparse Substitution: A system of word substitution that does not target every word of a cover text. Function words and highly ambiguous words will typically be left out; it is a matter of choice what other words may or may not be targeted for substitution.
Linguistic Steganography: A system of steganography that strives for linguistic robustness by paying attention to linguistic criteria.
Synonym-Based Word-Replacement Systems: Systems of text-based steganography in which substitution classes consist of (nearly) synonymous words.
Linguistic Robustness: The likelihood that a cover text altered so as to hide information in it will still appear syntactically and semantically natural to human observers.
Substitution Classes: Sets of words whose members may be replaced by one another within a given genre of cover text with a high probability that the replacement will not adversely affect the syntactic and semantic plausibility of the cover text.
Cover Text: A piece of text that is altered in subtle ways to hide a message in it.
Minimum Length of Cover Text: The length of cover text required in order to hide a message of a given size. This depends on the size of substitution classes and the density of encryption.
Text-Based Steganography: A system of hiding information in a text file (as opposed to, for example, an image file).
Sentence Frames: Sequences of syntactic categories (part of speech tags) extracted from a corpus and used in some text-based stegosystems as templates for generating encrypted messages.
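To make several of these terms concrete, the following Python sketch shows one way sparse substitution over substitution classes could carry hidden bits, and how the minimum length of cover text follows from class size and density of encryption. It is a toy under stated assumptions, not a description of Lunabel itself: the classes, the fixed-width binary encoding of a word's position within its class, the requirement that class sizes be powers of two, and the names `hide`, `recover`, and `min_cover_length` are all hypothetical.

```python
import math

# Hypothetical substitution classes (illustrative only, not taken from Lunabel).
# Assumption: every class size is a power of two.
CLASSES = [
    ["river", "valley", "meadow", "forest"],  # 4 members -> 2 bits per word
    ["silver", "golden"],                     # 2 members -> 1 bit per word
]
# Map each targeted word to (class index, position within its class).
LOOKUP = {w: (ci, wi) for ci, cls in enumerate(CLASSES) for wi, w in enumerate(cls)}


def hide(cover_words, bits):
    """Replace targeted words so that their class positions spell out `bits`."""
    out, pos = [], 0
    for word in cover_words:
        if word in LOOKUP and pos < len(bits):
            ci, _ = LOOKUP[word]
            width = int(math.log2(len(CLASSES[ci])))
            chunk = bits[pos:pos + width].ljust(width, "0")  # pad the last chunk
            out.append(CLASSES[ci][int(chunk, 2)])
            pos += width
        else:
            out.append(word)  # function words and untargeted words stay intact
    if pos < len(bits):
        raise ValueError("cover text too short for this message")
    return out


def recover(stego_words, n_bits):
    """Read the class position of every targeted word; keep the first n_bits."""
    bits = ""
    for word in stego_words:
        if word in LOOKUP:
            ci, wi = LOOKUP[word]
            width = int(math.log2(len(CLASSES[ci])))
            bits += format(wi, f"0{width}b")
    return bits[:n_bits]


def min_cover_length(n_bits, class_size, density):
    """Cover length in words, with density = replaced words / intact words."""
    replaced = math.ceil(n_bits / math.log2(class_size))
    intact = math.ceil(replaced / density)
    return replaced + intact


cover = "the silver river runs past the golden meadow".split()
stego = hide(cover, "110011")
print(" ".join(stego))
print(recover(stego, 6))
print(min_cover_length(16, 4, 0.25))
```

In this sketch the recipient must know both the substitution classes and the message length in bits, which together play the role of the decryption information. With classes of four words (two bits per replaced word) and a density of 0.25, a 16-bit message needs 8 replaced words and 32 intact ones, so larger classes or a higher density shorten the required cover text.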