Authorship Attribution and the Digital Humanities Curriculum

Patrick Juola (Duquesne University, USA)
DOI: 10.4018/978-1-60566-932-8.ch001
Although authorship attribution is simply the determination of who wrote a document by analysis of its content, it is a long-standing problem both in the humanities and in computational text analysis. While traditional methods involve identifying key aspects of style through close reading, new developments in computational science permit a more objective approach through the statistical analysis of superficial characteristics such as vocabulary and word choice. If a writer can be shown (statistically) to have a particular stylistic quirk (‘stylome’) that appears broadly across his or her writing, then other writings displaying that quirk are good candidates to be by the same author. The present chapter describes some of the statistical techniques used to make such judgments, and describes one particular computer program (JGAAP) that is freely available for this purpose. This type of analysis is capable of determining authorship with relatively high accuracy. This potential has significant implications for authorship questions across the humanities curriculum, as well as broader impacts in the world outside the academy. In light of these implications, I argue for the inclusion of more mathematics in the humanities curriculum.
Touchstones And Errors

Assessing documents to determine who wrote them is a classic problem for scholars: you find a manuscript in an archive and look for quirks of handwriting and phrasing that may give a hint as to who wrote it and when. Indeed, the problem of identifying groups of people by their language goes back, literally, at least to the Old Testament:

And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay;

Then said they unto him, Say now Shibboleth; and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand (Judges 12:5-6).

This illustrates one of the simplest methods of determining the author of a text; if you can find a straightforward, all-or-nothing touchstone that only the person of interest would use (or would avoid using), find that in the relevant document and you have your answer. Wellman (1936) describes a similar event in a legal case: he had in his possession documents that would win the case if the jury could be persuaded that a specific witness had written them. These documents were notable for their repeated misspelling of *toutch. By persuading the witness under some pretence to take dictation in court, he was able to show that she misspelled touch in exactly that same way (and thus presumably was the author he was looking for).
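The touchstone test described above can be sketched in a few lines of Python. This is only an illustration, not a method taken from Wellman or from JGAAP: the function names and the sample sentences are invented for the example, and the sketch assumes that a plain case-insensitive whole-word match is an adequate way to spot the marker.

```python
import re

def touchstone_hits(text, touchstone):
    """Return every whole-word, case-insensitive occurrence of the marker in text."""
    pattern = re.compile(r"\b" + re.escape(touchstone) + r"\b", re.IGNORECASE)
    return pattern.findall(text)

def shares_touchstone(questioned, known_sample, touchstone):
    """True only if the marker occurs in both the questioned text and the known sample."""
    return bool(touchstone_hits(questioned, touchstone)) and bool(
        touchstone_hits(known_sample, touchstone)
    )

# Hypothetical texts, invented purely for illustration.
questioned = "Do not toutch the ledger. I will be in toutch by Friday."
dictation = "Please keep in toutch with the office."
other_writer = "Please keep in touch with the office."

print(shares_touchstone(questioned, dictation, "toutch"))
print(shares_touchstone(questioned, other_writer, "toutch"))
```

A match of this kind is only as strong as the rarity of the marker: a misspelling that many writers share would prove very little.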

Another example comes from the famous Beale Manuscript (Singh, 2000). Ostensibly a letter written by Beale in 1822 but not published until 1860, it accompanies a famous set of coded directions to find a fortune in buried treasure. To this day, the directions have only been partly decoded and the treasure has never been found – if it ever existed at all. Scholars are justifiably suspicious on this point, in part because the letter uses words and phrases (such as stampede or improvised tools) that were not employed in 1822, but were in use by 1860 (Kruh, 1982, 1988).

Less formally, every teacher knows to look for damning similarities in answers, especially in wrong answers. If two students both write identical gibberish in response to a homework problem, something is up. Similarly, many teachers have adopted a standard practice of running a search engine like Google on suspicious phrases from essays in the hope of finding the original source from which students have copied nearly word-for-word. All these examples have in common the idea that there is a single piece of information, a smoking gun, as it were, that confirms or, more commonly, rules out the purported author.
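The teacher's check for suspiciously similar answers can be approximated by listing the word sequences that two texts have in common. The sketch below is a rough illustration under stated assumptions: the function names, the choice of five-word sequences, and the two sample answers are all invented for this example, and it is not how any particular search engine or plagiarism tool works.

```python
import re

def word_ngrams(text, n):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(text_a, text_b, n=5):
    """Return, in sorted order, every n-word phrase that appears in both texts."""
    common = word_ngrams(text_a, n) & word_ngrams(text_b, n)
    return sorted(" ".join(gram) for gram in common)

# Hypothetical student answers, invented purely for illustration.
answer_a = "The treaty was signed because the purple monarchs of Mars demanded tribute."
answer_b = "In my view the purple monarchs of Mars demanded tribute from everyone."

for phrase in shared_phrases(answer_a, answer_b):
    print(phrase)
```

Longer sequences (a larger n) make coincidental matches less likely, at the cost of missing copies that have been lightly reworded.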

