Introduction
Author identification is the task of determining the author of a given document. An author identification method extracts features from text, such as word length, vocabulary richness, use of digits, letters, letter bigrams and trigrams, words, function words, part-of-speech (POS) tags, POS bigrams and trigrams, prepositions, pronouns, and conjunctions. Statistical computations are applied to this feature set, and the results are compared between training and test data. The process starts with training data: texts written by different authors, for example 10 authors with 10 documents each (100 text files in total). From these texts a feature set is identified, such as word count, sentence length, vocabulary richness, special characters, the frequency of particular words, punctuation, conjunctions, and pronouns. The authors are then classified on the basis of these features. Next, an input text (the test data) is supplied; it should be written by one of the authors in the training set, but it must not be a text that was used for training. The value of each feature is computed for the input text and compared with the feature values of the training authors, and the author whose values match most closely is taken to be the author of the input text. The main idea behind statistically or computationally supported authorship attribution is that, by measuring some textual features, we can distinguish between texts written by different authors. Figure 1 shows the general framework of author identification, where Author-T1 to Author-Tn represent the training data and Author-2 is the author identified for the anonymous text document by comparison with the training authors' data.
Figure 1. Framework of Author identification
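The process described above can be illustrated with a minimal sketch. It assumes a small hand-picked feature set (average word length, type-token ratio as a stand-in for vocabulary richness, average sentence length, digit rate, and a few function-word rates) and a nearest-profile comparison using Euclidean distance. The function names, the word list, and the choice of distance are illustrative assumptions, not a method prescribed in this paper.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption; real studies use longer lists).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "by", "on"]


def extract_features(text):
    """Return a list of simple stylometric features for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / n_words,                  # average word length
        len(set(words)) / n_words,                             # type-token ratio
        n_words / max(len(sentences), 1),                      # words per sentence
        sum(ch.isdigit() for ch in text) / max(len(text), 1),  # digit rate
    ]
    # Relative frequency of each function word.
    features += [counts[w] / n_words for w in FUNCTION_WORDS]
    return features


def author_profiles(training):
    """Average the feature vectors of each author's training documents."""
    profiles = {}
    for author, docs in training.items():
        vectors = [extract_features(d) for d in docs]
        profiles[author] = [sum(col) / len(col) for col in zip(*vectors)]
    return profiles


def identify_author(text, profiles):
    """Return the author whose profile is closest to the text's features."""
    target = extract_features(text)

    def distance(profile):
        return sum((a - b) ** 2 for a, b in zip(target, profile)) ** 0.5

    return min(profiles, key=lambda author: distance(profiles[author]))
```

The features here are on different scales, so the unscaled distance is dominated by the larger ones, such as sentence length. A practical system would likely standardize each feature before comparison.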
The first attempts to quantify writing style date to the 19th century, with the work of Mendenhall (1887) on the plays of Shakespeare, followed by further statistical studies. Mosteller & Wallace (1964) used the Federalist Papers as their data set: newspaper essays published in 1787 and 1788 by John Jay, Alexander Hamilton, and James Madison. Of the 85 essays in total, 5 were written by Jay, 51 by Hamilton, and 14 by Madison; 3 were written jointly by Hamilton and Madison, and the remaining 12 are disputed, claimed by both Hamilton and Madison. Mosteller & Wallace first tried pairs of synonyms as the identifying feature, but this did not work because there were too few synonym pairs; they then used 30 function words, which did work. This marked the birth of statistical analysis in authorship attribution; they used probabilities and Bayesian analysis. Bayesian classification is based on conditional probabilities. Halder (2014) introduced a new concept, the Bayesian decision-theoretic rough set.
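To make the idea of Bayesian classification on function words concrete, the following is a minimal sketch of a multinomial naive Bayes classifier over function-word counts. It is a deliberate simplification and not the model that Mosteller & Wallace actually fitted. The word list, the uniform prior, the add-one smoothing, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Illustrative subset of function words (an assumption, not the original 30).
FUNCTION_WORDS = ["upon", "while", "whilst", "on", "by", "to"]


def function_word_counts(text):
    """Count occurrences of each function word in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in FUNCTION_WORDS]


def train(docs_by_author):
    """Estimate P(word | author) with add-one smoothing."""
    model = {}
    for author, docs in docs_by_author.items():
        totals = [0] * len(FUNCTION_WORDS)
        for doc in docs:
            for i, c in enumerate(function_word_counts(doc)):
                totals[i] += c
        denom = sum(totals) + len(FUNCTION_WORDS)
        model[author] = [(t + 1) / denom for t in totals]
    return model


def log_posteriors(text, model):
    """Unnormalized log P(author | text), assuming a uniform prior."""
    counts = function_word_counts(text)
    log_prior = math.log(1 / len(model))
    return {
        author: log_prior + sum(c * math.log(p) for c, p in zip(counts, probs))
        for author, probs in model.items()
    }


def classify(text, model):
    """Return the author with the highest posterior score."""
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)
```

Given a model trained on texts of known authorship, `classify(disputed_text, model)` returns the author whose function-word distribution gives the text the highest conditional probability.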
The organization of this paper is as follows. First, we outline the features proposed for author identification, then the methods proposed for author identification. Finally, we conclude the paper with observations and future scope.