A Systematic Review on Author Identification Methods

Sunil Digamberrao Kale (Smt. Kashibai Navale College of Engineering, Pune, India) and Rajesh Shardanand Prasad (Computer Engineering Department, NBN Sinhgad School of Engineering (Savitribai Phule Pune University), Pune, India)
Copyright: © 2017 |Pages: 11
DOI: 10.4018/IJRSDA.2017040106

Abstract

Author identification is a technique for identifying the author of an anonymous text. It has a history of about 130 years, starting with the work of Mendenhall (1887). Applications of author identification include plagiarism detection, identifying anonymous authors, and forensics. In this paper the authors outline the features used for author identification, such as vocabulary and syntactic features. They also outline the types of author identification methods that researchers have worked on: 1. Profile-based approaches, which include probabilistic models, compression models, and the Common n-Grams (CNG) approach; 2. Instance-based approaches, which include vector space models, similarity-based models, and meta-learning models; and 3. Hybrid approaches. The authors conclude the paper with observations and future scope.

Introduction

Author identification is the task of identifying the author of a given document. An author identification method extracts features from the text, such as word length, vocabulary richness, use of digits, letters, letter bigrams, letter trigrams, words, function words, POS tags, POS bigrams, POS trigrams, prepositions, pronouns, and conjunctions. Statistical computations are applied to this feature set and compared across the training and test data. The process begins with training data consisting of texts written by different authors, for example 10 authors with 10 documents each (100 text files in total). A feature set is then chosen, such as word count, sentence length, vocabulary richness, special characters, frequency of particular words, punctuation, conjunctions, and pronouns. Next, the authors are classified based on these features. The input text, i.e. the test data, should belong to one of the authors in the training set but must not be one of the texts used for training. The count of each feature is computed for the input text and compared with the feature counts of the training authors; the author for whom a match is found is taken as the author of the input text. The main idea behind statistically or computationally supported authorship attribution is that by measuring some textual features we can distinguish between texts written by different authors. Figure 1 shows the general framework of author identification, where Author-T1 to Author-Tn represent the training data and Author-2 is the author identified for the anonymous text document by comparing it against the training authors' data.

Figure 1. Framework of author identification
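The training-and-matching process described above can be illustrated with a minimal Python sketch. It is not the method of any particular study: the function-word list, the scaling of mean word length, the use of Euclidean distance as the matching rule, and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, very small function-word list chosen for illustration.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "that", "by", "upon", "on")


def extract_features(text):
    """Relative frequency of each function word, plus scaled mean word length."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    features = [counts[w] / total for w in FUNCTION_WORDS]
    # Divide by 10 so word length is on a scale similar to the frequencies.
    features.append(sum(len(w) for w in words) / total / 10.0)
    return features


def author_profile(documents):
    """Average the feature vectors of all training documents of one author."""
    vectors = [extract_features(d) for d in documents]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def identify_author(training, unknown_text):
    """Return the training author whose profile is closest to the unknown text."""
    profiles = {author: author_profile(docs) for author, docs in training.items()}
    target = extract_features(unknown_text)
    return min(profiles, key=lambda author: distance(profiles[author], target))
```

Here `training` is a dictionary mapping each author name to a list of that author's training texts, and `identify_author(training, text)` returns the name of the closest author. Real systems typically use far richer feature sets and a trained classifier in place of a plain distance comparison.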

The first attempts to quantify writing style date to the 19th century, with the work of Mendenhall (1887) on the plays of Shakespeare, followed by further statistical studies. Mosteller & Wallace (1964) used the Federalist Papers as their data set. These are newspaper essays published in 1787 and 1788 by John Jay, Alexander Hamilton, and James Madison. Of the 85 essays in total, 5 were written by Jay, 51 by Hamilton, and 14 by Madison; 3 were written jointly by Hamilton and Madison, and the remaining 12 are disputed, claimed by both Hamilton and Madison. Mosteller & Wallace first tried synonym pairs as the identifying feature, but this did not work because there were too few synonym pairs. They then used 30 function words, which worked. This was the birth of statistical analysis in the field: they used probabilities and Bayesian analysis, where Bayesian classification is based on conditional probabilities. Halder (2014) introduced a new concept of a Bayesian decision-theoretic rough set.
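As a rough illustration of Bayesian classification over function-word counts, the sketch below uses a multinomial naive Bayes model with add-one smoothing. This is a simplified stand-in and not the model actually used by Mosteller & Wallace; the marker-word list, the uniform prior over authors, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical marker words; not the 30 function words of the original study.
MARKER_WORDS = ("upon", "while", "whilst", "on", "by", "to")


def count_markers(text):
    """Count occurrences of each marker word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in MARKER_WORDS)


def train(training):
    """Estimate log P(word | author) with add-one (Laplace) smoothing."""
    model = {}
    for author, docs in training.items():
        counts = Counter()
        for doc in docs:
            counts.update(count_markers(doc))
        total = sum(counts.values())
        model[author] = {
            w: math.log((counts[w] + 1) / (total + len(MARKER_WORDS)))
            for w in MARKER_WORDS
        }
    return model


def log_posteriors(model, text):
    """Unnormalized log posterior of each author, assuming a uniform prior."""
    counts = count_markers(text)
    prior = math.log(1 / len(model))
    return {
        author: prior + sum(counts[w] * log_prob[w] for w in MARKER_WORDS)
        for author, log_prob in model.items()
    }


def classify(model, text):
    """Return the author with the highest posterior score."""
    scores = log_posteriors(model, text)
    return max(scores, key=scores.get)
```

The conditional probabilities P(word | author) are estimated from the training texts, and a disputed text is assigned to the author under whom its observed marker-word counts are most probable.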

The paper is organized as follows. First we outline the features proposed for author identification, then the methods proposed for author identification. Finally, we conclude with observations and future scope.
