N-Gram Based Approach for Text Authorship Classification: Metric Selection

Elena Mikhailova (Saint Petersburg State University, Russia), Polina Diurdeva (Saint Petersburg State University, Russia) and Dmitry Shalymov (Saint Petersburg State University, Russia)
DOI: 10.4018/IJERTCS.2017070102


Automated authorship attribution is relevant for identifying the author of anonymous texts or of texts whose authorship is in doubt. It can be used in various applications, including author verification, plagiarism detection, and computer forensics. In this article, the authors investigate an approach based on the frequencies of letter combinations for classifying documents by authorship. This technique could also be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of the approach depends substantially on the choice of metric. The research examines and compares four distance measures between a text of unknown authorship and an author's profile: the L1 measure, Kullback-Leibler divergence, the base metric of the Common N-gram (CNG) method, and a variation of the CNG dissimilarity measure. The comparison identifies cases in which one metric outperforms the others for a specific parameter combination. Experiments are conducted on several Russian and English corpora.
Article Preview

1. Introduction

With the growing number of documents available in digital form, methods for processing digital documents have become crucial. Digital document processing covers a wide variety of problems, among them author attribution. Author attribution in turn includes the following subtasks: author identification, verification, plagiarism detection, author profiling or characterization, and others (Stamatatos, 2009). The problem has become especially important because anonymous and falsified data are now widely distributed online. Author attribution can be used effectively in computer forensics (Frantzeskou, Stamatatos, Gritzalis, & Katsikas, 2006) to identify the author of source code. The problem is also significant in linguistics, criminalistics, and historical research.

In this paper, the authors consider the task of writer identification. Writer identification can be defined as determining an author from the general and particular properties of texts that together form a writer's style. A special case of the author identification problem is the classification of documents by authorship. This case arises when the unknown author is assumed to be one of a predefined set of candidates for whom undisputed texts are available (Stamatatos, 2006a).

A writer's style is captured by extracting style markers (stylometric features) from their texts (Keselj, Peng, Cercone, & Thomas, 2003). Measuring most types of style markers requires Natural Language Processing (NLP) preprocessing, which can be complicated and time-consuming. For example, computing word frequencies requires a tokenizer, stemmer, and lemmatizer, while part-of-speech tagging needs a tokenizer, sentence splitter, POS tagger, and so on. However, there are simpler methods that need little preprocessing and are no less effective. For instance, a growing number of researchers are turning to the N-gram approach, which treats a text as a collection of combinations of N consecutive characters. This approach does not demand sophisticated preprocessing of the input text; sometimes only basic filtering is needed, such as removing spaces and/or punctuation marks. An important further advantage is tolerance to spelling and grammatical errors (Cavnar & Trenkle, 1994), because the proportion of erroneous N-grams relative to the total number of N-grams is generally very small. One of the main issues with the approach is choosing an appropriate method for solving author attribution problems.
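To make the N-gram idea concrete, the following minimal Python sketch extracts overlapping character N-grams and turns their counts into relative frequencies. It is an illustration only, not the authors' implementation: the function names `char_ngrams` and `ngram_profile`, the `strip_chars` parameter, and the choice to remove only spaces by default are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n, strip_chars=" "):
    """Return the overlapping character n-grams of `text`.

    Basic filtering only: characters listed in `strip_chars` are removed
    before extraction (hypothetical parameter; spaces by default).
    """
    filtered = "".join(ch for ch in text if ch not in strip_chars)
    return [filtered[i:i + n] for i in range(len(filtered) - n + 1)]


def ngram_profile(text, n, strip_chars=" "):
    """Map each n-gram to its relative frequency in `text`."""
    counts = Counter(char_ngrams(text, n, strip_chars))
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters after filtering: empty profile.
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    print(char_ngrams("ab cab", 2))
    print(ngram_profile("ab cab", 2))
```

Because every character position contributes one N-gram, a single misspelled character affects at most N of the extracted N-grams, which is one way to see why the approach tolerates occasional errors.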

Today, there are many methods based on the N-gram approach. To evaluate the features extracted from texts and to build text attribution models from them, researchers use tools from mathematical statistics, probability theory, machine learning, pattern recognition, and other fields. At present, it is not possible to single out one method that is both accurate and versatile.

In this paper, the researchers discuss profile-based methods and compare several variations with each other. These methods use the concept of a profile: a formal representation of an author's style, given by the character N-gram distribution. A mapping between authors' styles and profiles is established, and the problem is solved by comparing profiles (distributions) using a dedicated metric function. The authors investigated four metrics: the L1 measure and Kullback-Leibler divergence, which were applied to the authorship attribution task by Orlov and Osminin (2012); the base metric of the Common N-gram (CNG) method (Keselj et al., 2003); and a variation of the CNG dissimilarity measure proposed by Stamatatos (2007).
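The preview does not state the formulas for these measures, so the following Python sketch uses their commonly cited forms and should be read as an assumption-laden illustration rather than the authors' implementation. Profiles are assumed to be dictionaries mapping N-grams to relative frequencies. The smoothing constant `eps` in the Kullback-Leibler divergence, the direction of that divergence (text relative to author), and the omission of the usual truncation of CNG profiles to the most frequent N-grams are choices made here for brevity. The function `cng_text_only` shows one possible variation that sums only over the N-grams of the unknown text; whether this matches the exact variant examined in the article is not confirmed by the preview.

```python
import math


def l1_distance(p, q):
    """Sum of absolute frequency differences over all n-grams in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def kl_divergence(p, q, eps=1e-9):
    """KL divergence of p from q; `eps` (assumed) replaces missing q values."""
    return sum(
        pv * math.log(pv / max(q.get(g, 0.0), eps))
        for g, pv in p.items()
        if pv > 0
    )


def _relative_sq_diff(a, b):
    # Squared difference normalised by the mean of the two frequencies.
    return (2.0 * (a - b) / (a + b)) ** 2


def cng_dissimilarity(p, q):
    """Commonly cited CNG form: sum over the union of both profiles."""
    return sum(
        _relative_sq_diff(p.get(g, 0.0), q.get(g, 0.0)) for g in set(p) | set(q)
    )


def cng_text_only(text_profile, author_profile):
    """Assumed variation: sum only over n-grams of the unknown text."""
    return sum(
        _relative_sq_diff(fv, author_profile.get(g, 0.0))
        for g, fv in text_profile.items()
    )


def classify(text_profile, author_profiles, distance):
    """Return the candidate author whose profile is nearest to the text."""
    return min(
        author_profiles,
        key=lambda author: distance(text_profile, author_profiles[author]),
    )


if __name__ == "__main__":
    unknown = {"ab": 0.5, "bc": 0.5}
    authors = {"A": {"ab": 0.5, "bc": 0.5}, "B": {"ab": 0.5, "cd": 0.5}}
    for metric in (l1_distance, kl_divergence, cng_dissimilarity, cng_text_only):
        print(metric.__name__, classify(unknown, authors, metric))
```

Passing the metric as a function argument keeps the classification step identical across measures, so the measures can be compared on the same corpus and with the same parameters.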

The examined metrics demonstrated good results in previous work, but those evaluations came from experiments conducted by different researchers on different corpora and with different parameters. This makes it difficult to assess the considered metrics objectively against one another.

The results obtained for the letter frequency distribution method and for metric performance in the document classification problem can also be used to identify the author of a computer program from a predefined set of possible authors. This task is important for code plagiarism analysis, for proof of authorship (in court), and for tracing source code left behind after a cyber-attack (viruses, Trojan horses, fraud, etc.) (Frantzeskou et al., 2006).
