Authorship Attribution of Noisy Text Data With a Comparative Study of Clustering Methods

Zohra Hamadache (USTHB University, Bab Ezzouar, Algeria) and Halim Sayoud (USTHB University, Bab Ezzouar, Algeria)
Copyright: © 2018 | Pages: 25
DOI: 10.4018/IJKSS.2018040103

With the rapid growth of the volume of data available on the internet, visual analytics (VA) has emerged as a way of visualizing multidimensional data from different perspectives, revealing interesting information about the data and making them clearer and more intelligible. This investigation focuses on the VA-based authorship attribution (AA) task applied to noisy text data. The article also proposes a 3D visual analytics technique based on a sphere implementation. The dataset contains several text documents written by 5 American philosophers, with an average length of 850 words per text, which were scanned and then corrupted with different noise levels. The results show that the hierarchical clustering technique using a fully automated threshold achieves high authorship attribution accuracy, especially with character trigrams and ending bigrams, where the clustering recognition rate (CRR) reaches 100% at noise levels from 0% to 7%. In addition, the proposed 3D sphere technique shows high clustering performance, mainly with words.
1. Introduction

In this section, we briefly introduce authorship attribution, which is the background problem of this investigation, together with its most important basics.

1.1. Authorship Attribution

Centuries ago, authorship attribution (or author identification) already concerned many researchers because of its important role in authentication. Today, the problem persists and has become essential for solving internet information problems such as plagiarism and fraud detection, identifying the source of documents (Li et al., 2013), identifying new authors occurring in a streaming data source (Seker et al., 2013), resolving disputed authorship (Eder et al., 2013; Khonji et al., 2015; Napoli et al., 2015; Segarra et al., 2015; Varela et al., 2016), and attributing anonymous letters, harassing e-mails or messages, conversational texts, and social media content for forensic purposes (Inches et al., 2013; Okuno et al., 2014; Spitters et al., 2015; Rocha et al., 2017). The AA field studies the writing style of an author, an approach also called “stylometry”, in order to identify the author of an anonymous digital or handwritten text segment. Accordingly, suitable features should be extracted from the text document and then combined with an appropriate clustering technique to retrieve the right author. For the identification task, spelling mistakes and stop-words must be kept because they play a very important role in characterizing the author.
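
The feature-extraction step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, an "ending bigram" is assumed here to mean the last two letters of a word, and cosine similarity is only one plausible way of comparing two texts before they are passed to a clustering technique. In line with the remark on spelling mistakes and stop-words, nothing is filtered out or corrected.

```python
from collections import Counter
import math
import re


def char_trigrams(text):
    """Count overlapping character trigrams; spaces and misspellings are kept."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def ending_bigrams(text):
    """Count the last two letters of each word (assumed definition).

    Stop-words are deliberately not removed.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word[-2:] for word in words if len(word) >= 2)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy strings standing in for two documents; not data from the article.
    doc_a = "The truth of an idea is not a stagnant property inherent in it."
    doc_b = "Truth hapens to an idea; it becomes true, is made true by events."
    similarity = cosine_similarity(char_trigrams(doc_a), char_trigrams(doc_b))
    print(f"character-trigram similarity: {similarity:.3f}")
    print(ending_bigrams(doc_a).most_common(3))
```

A clustering technique would then group documents whose feature vectors are close to one another; the choice of distance measure and of threshold is left open in this sketch.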
