Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Hassina Hadjadj, Halim Sayoud
DOI: 10.4018/IJCINI.20211001.oa33

Abstract

Nowadays, dealing with imbalanced data represents a great challenge in data mining as well as in machine learning tasks. In this investigation, we are interested in the problem of class imbalance in the Authorship Attribution (AA) task, with specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset used contains 7 Arabic books written by 7 different scholars, segmented into text segments of the same size, with an average length of 2900 words per segment. The results of our experiments show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method proves valuable in improving AA performance on imbalanced datasets, particularly with function words.

1. Introduction

Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984). Researchers have applied AA to a myriad of tasks such as email authorship verification, categorizing harassing emails and anonymous messages in textual conversations, social media forensics (Rocha et al., 2017), and online criminality (Edwards, 2018). In addition, AA can be used to identify document sources (Li et al., 2013), resolve disputed authorship (Eder, 2015), detect plagiarism in student essays (AlSallal et al., 2019), etc.

AA consists of studying the author's writing pattern (or stylometry) to answer the following question: who is the author of this document? Accordingly, a suitable set of features is extracted and combined with a reliable classification technique to find the right author. In this regard, function words (stop words) and spelling errors should be kept, because they play a substantial role in the identification task. Two parameters are important in stylometry and should be exploited, namely the text's length (number of words) and the number of authors. In addition, some researchers have set conditions for accurately identifying authors, such as requiring the same theme, the same genre (i.e. poems, news, scientific papers, etc.) and the same period of time. However, feature extraction is not the only factor that influences AA; others include the dataset size (training and test), the number of candidate authors and the distribution of the training corpus over the authors (balanced or unbalanced dataset).

Over the decades, many stylometric features have been investigated and applied in AA, such as sentence length and vocabulary richness (Yule, 1994), function words (Holmes et al., 2001; Zhao & Zobel, 2005), punctuation marks (Baayen et al., 2002) and character n-grams (Juola, 2004). Function words tend to produce the best performance for two reasons. First, their frequency in a document is hardly under conscious control, which reduces the risk of false attribution. Second, function words, unlike content words, are largely independent of the text's topic or genre (Argamon et al., 2007).
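To make the feature types concrete, the following sketch computes a frequency profile of starting character-bigrams, read here as the first two characters of each word; the paper's exact definition of this feature may differ, so this is an illustrative assumption rather than the authors' implementation.

```python
from collections import Counter

def starting_bigram_profile(text):
    """Relative-frequency profile of each word's first two characters
    (one reading of 'starting character-bigrams')."""
    bigrams = [word[:2] for word in text.split() if len(word) >= 2]
    total = len(bigrams)
    # Normalise counts so profiles of texts with different lengths
    # are directly comparable.
    return {bg: count / total for bg, count in Counter(bigrams).items()}

profile = starting_bigram_profile("the theory of the author was thorough")
```

The same normalised-count scheme applies to function-word features: replace the bigram extraction with counts of a fixed stop-word list.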

Regarding related work in AA, most studies have addressed Latin-script languages (e.g. English), while few have been conducted on the Arabic language, and fewer still on unbalanced data. Several researchers have confirmed that balanced datasets yield higher accuracies than unbalanced ones (Li et al., 2018; Pan et al., 2020). However, it is difficult to collect sufficient data for each author. The aim of this investigation is therefore to address Arabic AA on an unbalanced dataset, using seven Arabic books of different lengths written by different authors (in the same period).

To address the AA problem, we have combined unsupervised principal component analysis (PCA), which reduces data dimensionality, with the SMOTE oversampling technique. We conducted a series of experiments on our dataset (i.e. the SAB-2 dataset), where PCA was first applied to eliminate irrelevant features, and SMOTE resampling was then used to balance the class distribution and increase the variety of the sample domain. Finally, SMO-SVM and BayesNet classifiers were applied to the filtered dataset and compared using different evaluation metrics. The hybrid approach combining both algorithms showed interesting performance (100% accuracy) on unbalanced data.
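The oversampling step can be summarised with a minimal pure-Python sketch of SMOTE (Chawla et al., 2002): each synthetic sample is placed on the line segment between a minority-class point and one of its k nearest minority-class neighbours. This is a simplified illustration under that standard definition, not the off-the-shelf implementation the authors presumably used.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points for a minority class given as
    a list of feature tuples (minimal SMOTE sketch)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append(tuple(x + gap * (y - x)
                               for x, y in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_pts = smote(minority, n_new=4)
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority class already occupies in feature space (here, the PCA-reduced space).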

This paper is organized as follows: section 2 presents related work on AA and Arabic AA. The dataset is described in section 3, while section 4 presents our AA approach. Finally, section 5 reports the experimental results, and section 6 concludes this research work.
