Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Hassina Hadjadj, Halim Sayoud
DOI: 10.4018/IJCINI.20211001.oa33

Abstract

Nowadays, dealing with imbalanced data represents a great challenge in data mining as well as in machine learning tasks. In this investigation, we are interested in the problem of class imbalance in the Authorship Attribution (AA) task, with specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset used contains 7 Arabic books written by 7 different scholars, segmented into text segments of the same size, with an average length of 2900 words per segment. The results of our experiments show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method proves valuable in improving AA performance on imbalanced datasets, particularly with function words.

1. Introduction

Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984). Researchers have applied AA to a myriad of tasks such as email authorship verification, categorizing harassing emails and anonymous messages in textual conversations, social media forensics (Rocha et al., 2017), and online criminality (Edwards, 2018). In addition, AA can be used to identify document sources (Li et al., 2013), resolve disputed authorship (Eder, 2015), detect plagiarism in student essays (AlSallal et al., 2019), etc.

AA consists of studying the author's writing pattern (or stylometry) to answer the following question: who is the author of this document? Accordingly, a suitable set of features is extracted and combined with a reliable classification technique to find the right author. In this regard, function words (stop words) and spelling errors should be kept, because they play a substantial role in the identification task. Two parameters are important in stylometry and should be exploited, namely the text's length (number of words) and the number of authors. In addition, some researchers have set conditions for accurately identifying authors, such as requiring the same theme, the same genre (i.e. poems, news, scientific papers, etc.) and the same period of time. However, feature extraction is not the only factor that influences AA; others include the dataset size (training and test), the number of candidate authors and the distribution of the training corpus over the authors (balanced or unbalanced dataset).

Over the decades, many stylometric features have been investigated and applied in AA, such as sentence length and vocabulary richness (Yule, 1994), function words (Holmes et al., 2001; Zhao & Zobel, 2005), punctuation marks (Baayen et al., 2002) and character n-grams (Juola, 2004). Function words tend to produce the best performance for two reasons. First, their frequency in a document is hardly under conscious control, which reduces the risk of false attribution. Second, function words, unlike content words, are largely independent of the text's topic or genre (Argamon et al., 2007).
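To make the feature types concrete, the following sketch computes a frequency profile of starting character-bigrams, read here as the first two characters of each word; the paper's exact definition of this feature may differ, so this is an illustrative assumption rather than the authors' implementation.

```python
from collections import Counter

def starting_bigram_profile(text):
    """Relative-frequency profile of each word's first two characters
    (one reading of 'starting character-bigrams')."""
    bigrams = [word[:2] for word in text.split() if len(word) >= 2]
    total = len(bigrams)
    # Normalise counts so profiles of texts with different lengths
    # are directly comparable.
    return {bg: count / total for bg, count in Counter(bigrams).items()}

profile = starting_bigram_profile("the theory of the author was thorough")
```

The same normalised-count scheme applies to function-word features: replace the bigram extraction with counts of a fixed stop-word list.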

Regarding related work in AA, most studies have addressed Latin-script languages (e.g. English), while few have been conducted on the Arabic language, and fewer still on unbalanced data. Several researchers have confirmed that balanced datasets yield higher accuracies than unbalanced ones (Li et al., 2018; Pan et al., 2020). However, it is difficult to collect sufficient data for each author. The aim of this investigation is therefore to address Arabic AA on an unbalanced dataset, using seven Arabic books of different lengths written by different authors (in the same period).

To address the AA problem, we have combined unsupervised principal component analysis (PCA), which reduces data dimensionality, with the SMOTE oversampling technique. We conducted a series of experiments on our dataset (i.e. the SAB-2 dataset), where PCA was first applied to eliminate irrelevant features, and SMOTE resampling was then used to balance the class distribution and increase the variety of the sample domain. Finally, SMO-SVM and BayesNet classifiers were applied to the filtered dataset and compared using different evaluation metrics. The hybrid approach combining both algorithms showed interesting performance (100% accuracy) on unbalanced data.
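The oversampling step can be summarised with a minimal pure-Python sketch of SMOTE (Chawla et al., 2002): each synthetic sample is placed on the line segment between a minority-class point and one of its k nearest minority-class neighbours. This is a simplified illustration under that standard definition, not the off-the-shelf implementation the authors presumably used.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points for a minority class given as
    a list of feature tuples (minimal SMOTE sketch)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append(tuple(x + gap * (y - x)
                               for x, y in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_pts = smote(minority, n_new=4)
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority class already occupies in feature space (here, the PCA-reduced space).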

This paper is organized as follows: section 2 presents related work on AA and Arabic AA. The dataset is described in section 3, while section 4 presents our AA approach. Finally, section 5 reports the experimental results, and section 6 concludes this research work.
