Performance Enhancement of the Unbalanced Text Classification Problem Through a Modified Chi Square-Based Feature Selection Technique: A Mod-Chi based FS technique

Santosh Kumar Behera, Rajashree Dash
Copyright: © 2022 |Pages: 23
DOI: 10.4018/IJIIT.309581

Abstract

This paper proposes a modified chi-square-based feature selection algorithm, used in conjunction with a random vector functional link network-based text classifier, to improve the classification performance of multi-labeled text documents with unbalanced class distributions. As an improvement over the original chi-square method, the proposed feature selection method selects the maximum number of features from classes that have a large number of training and testing documents. To assess its effectiveness, the model is compared with three conventional feature selection techniques (chi-square, term frequency-inverse document frequency, and mutual information) on two benchmark datasets that are multi-labeled, multi-class, and unbalanced. Additionally, the proposed model is compared with four different classifiers. The study found that the proposed model performs better in terms of precision, recall, F-measure, and Hamming loss, and that it is able to select the majority of true positive documents despite an unbalanced class distribution for both datasets.
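For orientation, the baseline chi-square score that the abstract refers to can be sketched as follows. This is a minimal sketch of the standard term/class chi-square statistic on a 2x2 contingency table, not the modified variant proposed in the paper; the function names `chi_square` and `chi_square_scores` and the toy documents are hypothetical and chosen only for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Standard chi-square score for one (term, class) pair.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that do not contain the term
    n00: documents outside the class that do not contain the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        # A margin is empty (e.g. the term occurs in every document).
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def chi_square_scores(docs, labels):
    """Score every (term, class) pair in a multi-label collection.

    docs:   list of sets of terms, one set per document
    labels: list of sets of class names, one set per document
    """
    n = len(docs)
    vocab = set().union(*docs)
    classes = set().union(*labels)
    scores = {}
    for cls in classes:
        in_cls = sum(1 for l in labels if cls in l)
        for term in vocab:
            with_term = sum(1 for d in docs if term in d)
            n11 = sum(1 for d, l in zip(docs, labels) if term in d and cls in l)
            n10 = with_term - n11
            n01 = in_cls - n11
            n00 = n - n11 - n10 - n01
            scores[(term, cls)] = chi_square(n11, n10, n01, n00)
    return scores


# Toy collection (hypothetical data, for illustration only).
toy_docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "election"}, {"vote", "team"}]
toy_labels = [{"sport"}, {"sport"}, {"politics"}, {"politics"}]
toy_scores = chi_square_scores(toy_docs, toy_labels)
```

In this toy collection, a term such as "goal" that occurs only in one class receives a high score for that class, while "team", which is spread evenly across both classes, scores zero. Note that the plain statistic scores each (term, class) pair in isolation and does not by itself decide how many features each class contributes; that allocation is the part the abstract describes modifying.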

1. Introduction

Thanks to the rapid development of information technology, news articles, internet sites, emails, and digital libraries are all accessible as electronic text documents. To manage such vast amounts of data, text classification (TC) has evolved into an essential tool for locating and categorizing text content. In TC, unlabeled text documents are typically assigned to one or more predefined categories depending on their content (Harish & Revanasiddappa, 2017). TC has been readily adapted to many applications, including spam detection (Crawford et al., 2015), document categorization (Jiang et al., 2016), sentiment analysis (Bakshi et al., 2016), email classification (Nikhath et al., 2016), text summarization (Jo, 2017), and so on. Improving TC accuracy remains a difficult task for researchers because text collections contain large numbers of extremely sparse terms and are inherently skewed. Consequently, feature selection (FS) is a key component of text categorization. Choosing the best features for TC becomes even more difficult when the documents in the collection are associated with multiple categories and the classes are unevenly distributed.
