Enhanced Filter Feature Selection Methods for Arabic Text Categorization

Abdullah Saeed Ghareb, Azuraliza Abu Bakar, Qasem A. Al-Radaideh, Abdul Razak Hamdan
Copyright: © 2018 | Pages: 24
DOI: 10.4018/IJIRR.2018040101

Abstract

The filtering of large amounts of data is an important process in data mining tasks, particularly for the categorization of unstructured, high-dimensional data. A feature selection process is therefore desired to reduce the high-dimensional space to a small subset of relevant features that best represent the text for categorization. In this article, three enhanced filter feature selection methods are proposed: the Category Relevant Feature Measure, the Modified Category Discriminated Measure, and Odd Ratio2. These methods combine relevant information about features both within and across categories. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated with the traditional measures of text categorization, namely the macro-averages of precision, recall, and F-measure. Experiments are conducted on three Arabic text datasets used for text categorization. The experimental results show that the proposed methods achieve better or comparable results when compared with 12 well-known traditional methods.
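As a point of reference for the evaluation measures named above, the following is a minimal Python sketch of how macro-averaged precision, recall, and F-measure are computed from per-category counts. The category names and counts are invented for illustration and are not taken from the article's datasets.

# Macro-averaging: compute precision, recall, and F-measure per category,
# then average the per-category scores with equal weight.
def macro_averages(per_category_counts):
    """per_category_counts maps category -> (true positives, false positives, false negatives)."""
    precisions, recalls, f_measures = [], [], []
    for tp, fp, fn in per_category_counts.values():
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)
    n = len(per_category_counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Hypothetical counts for three categories (illustrative only).
counts = {"sports": (80, 10, 5), "economy": (60, 20, 15), "politics": (70, 15, 10)}
print(macro_averages(counts))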
Article Preview

1. Introduction

The volume of text in digital form is continuously increasing, and investigating and constructing new techniques and methods that can handle and control this considerable amount of text has become an active area of research. Text categorization techniques have become one of the key technologies for text pattern recognition and text organization; however, these techniques struggle with the large dimensionality of text because of the huge number of text features. Therefore, the feature selection (FS) process can benefit categorization techniques: it is used as a pre-processing step for dimensionality reduction and relevant feature selection, which simplifies the categorization process and helps identify the patterns of text features more precisely (Mesleh, 2011; Uguz, 2011).

Text categorization is the process of assigning text documents to one or more predefined categories based on knowledge extracted from the documents (Manning & Schutze, 1999). In recent years, many categorization methods have been proposed for categorizing text in different languages, such as K-nearest neighbour (KNN) (Abu Tair & Baraka, 2013; Jiang et al., 2012), support vector machine (SVM) (Joachims, 1998; Mesleh, 2011), naïve Bayes (NB) (Chen et al., 2009; Hattab & Hussein, 2013), decision tree (Harrag et al., 2010), and associative classification (AC) (Al-Radaideh et al., 2011; Chaing et al., 2008; Ghareb et al., 2012; Srvidhya & Anitha, 2009; Khorsheed & Al-Thubaity, 2013).
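To make the classifier families listed above concrete, here is a minimal sketch of naïve Bayes text categorization using scikit-learn. The toy English corpus and labels are invented for illustration; the article's experiments use Arabic datasets and also evaluate associative classification, which is not shown here.

# Bag-of-words features + multinomial naïve Bayes, one of the classifiers cited above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["stock markets rise", "the team wins the final", "budget deficit grows"]
train_labels = ["economy", "sports", "economy"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # term-frequency features

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["the team lost the match"])
print(classifier.predict(X_test))                # predicted category label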

The high dimensionality of the feature space is a major problem in Arabic text categorization because of the presence of noisy and irrelevant features, which adversely affect categorization performance and degrade computer resources (Abu Tair & Baraka, 2013; Sharef et al., 2012; Zahran & Kanaan, 2009). Therefore, the feature selection process is frequently used to reduce the high dimensionality of text and select the most informative features for Arabic text categorization. Feature selection can be defined as “a process that chooses an optimal subset of features according to certain criterion” (Liu & Motoda, 1998). Feature selection techniques can be grouped according to how they evaluate features. As reported in the literature, feature selection methods for Arabic text categorization can be divided into filter feature selection methods and meta-heuristic/wrapper approaches. Filter feature selection methods are frequently used for Arabic text categorization because of their efficiency and simplicity; they include chi-square, odd ratio, document frequency, information gain, and many others (Mesleh, 2011; Harrag et al., 2010).
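As an illustration of the traditional filter methods mentioned above (not the proposed Category Relevant Feature Measure, Modified Category Discriminated Measure, or Odd Ratio2), the following sketch scores terms with the chi-square statistic and keeps only the top-ranked features before categorization. The small corpus is invented for illustration.

# Chi-square filter feature selection: score each term against the category
# labels, then keep the k highest-scoring terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["oil prices fall sharply", "the league match ended in a draw",
        "central bank raises interest rates", "the striker scored twice"]
labels = ["economy", "sports", "economy", "sports"]

vectorizer = CountVectorizer().fit(docs)
X_counts = vectorizer.transform(docs)

selector = SelectKBest(chi2, k=5).fit(X_counts, labels)
kept_terms = [term for term, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print(kept_terms)   # the reduced feature subset used for categorization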
