Effective Technique to Reduce the Dimension of Text Data

D.S. Guru (Department of Studies in Computer Science, University of Mysore, Mysore, India), K. Swarnalatha (MIT Thandavapura, India), N. Vinay Kumar (Department of Studies in Computer Science, University of Mysore, Mysore, India) and Basavaraj S. Anami (Karnataka Lingayat Education Institute of Technology, Karnataka, India)
Copyright: © 2020 | Pages: 19
DOI: 10.4018/IJCVIP.2020010104

Abstract

In this article, features for imbalanced text data are selected through feature clustering and feature ranking. Initially, the text documents are represented in a lower dimension using the term class relevance (TCR) method. Class-wise clustering is recommended to balance the documents in each class. Subsequently, the clusters are treated as classes, and the documents of each cluster are again represented in the lower-dimensional form using TCR. The features are then clustered, a representative is selected for each feature cluster, and these representatives are used as the selected features of the documents. Hence, the proposed model reduces the dimension to a smaller number of features. Four feature evaluation methods are used to select the cluster representative, and classification is done using an SVM classifier. The performance of the method is compared with the global feature ranking method. The experiments are conducted on two benchmark datasets, Reuters-21578 and TDT2. The experimental results show that this method performs well when compared to other existing works.

1. Introduction

Automatic text classification is in high demand because of the increase in text content on the web and the use of web applications in all fields of day-to-day life, which makes manual classification very difficult (Sebastiani, 2002). Text classification is an important tool in many applications such as text spotting, news categorization, sentiment analysis, spam analysis, etc. (Harish et al., 2010; Aggarwal & Zhai, 2012). The most commonly used text representation technique is the Bag-of-Words or vector space model (Rehman et al., 2015; Sebastiani, 2002). A text corpus usually contains a large vocabulary of terms, and it yields a high-dimensional representation with noisy features when represented in vector form. High dimensionality and noisy features are the two major issues in effective classification of text documents. Therefore, selection of features for the representation of text is an essential stage in any text classification system, as it reduces both the dimension of the data and the computation time (Guyon et al., 2006).

The documents in a corpus are represented in the form of a document-term matrix of size N × d (Guru & Suhil, 2015), where N is the total number of documents and d is the total number of terms present in the corpus. However, this representation is sparse and reduces the efficiency of the text classification system due to the presence of noisy, irrelevant and redundant features (Ferreira & Figueiredo, 2012; Guru et al., 2018; Debole & Sebastiani, 2003). An effective representation of text can relieve this pitfall and also increase the performance of a text classification system. Hence, effective feature selection (FS) is essential to select the best features for the representation of text data.
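To make the N × d representation and its sparsity concrete, the following is a minimal sketch that builds a raw term-count matrix for a toy corpus. The corpus, the whitespace tokenizer, and the helper names `build_document_term_matrix` and `sparsity` are illustrative assumptions and are not taken from the article.

```python
from collections import Counter


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix is an N x d list of term counts."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus: N = 3 documents.
corpus = [
    "oil prices rise",
    "oil exports fall",
    "team wins final match",
]
vocabulary, dtm = build_document_term_matrix(corpus)
print(len(dtm), len(vocabulary))  # N and d
print(round(sparsity(dtm), 2))    # share of zero entries
```

Even in this three-document example most entries are zero, and on a real corpus d grows with the vocabulary while each document uses only a small part of it, which is why the matrix becomes very sparse.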

In this work, a new method of selecting features through clustering of features is proposed, using supervised and unsupervised feature evaluation criteria. In the literature, very few attempts at feature selection through clustering of features can be found (Goswami et al., 2017).
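The general idea of selecting features by clustering them and keeping one representative per cluster can be sketched as follows. This is not the authors' algorithm: the greedy cosine-similarity clustering, the 0.8 threshold, and the use of variance as an unsupervised evaluation criterion are all assumptions chosen only to keep the example small, and every function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature columns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def variance(column):
    """Population variance, used here as a stand-in unsupervised score."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


def cluster_features(columns, threshold=0.8):
    """Greedy clustering: a feature joins the first cluster whose seed it resembles."""
    clusters = []  # each cluster is a list of feature indices; index 0 is the seed
    for j, column in enumerate(columns):
        for cluster in clusters:
            if cosine(columns[cluster[0]], column) >= threshold:
                cluster.append(j)
                break
        else:
            clusters.append([j])  # no similar seed found, start a new cluster
    return clusters


def select_representatives(columns, clusters, score=variance):
    """Keep the highest-scoring feature of each cluster."""
    return [max(cluster, key=lambda j: score(columns[j])) for cluster in clusters]


# Hypothetical 4 documents x 4 features count matrix.
matrix = [
    [2, 1, 0, 0],
    [4, 2, 0, 1],
    [0, 0, 3, 2],
    [0, 0, 1, 1],
]
columns = [list(col) for col in zip(*matrix)]  # one list per feature
clusters = cluster_features(columns)
selected = select_representatives(columns, clusters)
print(clusters)  # [[0, 1], [2, 3]]
print(selected)  # [0, 2]
```

Here four features are reduced to two, one per cluster. A supervised criterion would replace `variance` with a score that also takes the class labels into account.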

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the feature ranking criteria, Section 4 gives the complete proposed model, and Section 5 gives the experimental results along with the datasets and a comparative analysis. Finally, Section 6 concludes the paper.
