A Novel Approach for Ontology-Based Feature Vector Generation for Web Text Document Classification

Mohamed K. Elhadad (Computer Engineering Department, Military Technical College, Cairo, Egypt), Khaled M. Badran (Computer Engineering Department, Military Technical College, Cairo, Egypt) and Gouda I. Salama (Computer Engineering Department, Military Technical College, Cairo, Egypt)
Copyright: © 2018 |Pages: 10
DOI: 10.4018/IJSI.2018010101

Abstract

Extracting the feature vector used in mining tasks (classification, clustering, etc.) is considered the most important step for enhancing text processing capabilities. This paper proposes a novel approach for building the feature vector used in web text document classification by adding semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology to eliminate meaningless words, that is, words that have no semantic relation with any of the WordNet lexical categories, from the generated feature vector. This reduces the feature vector size without losing information about the text, and it also enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency Inverse Document Frequency (TFIDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach and against an ontology-based reduction technique that does not add semantics to the generated feature vector, in several experiments with five different classifiers (SVM, JRIP, J48, Naive-Bayes, and kNN). The experimental results show that the authors' proposed approach is more effective than the other, traditional approaches, achieving better classification accuracy, F-measure, precision, and recall.
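The pipeline described in the abstract (drop words with no lexical category, tag the remaining words with their category, then weight them with TFIDF in a vector space) can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_LEXICON` table is a hypothetical stand-in for a real WordNet lookup, the function names and the `word#category` tag format are assumptions, and the weighting uses one common TFIDF variant (relative term frequency times log(N/df)), which may differ from the variant used in the paper.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: word -> lexical category.
# A real system would query WordNet; these entries are illustrative only.
TOY_LEXICON = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "run": "verb.motion",
    "bank": "noun.possession",
    "money": "noun.possession",
}


def semantic_tokens(text, lexicon=TOY_LEXICON):
    """Drop words with no lexical category; tag the rest as word#category."""
    tokens = text.lower().split()
    return [f"{t}#{lexicon[t]}" for t in tokens if t in lexicon]


def tfidf_vectors(docs, lexicon=TOY_LEXICON):
    """Return (vocabulary, TF-IDF vectors) built from the tagged tokens."""
    tagged = [semantic_tokens(d, lexicon) for d in docs]
    vocab = sorted({t for doc in tagged for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each tagged term.
    df = Counter(t for doc in tagged for t in set(doc))
    vectors = []
    for doc in tagged:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(tf[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors
```

In this sketch, words missing from the lexicon (for example stop words or misspellings) never enter the vocabulary, which is where the reduction in feature vector size comes from, while the category suffix keeps the same surface word distinct when it is looked up under different categories.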
Article Preview

In this section, we briefly review background research, including how feature vectors are extracted from text documents, some previously applied techniques for web text document classification, and previous attempts to apply semantic knowledge to enhance classification accuracy.

The references (Rasane, 2016; Uma, 2016; Venkata Sailaja, 2016) present a full review of current trends in text document classification, of classification algorithms, and of the techniques used to extract feature vectors for different mining tasks. In addition, (Said, 2007) conducts a comparative study of dimensionality reduction (DR) techniques that allows users to make informed choices among the available techniques for enhancing automatic text categorization.

In (Davy, 2007), PCA was used as an efficient dimensionality reduction technique for text document classification. The experimental results show that dimensionality reduction significantly improves performance when using a kNN classification algorithm over two benchmark corpora (a subset of 20 Newsgroups and a subset of Reuters-21578). The study uses both the globally performed Document Frequency technique and the Principal Component Analysis technique for dimensionality reduction. In both sets of experiments, PCA was found to outperform the globally performed Document Frequency technique.
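To make the simpler of the two baselines concrete, the sketch below shows one plausible reading of document frequency selection performed globally: terms are ranked by the number of documents in the whole corpus that contain them, and only the top k are kept as features. This is an assumption about the technique rather than the procedure from (Davy, 2007); the function names and the whitespace tokenizer are hypothetical, and PCA itself is not shown.

```python
from collections import Counter


def select_by_document_frequency(docs, k):
    """Keep the k terms that occur in the most documents of the whole corpus."""
    tokenized = [d.lower().split() for d in docs]
    # Count each term once per document, regardless of repeats inside it.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]


def project(doc, kept_terms):
    """Represent a document as raw term counts over the kept terms only."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in kept_terms]
```

Unlike PCA, which builds new features as combinations of the original terms, this kind of selection only discards terms, so the surviving dimensions remain directly interpretable as words.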
