Exploiting Semantics to Improve Classification of Text Corpus

Hammad Majeed (National University of Computer and Emerging Sciences (NUCES), Pakistan) and Firoza Erum (National University of Computer and Emerging Sciences (NUCES), Pakistan)
Copyright: © 2016 | Pages: 14
DOI: 10.4018/978-1-4666-9767-6.ch002


The Internet is growing fast, with millions of web pages containing information on every topic. Data placed on the Internet is not organized, which makes searching difficult. Classifying web pages into predefined classes can improve the organization of this data. This chapter presents a semantics-based technique for classifying a text corpus with high accuracy. The technique uses well-known pre-processing steps such as word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between terms. The authors believe that comparison based on semantic similarity, used alongside syntactic matching, makes classification significantly more accurate. The proposed technique is tested on a benchmark dataset, and the results are compared with previously published results. The obtained results are significantly better, and they are achieved with a small, highly relevant feature set.
Chapter Preview


Web page classification (WPC), also known as web page categorization, is the task of identifying the category or categories to which a web page belongs. Choi and Yao (2005) defined it mathematically as follows:

Let C = {c1, c2, …, ck} be the set of predefined categories and D = {d1, d2, …, dn} be the set of web pages (documents) to be classified. The decision matrix is Z = D × C, where each entry zij takes a value in {0, 1}: 1 indicates that document di belongs to category cj, and 0 indicates that it does not. A document can belong to more than one category. Web page classification means approximating the function f: D × C → {0, 1} by a learned function f′: D × C → {0, 1}, called a classifier, such that the two functions match as closely as possible. The function f′ is acquired by machine learning over training examples, each labeled with the category to which it belongs. The function f′ is learned during training and then used for the classification of web pages. The decision matrix is given in Table 1.

Table 1.
Decision matrix for classifying n documents in k classes
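
To make the definition concrete, the following minimal Python sketch builds a decision matrix for a handful of documents and measures how closely a predicted matrix matches it. The document identifiers, category names, label assignments, and the helper names `decision_matrix` and `agreement` are illustrative assumptions and do not come from the chapter.

```python
from typing import Dict, List, Set


def decision_matrix(
    documents: List[str],
    categories: List[str],
    labels: Dict[str, Set[str]],
) -> List[List[int]]:
    """Return Z with Z[i][j] = 1 if documents[i] belongs to categories[j], else 0."""
    return [
        [1 if category in labels.get(doc, set()) else 0 for category in categories]
        for doc in documents
    ]


def agreement(z_true: List[List[int]], z_pred: List[List[int]]) -> float:
    """Fraction of (document, category) entries on which two matrices agree."""
    total = sum(len(row) for row in z_true)
    matches = sum(
        1
        for row_true, row_pred in zip(z_true, z_pred)
        for t, p in zip(row_true, row_pred)
        if t == p
    )
    return matches / total


# Hypothetical data: three documents, two categories.
documents = ["d1", "d2", "d3"]
categories = ["sports", "politics"]

# True memberships; d2 belongs to both categories, d3 to none.
true_labels = {"d1": {"sports"}, "d2": {"sports", "politics"}, "d3": set()}
# Memberships produced by some learned classifier (invented for illustration).
predicted_labels = {"d1": {"sports"}, "d2": {"sports"}, "d3": set()}

Z_true = decision_matrix(documents, categories, true_labels)
Z_pred = decision_matrix(documents, categories, predicted_labels)
score = agreement(Z_true, Z_pred)
```

Here each row of the matrix corresponds to a document and each column to a category, so a row with several 1s represents a document that belongs to more than one category. The agreement score is only one simple way to express how closely a classifier matches the true assignment.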

Basic Approaches Of Web Page Classification

Web page classification is a supervised machine-learning problem in which a web page is categorized using a trained classifier. Web pages are written in HTML and are semi-structured. They are connected through hyperlinks, forming a directed graph. Data on the web is abundant, heterogeneous, and rapidly changing.
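
As an illustration of the directed-graph view of hyperlinked pages, the sketch below extracts `href` targets from a few small HTML strings and builds an adjacency list. It uses only Python's standard `html.parser` module. The page names, the HTML snippets, and the names `LinkExtractor` and `build_link_graph` are assumptions made for this example, not part of the chapter's method.

```python
from html.parser import HTMLParser
from typing import Dict, List


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_link_graph(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each page to the pages it links to (edges point from source to target)."""
    graph: Dict[str, List[str]] = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        # Keep only links that point to pages inside this small collection.
        graph[url] = [link for link in parser.links if link in pages]
    return graph


# Hypothetical three-page site.
pages = {
    "a.html": '<html><body><a href="b.html">B</a> <a href="c.html">C</a></body></html>',
    "b.html": '<html><body><a href="a.html">A</a></body></html>',
    "c.html": "<html><body><p>No outgoing links.</p></body></html>",
}
graph = build_link_graph(pages)
```

The edges are directed because a link from one page to another does not imply a link back: in this toy collection `a.html` links to `c.html`, but `c.html` has no outgoing links.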
