Exploiting Semantics to Improve Classification of Text Corpus

Hammad Majeed, Firoza Erum
Copyright: © 2016 | Pages: 14
DOI: 10.4018/978-1-4666-9767-6.ch002

Abstract

The Internet is growing fast, with millions of web pages containing information on every topic. The data placed on the Internet is not organized, which makes searching difficult. Classifying web pages into predefined classes can improve the organization of this data. In this chapter, a semantics-based technique is presented for classifying a text corpus with high accuracy. The technique uses well-known pre-processing techniques such as word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between terms. The authors believe that comparison based on semantic similarity, in addition to syntactic matching, makes the classification process significantly more accurate. The proposed technique is tested on a benchmark dataset, and the results are compared with previously published results. The obtained results are significantly better, and they are achieved with a small, highly relevant feature set.

Introduction

Web page classification (WPC), also known as web page categorization, is the task of identifying the category or categories to which a web page belongs. Choi and Yao (2005) define it formally as follows:

Let C = {c1, c2, …, ck} be a set of predefined categories and D = {d1, d2, …, dn} be the set of web pages (documents) to be classified. The decision matrix is Z = D × C, where each entry Zij takes a value in {0, 1}: 1 indicates that document di belongs to category cj, and 0 indicates that it does not. A document can belong to more than one category. Web page classification means approximating the target function f: D × C → {0, 1} by a learned function f′: D × C → {0, 1}, called a classifier, such that the two functions match each other as closely as possible. The classifier f′ is acquired by machine learning over training examples, each of which is labeled with the category to which it belongs. The function is then used both during training and when classifying web pages. The decision matrix is given in Table 1.

Table 1.
Decision matrix for classifying n documents in k classes

        C1     C2     …     Ck
D1      Z11    Z12    …     Z1k
D2      Z21    Z22    …     Z2k
…       …      …      …     …
Dn      Zn1    Zn2    …     Znk
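
The definition above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the chapter: the function names `decision_matrix` and `agreement`, the sample documents and categories, and the "predicted" matrix standing in for a classifier's output are all assumptions. It builds the n × k matrix Z from labeled documents and measures how closely a hypothetical classifier's matrix matches it, which is one simple reading of "the two functions match each other as closely as possible".

```python
from typing import Dict, List, Set


def decision_matrix(
    documents: List[str],
    categories: List[str],
    labels: Dict[str, Set[str]],
) -> List[List[int]]:
    """Build Z, where Z[i][j] is 1 if documents[i] belongs to categories[j], else 0."""
    return [
        [1 if c in labels.get(d, set()) else 0 for c in categories]
        for d in documents
    ]


def agreement(z_true: List[List[int]], z_pred: List[List[int]]) -> float:
    """Fraction of (document, category) entries on which two matrices agree."""
    total = sum(len(row) for row in z_true)
    if total == 0:
        raise ValueError("decision matrix is empty")
    matches = sum(
        a == b
        for row_true, row_pred in zip(z_true, z_pred)
        for a, b in zip(row_true, row_pred)
    )
    return matches / total


# Hypothetical data: three documents, two categories.
documents = ["d1", "d2", "d3"]
categories = ["sports", "politics"]
labels = {"d1": {"sports"}, "d2": {"sports", "politics"}, "d3": set()}

# Target matrix Z (the role of f) built from the known labels.
Z = decision_matrix(documents, categories, labels)

# Assumed output of some trained classifier (the role of the learned function);
# it misses the "politics" label on d2.
Z_pred = [[1, 0], [1, 0], [0, 0]]

score = agreement(Z, Z_pred)
print(Z)
print(score)
```

Because a document may belong to several categories, each row of Z can contain more than one 1 (as for d2 here) or none at all (as for d3). Entry-wise agreement is only one possible measure of closeness; the chapter excerpt does not say which measure the authors use.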

Basic Approaches Of Web Page Classification

Web page classification is a supervised machine-learning problem in which a web page is assigned to a category by a trained classifier. Web pages are written in HTML and are semi-structured in nature. They are connected through hyperlinks, forming a directed graph. The data on the web is abundant, heterogeneous, and rapidly changing.
