Feature Selection for Web Page Classification

K. Selvakuberan (Tata Consultancy Services, India), M. Indra Devi (Thiagarajar College of Engineering, India) and R. Rajaram (Thiagarajar College of Engineering, India)
DOI: 10.4018/978-1-60566-196-4.ch012

Abstract

The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, customer information, financial management, education, government, e-commerce, and much else. The Web contains a rich and dynamic collection of hyperlink information, and Web page access and usage information provides a rich source for data mining. Web pages are classified based on their content and/or the contextual information embedded in them. Because Web pages contain many irrelevant words, infrequent words, and stop words that reduce classifier performance, selecting relevant, representative features from a Web page is an essential preprocessing step. This supports secure access to the required information. Web access and usage information can be mined to predict whether the user accessing a Web page is authentic. This information may also be used to personalize the information presented to users and to preserve their privacy by hiding personal details. The challenge lies in selecting the features that represent the Web pages and in processing the details of the users who need that information. In this chapter we focus on feature selection, the issues it raises, and the most important feature selection techniques described and used by researchers.
Chapter Preview

Literature Survey

Rudy Setiono and Huan Liu (1997) proposed Chi2, a simple algorithm that uses the χ² statistic to discretize numeric attributes, turning them into discrete ones. It improves on simple methods such as equal-width and equal-frequency intervals. For every attribute, the χ² value is calculated for each pair of adjacent intervals, and the adjacent intervals with the lowest χ² value are combined. Principal Component Analysis, another approach, composes a small number of new features.
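The χ²-based merging of adjacent intervals can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it covers only the merge loop for a single attribute, the fixed `threshold` parameter and the function names are assumptions introduced here, and a zero expected count is replaced by a small constant (0.1) to avoid division by zero.

```python
from collections import Counter


def chi2_adjacent(counts_a, counts_b, classes):
    """Chi-square statistic for two adjacent intervals, given per-class counts."""
    rows = [counts_a, counts_b]
    row_totals = [sum(r.get(c, 0) for c in classes) for r in rows]
    col_totals = {c: counts_a.get(c, 0) + counts_b.get(c, 0) for c in classes}
    n = sum(row_totals)
    stat = 0.0
    for r, r_total in zip(rows, row_totals):
        for c in classes:
            expected = r_total * col_totals[c] / n
            if expected == 0:
                expected = 0.1  # assumption: small constant avoids division by zero
            stat += (r.get(c, 0) - expected) ** 2 / expected
    return stat


def merge_intervals(values, labels, threshold):
    """Start with one interval per distinct value, then repeatedly merge the
    adjacent pair with the lowest chi-square value until every remaining pair
    reaches the (hypothetical) threshold. Returns the interval lower bounds."""
    classes = sorted(set(labels))
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    # Each interval is (lower bound, class counts), sorted by attribute value.
    intervals = [(v, by_value[v]) for v in sorted(by_value)]
    while len(intervals) > 1:
        stats = [
            chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break  # all adjacent pairs differ enough in class distribution
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]


# Toy data (made up for illustration): class "a" for small values, "b" for large.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = list("aaaabbbb")
cut_points = merge_intervals(values, labels, threshold=2.706)
```

On this toy data the same-class neighbours are merged first, leaving two intervals that start at 1 and 10. An attribute whose values collapse into a single interval carries no class information under this scheme, which is how discretization can double as feature selection.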

Shounak Roychowdhury (2001) proposed a technique based on granular computing, which processes and expresses chunks of information called granules. It reduces the hypothesis search space and the storage required. Fuzzy-set-based feature elimination is used, with subset generation and subset evaluation as its two steps. For optimal feature selection, a brute-force technique is employed.
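The pairing of subset generation with subset evaluation in a brute-force search can be illustrated with a short sketch. It does not reproduce the fuzzy-set machinery of the cited work: the `evaluate` callback, the toy scoring rule, and the feature names are all hypothetical, chosen only to show the shape of an exhaustive search.

```python
from itertools import combinations


def brute_force_select(features, evaluate):
    """Enumerate every non-empty feature subset and keep the best-scoring one."""
    best_subset, best_score = (), float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):  # subset generation
            score = evaluate(subset)                 # subset evaluation
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset):
    """Hypothetical criterion: reward informative features, penalize size."""
    informative = {"title", "anchor_text"}
    return len(informative & set(subset)) - 0.1 * len(subset)


features = ["title", "anchor_text", "footer", "stop_words"]
best, score = brute_force_select(features, toy_score)
```

Because the search visits all 2^n - 1 non-empty subsets, it is only practical for a small number of candidate features, which is one reason to reduce the search space first.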

Catherine Blake and Wanda Pratt (2001) examined the relationship between the features used to represent text and the quality of the resulting model. They compared association rules based on three different representations: words, manually assigned keywords, and automatically assigned concepts. Bidirectional association rules based on concepts or keywords are more useful than those based on words. Each individual feature should be informative, and the features should be meaningful. The concept and keyword representations also require 90% fewer features than the words used in the medical diagnosis texts.
