Studying and Analysis of a Vertical Web Page Classifier Based on Continuous Learning Naïve Bayes (CLNB) Algorithm
H. A. Ali (Mansoura University, Egypt), Ali I. El Desouky (Mansoura University, Egypt) and Ahmed I. Saleh (Mansoura University, Egypt)
Copyright: © 2009
Web page classification is considered one of the most challenging research areas. Where the Web has a huge volume of unstructured and distributed documents that are related to a variety of domains; so, considering one base for the classification tasks will be extremely difficult. In addition, the Web is full of noise that will certainly harm the classifier performance especially if it is found in the classifier training data. Generally, it will be more valued to build domain-oriented classifiers (vertical classifi- ers) to classify pages related to a specific domain and compensate those classifiers with novel learning techniques to achieve better performance. The contribution of this paper is three edged; firstly, a novel learning technique called .Continuous Learning. is introduced. Secondly, the paper presents a new trend for Web page classification by presenting the domain-oriented classifiers (vertical classifiers). A new way of applying Bayes and K-Nearest Neighbor algorithms is introduced in order to build Domain Oriented Naïve Bayes (DONB) and Domain Oriented K-Nearest Neighbor (DOKNN) classifiers. The third contribution is combining both disciplines by introducing a novel classification strategy. Such strategy adds the continuous learning ability to Bayes theorem to build a Continuous learning domain oriented Naïve Bayes (CLNB) classifier. Where the overfitting problem has a great impact on most Web page classification techniques, continuous learning can be considered as a proposed solution. It allows the classifier to adapt itself continuously for achieving better performance. The proposed classifiers are tested; experimental results have shown that CLNB demonstrates significant performance improvement over both DONB and DOKNN where its accuracy goes beyond 94.1% after testing 1000 pages.