Classifying Web Pages by Aimed Nation Using Machine Learning

Boudheb Tarik (Djillali Liabes University of Sidi Bel-Abbes, Sidi Bel-Abbes, Algeria), Djelloul Daouadji Mahmoud (Djillali Liabes University of Sidi Bel-Abbes, Sidi Bel-Abbes, Algeria) and Elberrichi Zakaria (Djillali Liabes University of Sidi Bel-Abbes, Sidi Bel-Abbes, Algeria)
DOI: 10.4018/IJOCI.2017010102


Web page classification is the automatic assignment of predefined classes to web pages. It is one of the main applications of web mining. The authors aim to detect the nation that a web page targets, based on the page's content, which is an original application. In this paper, the authors propose several web mining approaches that use machine learning algorithms such as Naïve Bayes and Support Vector Machine to classify web pages by targeted nation. They present the stages of the procedure in detail. The best experimental result, obtained on an original corpus built by the authors, is a promising F-score of 85%.

1. Introduction

Recent advances in web and storage technologies have led to an explosion of data. Today, the Internet can be viewed as a colossal database made up mainly of unstructured data, such as natural language text, scripts, pictures, movies, and audio. As a result, information retrieval has become more complex and more difficult.

Web mining is the application of data mining techniques to extract knowledge from Web data, i.e., Web content, Web structure, and Web usage data (Lin & Chu, 2005).

  1. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications.

  2. Web structure mining is the process of discovering structure information from the Web.

  3. Web content mining is the process of extracting useful information from the contents of Web documents. Content data corresponds to the collection of facts a Web page was designed to convey to its users. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to Web content has been the most widely researched area. Issues addressed in text mining include topic discovery, extraction of association patterns, clustering of Web documents, and classification of Web pages. Research on this topic has drawn heavily on techniques developed in other disciplines, such as Information Retrieval (IR) and Natural Language Processing (NLP) (Lin & Chu, 2005).

In recent years, web mining has been a major research topic, and web page classification, also known as web page categorization, has been one of its main applications. We want to perform a particular kind of classification: by targeted nation, such as Algerian, Nigerian, Turkish, Chinese, American, Russian, or Canadian.
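To make the task concrete, the following is a minimal sketch of nation classification framed as text classification with multinomial Naïve Bayes, one of the algorithms named in the abstract. It is not the authors' implementation: the function names `train` and `predict`, the whitespace tokenization, the add-one smoothing, and the four toy training sentences are all assumptions made for illustration, and the toy data is invented rather than taken from the authors' corpus.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of documents
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()    # naive whitespace tokenization (assumption)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior under Naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                      # ignore unseen words
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy examples, for illustration only.
toy_docs = [
    ("algiers oran dinar news", "Algeria"),
    ("dinar exchange rate algiers", "Algeria"),
    ("ankara istanbul lira news", "Turkey"),
    ("istanbul travel lira prices", "Turkey"),
]
model = train(toy_docs)
```

With this toy model, `predict(model, "lira prices in ankara")` selects the Turkish class because those words occur only in the Turkish training sentences. A realistic system would need a far larger corpus and more careful feature extraction.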

The remainder of this paper is organized as follows: Section 2 discusses the problem and motivations. Section 3 presents the state of the art. Section 4 introduces our proposed approach, and Section 5 presents the experiments.


2. Problem Statement and Objective

The Indexed Web contains at least 4.84 billion pages (Kunder, 01 February 2016). The authors' aim is to help identify the targeted nation of a web page based on its content. The problem is complex, given the following constraints:

  • Web page content can be written in different languages, such as Arabic, French, Chinese, English, Turkish, or Russian.

  • Different countries may use the same language. For example, Arab countries (Algeria, Tunisia, Egypt, etc.) use the Arabic language.

  • Web page content can be generic.

  • A single country, such as Switzerland, can use multiple languages.

  • A web page can have a generic TLD ('.com', '.org', '.net', etc.).

  • A web site can be hosted anywhere in the world and can be created by someone from outside the targeted nation.
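The generic-TLD constraint above can be illustrated with a short sketch. It shows a URL-only heuristic that reads the top-level domain and returns a country hint only when the TLD is a country code. The function name `tld_hint` and the two lookup tables are hypothetical and deliberately incomplete, and this is not part of the authors' method. The point is that such a heuristic gives no answer for generic TLDs, which is one reason content-based classification is needed.

```python
from urllib.parse import urlparse

# Illustrative subsets only; the real lists are much longer.
GENERIC_TLDS = {"com", "org", "net", "info", "biz"}
COUNTRY_TLDS = {
    "dz": "Algeria",
    "tn": "Tunisia",
    "eg": "Egypt",
    "ch": "Switzerland",
    "tr": "Turkey",
}

def tld_hint(url):
    """Return the country suggested by the URL's TLD, or None if uninformative."""
    host = urlparse(url).hostname or ""      # hostname is None without a scheme
    tld = host.rsplit(".", 1)[-1].lower()    # text after the last dot
    if tld in GENERIC_TLDS:
        return None                          # generic TLDs carry no nation signal
    return COUNTRY_TLDS.get(tld)             # None for TLDs outside the table
```

For example, `tld_hint("http://www.example.dz/index.html")` yields a hint for Algeria, while `tld_hint("https://example.com/")` yields `None`. Even a country-code TLD is only a hint, since a site under one country's TLD may still target readers elsewhere.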

The main motivation of this work is that it will allow:

  • Designing a focused crawler that indexes web pages by targeted nation.

  • Increasing the efficiency of search engines for queries about a specific nation.

  • Building a database of web pages indexed by nation for dedicated research.

  • Broadcasting advertisements targeted by country.

  • Performing various country-specific studies.

  • Filtering the web pages of a particular nation.

  • Etc.
