Web Mining in Thematic Search Engines

Web Mining in Thematic Search Engines

Massimiliano Caramia (Istituto per le Applicazioni del Calcolo IAC-CNR, Italy) and Giovanni Felici (Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy)
Copyright: © 2005 |Pages: 5
DOI: 10.4018/978-1-59140-557-3.ch226
OnDemand PDF Download:


The recent improvements of search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages with a high quality for the user). Such a trend is motivated by the natural evolution of Internet users who are now more selective in their choice of the search tool and may be willing to pay the price of providing extra feedback to the system and to wait more time for their queries to be better matched. In this framework, several have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic, see, e.g., Getoor, Senator, Domingos & Faloutsos, 2003), and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and the pages made available are restricted to a finite number of topics, and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then we construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with higher scores. Operating on these subsets with a genetic algorithm, we identify the subset with a good overall score and a high internal dissimilarity. This provides the user with a few nonduplicated pages that represent more correctly the structure of the initial set of pages. Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.

Complete Chapter List

Search this Book: