Next Generation Search Engines: Advanced Models for Information Retrieval


Christophe Jouis (Universite Paris III, France and LIP6-Universite Pierre et Marie Curie, France), Ismail Biskri (Universite du Quebec A Trois Rivieres, Canada), Jean-Gabriel Ganascia (LIP6 and CNRS-Universite Pierre et Marie Curie, France) and Magali Roux (LIP6 and CNRS-Universite Pierre et Marie Curie, France)
Release Date: March, 2012|Copyright: © 2012 |Pages: 560
ISBN13: 9781466603301|ISBN10: 1466603305|EISBN13: 9781466603318|DOI: 10.4018/978-1-4666-0330-1


Recent technological progress in computer science, Web technologies, and the constantly evolving information available on the Internet has drastically changed the landscape of search and access to information. Current search engines employ advanced techniques involving machine learning, social networks, and semantic analysis.

Next Generation Search Engines: Advanced Models for Information Retrieval is intended for scientists and decision-makers who wish to gain working knowledge about search in order to evaluate available solutions and to dialogue with software and data providers. The book aims to provide readers with a better idea of the new trends in applied research.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Artificial Intelligence (AI) Enabled Search Engines
  • Clustering
  • Context-aware system, Mobile search engine
  • Crosslingual search
  • Customisation and Information retrieval
  • Electronic discovery and legal search
  • Human-centred search, visualization
  • Index Design, Index Compression
  • Information seeking and use, information behaviour
  • Metadata, e-sciences
  • Mobile search, Personalization
  • Photo database
  • Quality measurement, retrieval effectiveness
  • Question answering
  • Scalability, Distributed Information Retrieval
  • Semantic Search, Linguistic Ontologies
  • Text Mining
  • Web browsers, customization

Reviews and Testimonials

"This book is intended for scientists and decision-makers who wish to gain working knowledge of searches in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet or Web server designers, developers and administrators who wish to understand how to integrate search technology into their applications according to their needs. This book is further designed for designers, developers and administrators of databases, groupware applications and document management systems (EDM), as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use, and how to set up new information systems. Lastly, this book is aimed at all professionals in technology or competitive intelligence and, more generally, the specialists of the information market."

– Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France; Ismaïl Biskri, University of Quebec at Trois Rivieres, Canada; Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France; and Magali Roux, INIST and LIP6, (UPMC &

This collection of 20 essays explores next-generation search engines, covering topics such as indexing, metadata, semantic models and search-engine interfaces. The essays offer a truly international perspective on developments in next-generation search engines. This book is comprehensively researched and detailed and is recommended for those involved in the development and implementation of information-retrieval models. It is also likely to be of interest to those wishing to learn more about next-generation search engines and the current trends in information-retrieval models.

– The Australian Library Journal, Vol. 62, No. 2 - Anne Sara, Sydney

This work is abundant in innovative ideas, new concepts, and real-world practices. The realization or implementation of the perspectives and models included in the book may result in real advances and substantive changes in information markets. Advanced students, academics and researchers, as well as knowledge workers, information and computer scientists and web designers, can benefit from reading this work.

– Alireza Isfandyari-Moghaddam, Islamic Azad University, Hamedan Branch, Online Information Review, Vol. 37, No. 3

Table of Contents and List of Contributors

Chapter 1
Indexing the World Wide Web: The Journey So Far
Abhishek Das, Ankit Jain
In this chapter, the authors describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and...

Chapter 2
Decentralized Search and the Clustering Paradox in Large Scale Information Networks
Weimao Ke
Amid the rapid growth of information today is the increasing challenge for people to navigate its magnitude. Dynamics and heterogeneity of large...

Chapter 3
Metadata for Search Engines: What can be learned from e-Sciences?
Magali Roux
E-sciences are data-intensive sciences that make a large use of the Web to share, collect, and process data. In this context, primary scientific...

Chapter 4
Crosslingual Access to Photo Databases
Christian Fluhr
This paper is about search of photos in photo databases of agencies which sell photos over the Internet. The problem is far from the behavior of...

Chapter 5
Fuzzy Ontologies Building Platform for Semantic Web: FOB Platform
Hanêne Ghorbel, Afef Bahri, Rafik Bouaziz
The unstructured design of Web resources favors human comprehension, but makes difficult the automatic exploitation of the contents of these...

Chapter 6
Searching and Mining with Semantic Categories
Brahim Djioua, Jean-Pierre Desclés, Motasem Alrahabi
A new model is proposed to retrieve information by building automatically a semantic metatext structure for texts that allow searching and...

Chapter 7
Semantic Models in Information Retrieval
Edmond Lassalle, Emmanuel Lassalle
Robertson and Spärck Jones pioneered experimental probabilistic models (Binary Independence Model) with both a typology generalizing the Boolean...

Chapter 8
The Use of Text Mining Techniques in Electronic Discovery for Legal Matters
Michael W. Berry, Reed Esau, Bruce Kiefer
Electronic discovery (eDiscovery) is the process of collecting and analyzing electronic documents to determine their relevance to a legal matter....

Chapter 9
Intelligent Semantic Search Engines for Opinion and Sentiment Mining
Mona Sleem-Amer, Ivan Bigorgne, Stéphanie Brizard, Leeley Daio Pires Dos Santos, Yacine El Bouhairi, Bénédicte Goujon, Stéphane Lorin, Claude Martineau, Loïs Rigouste, Lidia Varga
Over the last years, research and industry players have become increasingly interested in analyzing opinions and sentiments expressed on the social...

Chapter 10
Human-Centred Web Search
Orland Hoeber
People commonly experience difficulties when searching the Web, arising from an incomplete knowledge regarding their information needs, an inability...

Chapter 11
Extensions of Web Browsers useful to Knowledge Workers
Sarah Vert
This chapter focuses on the Internet working environment of Knowledge Workers through the customization of the Web browser on their computer. Given...

Chapter 12
Next Generation Search Engine for the Result Clustering Technology
Lin-Chih Chen
Result clustering has recently attracted a lot of attention to provide the users with a succinct overview of relevant search results than...

Chapter 13
Using Association Rules for Query Reformulation
Ismaïl Biskri, Louis Rompré
In this paper the authors will present research on the combination of two methods of data mining: text classification and maximal association rules....

Chapter 14
Question Answering
Ivan Habernal, Miloslav Konopík, Ondrej Rohlík
Question Answering is an area of information retrieval with the added challenge of applying sophisticated techniques to identify the complex...

Chapter 15
Finding Answers to Questions, in Text Collections or Web, in Open Domain or Specialty Domains
Brigitte Grau
This chapter is dedicated to factual question answering, i.e., extracting precise and exact answers to question given in natural language from...

Chapter 16
Context-Aware Mobile Search Engine
Jawad Berri, Rachid Benlamri
Exploiting context information in a web search engine helps fine-tuning web services and applications to deliver custom-made information to end...

Chapter 17
Spatio-Temporal Based Personalization for Mobile Search
Ourdia Bouidghaghen, Lynda Tamine
The explosion of the information available on the Internet has made traditional information retrieval systems, characterized by one size fits all...

Chapter 18
Studying Web Search Engines from a User Perspective: Key Concepts and Main Approaches
Stéphane Chaudiron, Madjid Ihadjadene
This chapter shows that the wider use of Web search engines, reconsidering the theoretical and methodological frameworks to grasp new information...

Chapter 19
Artificial Intelligence Enabled Search Engines (AIESE) and the Implications
Faruk Karaman
Search engines are the major means of information retrieval over the Internet. People’s dependence on them increases over time as SEs introduce new...

Chapter 20
A Framework for Evaluating the Retrieval Effectiveness of Search Engines
Dirk Lewandowski
This chapter presents a theoretical framework for evaluating next generation search engines. The author focuses on search engines whose results...

About the Contributors



Scientific and economic organizations are confronted with handling an abundance of strategic information in their domain activities. One main challenge is to be able to find the right information quickly and accurately. In order to do so, organizations must master information access: getting relevant query results that are organized, sorted, and actionable.

As noted by Mukhopadhyay and Mukhopadhyay (2004), almost everyone agrees that, given the current state of the art in Internet search engine technology, extracting information from the Web is an art in itself. Almost all commercial search engines use classical keyword-based methods for information retrieval (IR): they try to match user-specified patterns (i.e., queries) against the texts of all documents in their database and then return the documents that contain terms matching the query. Such methods are quite effective for well-controlled collections, such as bibliographic CD-ROMs or handcrafted scientific information repositories. Unfortunately, the organization of the Internet has not been rationally supervised; it has evolved spontaneously and therefore cannot be treated as a well-controlled collection. It contains a great deal of garbage and redundant information and, perhaps even more importantly, it does not rely on any underlying semantic structure intended to facilitate navigation.

In addition, some of the current issues result from inappropriate query construction. The queries that users submit to search engines are often too general (like “water sources” or “capitals”), and this produces millions of returned documents. The results of interest to users are probably among them, but they cannot be distinguished from the mass, and it appears impossible to bring them to the user's attention. One hundred documents are generally regarded as the maximum amount of information that can be useful to users in such situations.

On the other hand, some documents cannot be retrieved because the specified pattern does not match exactly. This can be caused by inflection in some languages, or by confusion introduced by synonyms, homonyms, and complex idiomatic structures (e.g., the English word “Mike” is often given as an example, as it can be a male name or a shortened form of the noun “microphone”). Most search engines also have very poor user interfaces. Computer-aided query construction is very rare, and the presentation of search results concentrates mostly on individual documents without providing any general overview of the retrieved data, which is crucial when the number of returned documents is huge. A last group of problems comes from the nature of the information stored on the Internet. Search tools must deal not only with hypertext documents (in the form of WWW pages) but also with text repositories (message archives, e-books, etc.), FTP and Usenet servers, and many sources of non-textual information such as audio, video, and interactive content.

Recent technological progress in computer science, Web technologies, and the constantly evolving information available on the Internet has drastically changed the landscape of search and access to information. Web search has evolved significantly in recent years. In the beginning, web search engines such as Google and Yahoo! provided search only over text documents. Aggregated search was one of the first steps beyond text search, and was the beginning of a new era for information seeking and retrieval. These days, new web search engines support aggregated search over a number of verticals, and blend different types of documents (e.g., images, videos) in their search results. New search engines employ advanced techniques involving machine learning, computational linguistics and psychology, user interaction and modeling, information visualization, Web engineering, artificial intelligence, distributed systems, social networks, statistical analysis, semantic analysis, and technologies that operate over query sessions.

Documents no longer exist on their own; they are connected to other documents, they are associated with users and their position in a social network, and they can be mapped onto a variety of ontologies. Similarly, retrieval tasks have become more interactive and are solidly embedded in a user's geospatial, social, and historical context. It is conjectured that new breakthroughs in information retrieval will not come from smarter algorithms that better exploit existing information sources, but from new retrieval algorithms that can intelligently use and combine new sources of contextual metadata.

With the rapid growth of web-based applications, such as search engines, Facebook, and Twitter, the development of effective and personalized information retrieval techniques and of user interfaces is essential. The amount of shared information and the number of social networks have also grown considerably, requiring metadata for new sources of information, like Wikipedia and ODP. These metadata have to provide classification information for a wide range of topics, as well as for social networking sites like Twitter and Facebook, each of which provides additional preferences, tagging information, and social contexts. Due to the explosion of social networks and other metadata sources, it is an opportune time to identify ways to exploit such metadata in IR tasks such as user modeling, query understanding, and personalization, to name a few. Although the use of traditional metadata such as HTML text, web page titles, and anchor text is fairly well understood, the use of category information, user behavior data, and geographical information is just beginning to be studied.


The main goal of this book is to transfer new research results from the fields of advanced computer science and information science to the design of new search engines. Readers will gain a better idea of the new trends in applied research. Obtaining relevant, organized, sorted, and workable answers from a search, to name but a few desired qualities, is becoming a daily need for enterprises and organizations and, to a greater extent, for anyone. It does not consist of getting access to structured information as in standard databases; nor does it consist of searching for information strictly by way of a combination of keywords. It goes far beyond that. Whatever its modality, the information sought should be identified by the topics it contains, that is to say by its textual, audio, video, or graphical contents. This is not a new issue. However, recent technological advances have completely changed the techniques being used. New Web technologies, the emergence of Intranet systems, and the abundance of information on the Internet have created the need for efficient search and information access tools.


This book is intended for scientists and decision-makers who wish to gain working knowledge of searches in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet or Web server designers, developers and administrators who wish to understand how to integrate search technology into their applications according to their needs. This book is further designed for designers, developers and administrators of databases, groupware applications and document management systems (EDM), as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use, and how to set up new information systems. Lastly, this book is aimed at all professionals in technology or competitive intelligence and, more generally, the specialists of the information market.


The book is divided into four sections.

Section 1 is “Indexation”. The goal of automatic indexing is to establish an index for a set of documents that facilitates future access to the documents and to their content. Usually, an index is composed of a list of descriptors, each of them associated with a list of the documents and/or parts of documents to which it refers. In addition, these references may be weighted. When answering a user's query, the system looks for a list of answers whose index entries are as close as possible to the request. As a consequence, indexation can be seen as a required preliminary to intelligent information retrieval, since it pre-structures textual data according to topic, domain, keyword, or center of interest.
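
The index structure described above (descriptors, each pointing to a weighted list of documents) can be sketched in a few lines of Python. This is only an illustrative sketch and is not taken from the book: the names `build_index` and `search`, the sample documents, the use of lowercased words as descriptors, and the use of raw term frequency as the weight are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each descriptor (here: a lowercased word) to {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        for term, count in counts.items():
            # Assumption: the weight of a reference is the raw term frequency.
            index[term][doc_id] = count
    return index


def search(index, query):
    """Rank documents by the summed weights of the query terms they contain."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "search engines index web documents",
    "d2": "an index maps descriptors to documents",
    "d3": "data mining extracts patterns from data",
}
index = build_index(docs)
results = search(index, "data mining")
```

Here `index["data"]` holds the single weighted reference `{"d3": 2}`, and the query "data mining" retrieves only document `d3`. A production index would typically normalize descriptors (for example by stemming) and use a more refined weighting scheme than raw counts.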

Section 2 is “Data Mining for Information Retrieval”. Data Mining (i.e., Knowledge Discovery from Databases) is the process of automatically extracting meaningful, useful, previously unknown, and ultimately comprehensible patterns from large data sets. Data mining is a relatively young and interdisciplinary field that combines methods from statistics and artificial intelligence with database management. With the considerable increase in the processing power, storage capacity, and inter-connectivity of computer technology, in particular with grid computing, data mining is now seen by modern business as an increasingly important field for transforming unprecedented quantities of digital data into new knowledge that provides a significant competitive advantage. This is now a large part of what people refer to as business intelligence strategy. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The growing consensus that data mining can bring real added value has led to an explosion in demand for novel data mining technologies.

Section 3 is “Interface”. The term "interface" refers to the part of the search engine in which (1) the user formulates a request and (2) the user reads the results. The interface is considered from four viewpoints: Human-centered Web Search, Personalization, Question/Answering, and Mobile Search Engines. “Human-centered Web Search” is understood to be how Web search engines help people to find the information they are seeking. “Personalization” takes keywords from the user as an expression of their information need, but also uses additional information about the user (such as their preferences, community, location, or history) to assist in determining the relevance of pages. “Question/Answering” addresses the problem of finding answers to questions posed in natural language; given a query in natural language, the task aims at finding one or more concise answers in the form of sentences or phrases. “Mobile Search Engines” may be defined as the combination of search technologies and knowledge about the user's context in a mobile environment into a single framework, in order to provide the most appropriate answer to the user's information needs.

Finally, Section 4 is “Evaluation”. Evaluation means two things: (1) tracing users' behaviors, with special attention to the concept of “information practice” and other related concepts such as “use”, “activity”, and “behavior”, which are widely used in the literature but not always strictly defined, the aim being to place users and their needs at the center of the design process; (2) evaluating next generation search engines against four main criteria for improving the quality of the search results: index quality, quality of the results, quality of search features, and search engine usability.

Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France
Ismaïl Biskri, University of Quebec at Trois Rivieres, Canada
Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France
Magali Roux, INIST and LIP6, (UPMC & CNRS), France


Mukhopadhyay, B., & Mukhopadhyay, S. (2004, February 11-13). Data mining techniques for information retrieval. In Proceedings of the 2nd International Conference of the Convention on Automation of Libraries in Education and Research Institution, New Delhi, India (p. 506).

Author(s)/Editor(s) Biography

Christophe Jouis is assistant professor at the University Paris Sorbonne Nouvelle, France. He received a Ph.D. in Applied Mathematics at the “Ecole des Hautes Etudes en Sciences Sociales” (EHESS) and CAMS (“Centre d’Analyse et de Mathématiques Sociales”), OPTION: Science, Logic, Linguistics. From 2000 to 2004 he was associate professor in the Department of Computer Science at the University of Quebec at Trois-Rivieres (Canada), under the direction of Professor Ismail Biskri. In 2005, he joined the LIP6 (“Laboratoire d'Informatique de Paris 6”), affiliated with the University Pierre et Marie Curie (UPMC) and the CNRS (France). Within the LIP6, he is currently a member of the research team ACASA (“Cognitive Agents and Automated Symbolic Learning”), led by Professor Jean-Gabriel Ganascia. His research interests are in natural language processing (NLP), cognitive sciences, ontology, typicality, data mining and information retrieval.
Ismaïl Biskri is full professor in computational linguistics and artificial intelligence at the computer science department of the University of Quebec at Trois-Rivières. He is also associate professor at the Computer Science Department of the University of Quebec at Montreal. He is a researcher at the LAMIA Laboratory. His research interests concern aspects of fundamental research on the syntactic and functional semantic analysis of natural languages using models of Categorial Grammars and combinatory logic. He also works on specific issues in text-mining, information retrieval, and terminology. His research is funded by the Canadian granting agencies FQRSC, SSHRC, and NSERC.
Jean-Gabriel Ganascia is presently Professor of computer science at Paris University Pierre et Marie Curie (Paris VI) and researcher at the computer science laboratory of Paris VI University (LIP6) where he leads the ACASA (“Cognitive Agents and Automated Symbolic Learning”) team. He originally worked on symbolic machine learning and knowledge engineering. His “thèse d'état”, defended in 1987, was a pioneering work on the algebraic framework on which the association rule extraction techniques are based. Today, his main scientific interests cover different areas of artificial intelligence: scientific discovery, cognitive modeling, data-mining, and digital humanities. He has published more than 350 scientific papers in conference proceedings, journals, and books. In the past, Jean-Gabriel Ganascia was also program leader in the CNRS executive from 1988 to 1992 before moving to direct the Cognitive Science Coordinated Research Program and head the Cognition Sciences Scientific Interest Group from 1993 until 2000.
Magali Roux is a CNRS Research Director involved in the development and administration of programs and courses in e-Biology. Her research interests span a wide range of domains centered on knowledge organization and data management in Medical Biology, Molecular Biology and, recently, Systems Biology in the context of e-Sciences. After obtaining her Ph.D. in Biochemistry from the University of the Mediterranean in 1979, she started as assistant professor at the Marseille University Hospital before being offered a post-doctoral position at Harvard University in the laboratory of Professor J. Strominger, where she provided one of the first bioinformatics analyses performed on DNA data. Since then, she has produced leading contributions in the fields of Immunology and Cancer. In the early 2000s, she moved from Experimental to Digital Biology to promote interoperability, data sharing, and re-use. Dr. Roux serves on numerous study panels and is currently active in a number of scientific societies.


Editorial Board

  • Berry, Michael W., University of Tennessee, USA
  • Biskri, Ismaïl, Professor, Université du Québec à Trois-Rivières, Québec, Canada
  • Boughanem, Mohand, Université Paul Sabatier, France
  • Bourdaillet, Julien, Université de Montréal, Québec, Canada
  • Bourdoncle, François, EXALEAD, France
  • Chailloux, Jérôme, ERCIM (European Research Consortium for Informatics and Mathematics), France
  • Chaudiron, Stéphane, Université Lille 3, France
  • Constant, Patrick, PERTIMM, France
  • Das, Abhishek, Google Inc., USA
  • Desclés, Jean-Pierre, Université Paris-Sorbonne, France
  • Dulong, Tanneguy, ARISEM (THALES), France
  • Emam, Ossama, Cairo HLT Group IBM, USA
  • Ferret, Olivier, LI2CM/CEA (Laboratoire d’Ingénierie de la Connaissance Multimédia Multilingue/Commissariat à l'Énergie Atomique), France
  • Fluhr, Christian, Cedege/Hossur'Tech, France
  • Fouladi, Karan, LIP6/UPMC-CNRS (Laboratoire d’Informatique de Paris 6/Université Pierre et Marie Curie and CNRS), France
  • Gallinari, Patrick, LIP6 (UPMC/CNRS), France
  • Ganascia, Jean-Gabriel, LIP6 (UPMC/CNRS), France
  • Gargouri, Faiez, ISIM (Institut Supérieur d'Informatique et de Multimédia de Sfax), Tunisia
  • Ghitalla, Frank, INIST (Institut de l'Information Scientifique et Technique), France
  • Grau, Brigitte, LIMSI/CNRS (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur), France
  • Grefenstette, Gregory, EXALEAD, France
  • Habib, Bassel, LIP6 (UPMC-CNRS), France
  • Jaziri, Wassim, ISIM, Sfax, Tunisia
  • Huot, Charles, TEMIS Group, France
  • Jain, Ankit, Google Inc., USA
  • Jouis, Christophe, Université Paris Sorbonne Nouvelle and LIP6 (UPMC-CNRS), France
  • Lassalle, Edmond, Orange Labs (France Telecom), France
  • Le Borgne, Hervé, LI2CM (CEA), France
  • Lucas, Philippe, TECHNOLOGIES group (Spirit software), France
  • Meng, Fan, University of Michigan, USA
  • Meunier, Jean-Guy, UQAM (Université du Québec à Montréal), Québec, Canada
  • Moulinier, Isabelle, Thomson Reuters, USA
  • Mustafa El-Hadi, IDIST, Université Lille 3, France
  • Nie, Jian-Yun, Université de Montréal, Montreal, Quebec, Canada
  • Piwowarski, Benjamin, Information Retrieval Group, University of Glasgow, UK
  • Poupon, Anne, Equipe Biologie et Bioinformatique des Systèmes de Signalisation Physiologie du Comportement et de la Reproduction, France
  • Riad, Mokadem, IRIT (Institut de Recherche en Informatique de Toulouse), France
  • Robertson, Stephen, Microsoft Research Laboratory in Cambridge, UK
  • Rocca-Serra, Philippe, The European Bioinformatics Institute, EMBL Outstation - Hinxton, Cambridge, UK
  • Roux, Magali, LIP6 (UPMC-CNRS) and INIST, France
  • Shafei, Bilal, ITS – BBE department, Columbia University, USA and An-Najah National University, Palestine
  • Sansone, Susanna-Assunta, The European Bioinformatics Institute, EMBL Outstation - Hinxton, Cambridge, UK
  • Savoy, Jacques, Université de Neuchâtel, Switzerland
  • Smyth, Barry, Professor, University College Dublin, Ireland
  • Stroppa, Nicolas, Yahoo! Labs, France
  • Timimi, Ismaïl, IDIST, Université Lille 3, France
  • Velcin, Julien, ERIC Lab, University Lyon 2, France
  • Vinot, Romain, Yahoo! Labs in Paris, France
  • Wassermann, Renata, Computer Science Department, University of São Paulo, Brazil
  • Zitouni, Imed, IBM T.J. Watson Research Center, USA