Next Generation Search Engines: Advanced Models for Information Retrieval
Christophe Jouis (Université Paris III, France and LIP6-Université Pierre et Marie Curie, France), Ismaïl Biskri (Université du Québec à Trois-Rivières, Canada), Jean-Gabriel Ganascia (LIP6 and CNRS-Université Pierre et Marie Curie, France) and Magali Roux (LIP6 and CNRS-Université Pierre et Marie Curie, France)
Release Date: March, 2012. Copyright © 2012. 560 pages.
DOI: 10.4018/978-1-4666-0330-1, ISBN13: 9781466603301, ISBN10: 1466603305, EISBN13: 9781466603318
Recent technological progress in computer science, Web technologies, and the constantly evolving information available on the Internet has drastically changed the landscape of search and access to information. Current search engines employ advanced techniques involving machine learning, social networks, and semantic analysis.
Next Generation Search Engines: Advanced Models for Information Retrieval is intended for scientists and decision-makers who wish to gain working knowledge about search in order to evaluate available solutions and to dialogue with software and data providers. The book aims to provide readers with a better idea of the new trends in applied research.
Table of Contents and List of Contributors
Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA)
In this chapter, the authors describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexin...
Christian Fluhr (GEOL Semantics, France)
This paper is about search of photos in photo databases of agencies which sell photos over the Internet. The problem is far from the behavior of photo databases mana...
Hanêne Ghorbel (University of Sfax, Tunisia), Afef Bahri (University of Sfax, Tunisia), Rafik Bouaziz (University of Sfax, Tunisia)
The unstructured design of Web resources favors human comprehension, but makes it difficult for machines to exploit the contents of these resources automatically. S...
Brahim Djioua (University of Paris-Sorbonne, France), Jean-Pierre Desclés (University of Paris-Sorbonne, France), Motasem Alrahabi (University of Paris-Sorbonne, France)
A new model is proposed to retrieve information by automatically building a semantic metatext structure for texts that allow searching and extracting discourse and...
Edmond Lassalle (Orange Labs, France), Emmanuel Lassalle (Université Paris 7, France)
Robertson and Spärck Jones pioneered experimental probabilistic models (Binary Independence Model) with both a typology generalizing the Boolean model, a frequency c...
Michael W. Berry (University of Tennessee, USA), Reed Esau (Catalyst Repository Systems, USA), Bruce Kiefer (Catalyst Repository Systems, USA)
Electronic discovery (eDiscovery) is the process of collecting and analyzing electronic documents to determine their relevance to a legal matter. Office technology h...
Mona Sleem-Amer (Pertimm, France), Ivan Bigorgne (Lutin, France), Stéphanie Brizard (Arisem, France), Leeley Daio Pires Dos Santos (EDF, France), Yacine El Bouhairi (Thales, France), Bénédicte Goujon (Thales, France), Stéphane Lorin (Thales, France), Claude Martineau (LIGM, France), Loïs Rigouste (Pertimm, France), Lidia Varga (LIGM, France)
Over the last years, research and industry players have become increasingly interested in analyzing opinions and sentiments expressed on the social media web for pro...
Orland Hoeber (Memorial University of Newfoundland, Canada)
People commonly experience difficulties when searching the Web, arising from an incomplete knowledge regarding their information needs, an inability to formulate acc...
Sarah Vert (Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg)
This chapter focuses on the Internet working environment of Knowledge Workers through the customization of the Web browser on their computer. Given that a Web browse...
Ismaïl Biskri (University of Quebec at Trois-Rivieres, Canada), Louis Rompré (University of Quebec at Montreal, Canada)
In this paper the authors will present research on the combination of two methods of data mining: text classification and maximal association rules. Text classificat...
Ivan Habernal (University of West Bohemia, Czech Republic), Miloslav Konopík (University of West Bohemia, Czech Republic), Ondrej Rohlík (University of West Bohemia, Czech Republic)
Question Answering is an area of information retrieval with the added challenge of applying sophisticated techniques to identify the complex syntactic and semantic r...
Jawad Berri (College of Computing and Information Sciences, King Saud University, Saudi Arabia), Rachid Benlamri (Lakehead University, Canada)
Exploiting context information in a web search engine helps fine-tuning web services and applications to deliver custom-made information to end users. While context,...
Ourdia Bouidghaghen (IRIT-CNRS-University Paul Sabatier of Toulouse, France), Lynda Tamine (IRIT-CNRS-University Paul Sabatier of Toulouse, France)
The explosion of the information available on the Internet has made traditional information retrieval systems, characterized by one size fits all approaches, less ef...
Reviews and Testimonials
"This book is intended for scientists and decision-makers who wish to gain working knowledge of searches in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet or Web server designers, developers and administrators who wish to understand how to integrate search technology into their applications according to their needs. This book is further designed for designers, developers and administrators of databases, groupware applications and document management systems (EDM), as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use, and how to set up new information systems. Lastly, this book is aimed at all professionals in technology or competitive intelligence and, more generally, the specialists of the information market."
– Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France; Ismaïl Biskri, University of Quebec at Trois Rivieres, Canada; Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France; and Magali Roux, INIST and LIP6, (UPMC &
- Artificial Intelligence (AI) Enabled Search Engines
- Context-aware system, Mobile search engine
- Crosslingual search
- Customisation and Information retrieval
- Electronic discovery and legal search
- Human-centred search, visualization
- Index Design, Index Compression
- Information seeking and use, information behaviour
- Metadata, e-sciences
- Mobile search, Personalization
- Photo database
- Quality measurement, retrieval effectiveness
- Question answering
- Scalability, Distributed Information Retrieval
- Semantic Search, Linguistic Ontologies
- Text Mining
- Web browsers, customization
NEEDS AND REQUIREMENTS FOR INFORMATION RETRIEVAL
Scientific and economic organizations are confronted with handling an abundance of strategic information in their domain activities. One main challenge is to be able to find the right information quickly and accurately. In order to do so, organizations must master information access: getting relevant query results that are organized, sorted, and actionable.
As noted by Mukhopadhyay and Mukhopadhyay (2004), almost everyone agrees that, given the current state of the art in Internet search engine technology, extracting information from the Web is an art in itself. Almost all commercial search engines use classical keyword-based methods for information retrieval (IR). That is, they try to match user-specified patterns (i.e., queries) against the texts of all documents in their database and then return the documents that contain terms matching the query. Such methods are quite effective for well-controlled collections, such as bibliographic CD-ROMs or handcrafted scientific information repositories. Unfortunately, the organization of the Internet has not been rationally supervised; it has evolved spontaneously and therefore cannot be treated as a well-controlled collection. It contains a lot of garbage and redundant information and, perhaps even more importantly, it does not rely on any underlying semantic structure intended to facilitate navigation.
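The keyword-based matching described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular search engine: the `tokenize` and `keyword_search` helpers and the three sample documents are assumptions introduced here for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query, documents):
    """Return the ids of documents that contain every query term exactly."""
    terms = set(tokenize(query))
    if not terms:
        return []
    return [
        doc_id
        for doc_id, text in documents.items()
        if terms <= set(tokenize(text))
    ]


# Hypothetical collection used only for this illustration.
documents = {
    "d1": "Water sources in the desert",
    "d2": "Capitals of European countries",
    "d3": "Mineral water source maps",
}

results = keyword_search("water sources", documents)
print(results)
```

In this sketch, document `d3` is not returned for the query "water sources" because it contains "source" rather than "sources": exact term matching has no notion of word variants.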
In addition, some of the current issues result from inappropriate query construction. The queries that users submit to search engines are often too general (like “water sources” or “capitals”), and this produces millions of returned documents. The results that are of interest to users are probably among them, but they cannot be distinguished from the mass, and it is practically impossible to bring them to the user's attention. One hundred documents are generally regarded as the maximum amount of information that can be useful to users in such situations.
On the other hand, some documents cannot be retrieved because the specified pattern does not match exactly. This can be caused by inflection in some languages, or by confusion introduced by synonyms, homonyms, and complex idiom structures (e.g., in English the word “Mike” is often given as an example, as it can be a male name or a shortened form of the noun “microphone”). Most search engines also have very poor user interfaces. Computer-aided query construction is very rare, and the presentation of search results concentrates mostly on individual documents without providing any general overview of the retrieved data, which is crucial when the number of returned documents is huge. A last group of problems comes from the nature of the information stored on the Internet. Search tools must deal not only with hypertext documents (in the form of WWW pages) but also with text repositories (message archives, e-books, etc.), FTP and Usenet servers, and many sources of non-textual information such as audio, video, and interactive content.
Recent technological progress in computer science, Web technologies, and constantly evolving information available on the Internet has drastically changed the landscape of search and access to information. Web search has evolved significantly in recent years. In the beginning, web search engines such as Google and Yahoo! provided search only over text documents. Aggregated search was one of the first steps beyond text search, and was the beginning of a new era for information seeking and retrieval. These days, new web search engines support aggregated search over a number of verticals, and blend different types of documents (e.g., images, videos) in their search results. New search engines employ advanced techniques involving machine learning, computational linguistics and psychology, user interaction and modeling, information visualization, Web engineering, artificial intelligence, distributed systems, social networks, statistical analysis, semantic analysis, and technologies over query sessions.
Documents no longer exist on their own; they are connected to other documents, they are associated with users and their position in a social network, and they can be mapped onto a variety of ontologies. Similarly, retrieval tasks have become more interactive and are solidly embedded in a user's geospatial, social, and historical context. It is conjectured that new breakthroughs in information retrieval will not come from smarter algorithms that better exploit existing information sources, but from new retrieval algorithms that can intelligently use and combine new sources of contextual metadata.
With the rapid growth of web-based applications, such as search engines, Facebook, and Twitter, the development of effective and personalized information retrieval techniques and user interfaces is essential. The amount of shared information and the number of social networks have also grown considerably, requiring metadata for new sources of information, like Wikipedia and ODP. These metadata have to provide classification information for a wide range of topics, as well as for social networking sites like Twitter and Facebook, each of which provides additional preferences, tagging information, and social contexts. Due to the explosion of social networks and other metadata sources, it is an opportune time to identify ways to exploit such metadata in IR tasks such as user modeling, query understanding, and personalization, to name a few. Although the use of traditional metadata such as HTML text, web page titles, and anchor text is fairly well understood, the use of category information, user behavior data, and geographical information is just beginning to be studied.
OBJECTIVES OF THE BOOK
The main goal of this book is to transfer new research results from the fields of advanced computer science and information science to the design of new search engines. Readers will gain a better idea of the new trends in applied research. Obtaining answers from a search that are relevant, organized, sorted, and workable, to name but a few qualities, is becoming a daily need for enterprises and organizations and, to a greater extent, for anyone. This does not consist of getting access to structured information as in standard databases; nor does it consist of searching for information strictly by way of a combination of keywords. It goes far beyond that. Whatever its modality, the information sought should be identified by the topics it contains, that is to say by its textual, audio, video, or graphical contents. This is not a new issue. However, recent technological advances have completely changed the techniques being used. New Web technologies, the emergence of Intranet systems, and the abundance of information on the Internet have created the need for efficient search and information access tools.
This book is intended for scientists and decision-makers who wish to gain working knowledge of searches in order to evaluate available solutions and to dialogue with software and data providers. It also targets intranet or Web server designers, developers and administrators who wish to understand how to integrate search technology into their applications according to their needs. This book is further designed for designers, developers and administrators of databases, groupware applications and document management systems (EDM), as well as directors of libraries or documentation centers who seek a deeper understanding of the tools they use, and how to set up new information systems. Lastly, this book is aimed at all professionals in technology or competitive intelligence and, more generally, the specialists of the information market.
A BRIEF OVERVIEW OF THE ORGANIZATION OF THE BOOK
The book is divided into four sections.
Section 1 is “Indexation”. The goal of automatic indexing is to establish an index for a set of documents that facilitates future access to the documents and to their content. Usually, an index is composed of a list of descriptors, each of which is associated with a list of the documents and/or parts of documents to which it refers. In addition, these references may be weighted. To answer a user's query, the system looks for a list of answers whose index entries are as close as possible to the request. As a consequence, indexation can be seen as a required preliminary to intelligent information retrieval, since it pre-structures textual data according to topic, domain, keyword, or center of interest.
Section 2 is “Data Mining for Information Retrieval”. Data Mining (i.e., Knowledge Discovery from Databases) is the process of automatically extracting meaningful, useful, previously unknown, and ultimately comprehensible patterns from large data sets. Data mining is a relatively young and interdisciplinary field that combines methods from statistics and artificial intelligence with database management. With the considerable increase in processing power, storage capacity, and inter-connectivity of computer technology, in particular with grid computing, data mining is now seen by modern business as an increasingly important field for transforming unprecedented quantities of digital data into new knowledge that provides a significant competitive advantage. This is now a large part of what people refer to as business intelligence strategy. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The growing consensus that data mining can bring real added value has led to an explosion in demand for novel data mining technologies.
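As a rough illustration of an index made of descriptors with weighted references, the following Python sketch builds a small inverted index and ranks documents against a query. The weighting (raw term frequency) and the scoring (sum of matching weights) are simplifying assumptions chosen for brevity, not the methods used in the book's chapters; the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each descriptor (term) to {doc_id: weight}.

    Assumption: the weight is the raw term frequency in the document.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Score each document by summing the weights of matching query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical collection used only for this illustration.
docs = {
    "d1": "index compression for web search",
    "d2": "web search engines build an index; the index is compressed",
    "d3": "photo databases",
}

index = build_index(docs)
ranked = search(index, "index search")
print(ranked)
```

Because the index is built once, ahead of any query, answering a query only touches the posting lists of its own terms instead of scanning every document.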
Section 3 is “Interface”. The term “interface” refers to the part of the search engine in which (1) the user formulates a request and (2) the user reads the results. The interface is considered from four points of view: Human-centered Web Search, Personalization, Question/Answering, and Mobile Search Engines. “Human-centered Web Search” is understood to be how Web search engines help people to find the information they are seeking. “Personalization” takes keywords from the user as an expression of their information need, but also uses additional information about the user (such as their preferences, community, location, or history) to assist in determining the relevance of pages. “Question/Answering” addresses the problem of finding answers to questions posed in natural language: given a query in natural language, the task aims at finding one or more concise answers in the form of sentences or phrases. “Mobile Search Engines” may be defined as the combination of search technologies and knowledge about the user's context in a mobile environment into a single framework, in order to provide the most appropriate answer to the user's information needs.
Finally, Section 4 is “Evaluation”. Evaluation means two things: (1) tracing the users' behaviors, with special attention to the concept of “information practice” and other related concepts such as “use”, “activity”, and “behavior”, which are widely used in the literature but not always strictly defined, the aim being to place the users and their needs at the center of the design process; (2) evaluating the next generation search engines according to four main criteria for improving the quality of the search results: index quality, quality of the results, quality of search features, and search engine usability.
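The idea of personalization, using information about the user in addition to the query keywords, can be sketched in Python as a simple re-ranking step. This is only one possible scheme under stated assumptions: each result carries a base relevance score and a list of topic labels, the user profile is a list of interests, and the `boost` parameter is a hypothetical tuning knob.

```python
def personalize(results, user_interests, boost=1.0):
    """Re-rank results by adding a bonus for each topic the user cares about.

    results: list of (doc_id, base_score, topics) tuples, where base_score
    is assumed to come from an ordinary keyword-based ranking.
    """
    interests = set(user_interests)
    rescored = []
    for doc_id, base_score, topics in results:
        overlap = len(interests & set(topics))
        rescored.append((doc_id, base_score + boost * overlap))
    # Highest score first; ties broken by document id for determinism.
    return sorted(rescored, key=lambda item: (-item[1], item[0]))


# Hypothetical results for the ambiguous query "jaguar".
results = [
    ("jaguar-car", 2.0, ["cars"]),
    ("jaguar-animal", 1.5, ["wildlife", "zoology"]),
]

ranked = personalize(results, user_interests=["wildlife"])
print(ranked)
```

With an empty profile the original order is preserved, so the keyword-based ranking remains the fallback when nothing is known about the user.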
Christophe Jouis, University Paris Sorbonne Nouvelle and LIP6 (UPMC & CNRS), France
Ismaïl Biskri, University of Quebec at Trois Rivieres, Canada
Jean-Gabriel Ganascia, LIP6, (UPMC & CNRS), France
Magali Roux, INIST and LIP6, (UPMC & CNRS), France
Mukhopadhyay, B., & Mukhopadhyay, S. (2004, February 11-13). Data mining techniques for information retrieval. In Proceedings of the 2nd International Conference of the Convention on Automation of Libraries in Education and Research Institution, New Delhi, India (p. 506).
- Berry, Michael W., University of Tennessee, USA
- Biskri, Ismaïl, Professor, Université du Québec à Trois-Rivières, Québec, Canada
- Boughanem, Mohand, Université Paul Sabatier, France
- Bourdaillet, Julien, Université de Montréal, Québec, Canada
- Bourdoncle, François, EXALEAD, France
- Chailloux, Jérôme, ERCIM (European Research Consortium for Informatics and Mathematics), France
- Chaudiron, Stéphane, Université Lille 3, France
- Constant, Patrick, PERTIMM, France
- Das, Abhishek, Google Inc., USA
- Desclés, Jean-Pierre, Université Paris-Sorbonne, France
- Dulong, Tanneguy, ARISEM (THALES), France
- Emam, Ossama, Cairo HLT Group IBM, USA
- Ferret, Olivier, LI2CM/CEA (Laboratoire d’Ingénierie de la Connaissance Multimédia Multilingue/Commissariat à l'Énergie Atomique), France
- Fluhr, Christian, Cedege/Hossur'Tech, France
- Fouladi, Karan, LIP6/UPMC-CNRS (Laboratoire d’Informatique de Paris 6/Université Pierre et Marie Curie and CNRS), France
- Gallinari, Patrick, LIP6 (UPMC/CNRS), France
- Ganascia, Jean-Gabriel, LIP6 (UPMC/CNRS), France
- Gargouri, Faiez, ISIM (Institut Supérieur d'Informatique et de Multimédia de Sfax), Tunisia
- Ghitalla, Frank, INIST (Institut de l'Information Scientifique et Technique), France
- Grau, Brigitte, LIMSI/CNRS (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur), France
- Grefenstette, Gregory, EXALEAD, France
- Habib, Bassel, LIP6 (UPMC-CNRS), France
- Jaziri, Wassim, ISIM, Sfax, Tunisia
- Huot, Charles, TEMIS Group, France
- Jain, Ankit, Google Inc., USA
- Jouis, Christophe, Université Paris Sorbonne Nouvelle and LIP6 (UPMC-CNRS), France
- Lassalle, Edmond, Orange Labs (France Telecom), France
- Le Borgne, Hervé, LI2CM (CEA), France
- Lucas, Philippe, TECHNOLOGIES group (Spirit software), France
- Meng, Fan, University of Michigan, USA
- Meunier, Jean-Guy, UQAM (Université du Québec à Montréal), Québec, Canada
- Moulinier, Isabelle, Thomson Reuters, USA
- Mustafa El-Hadi, IDIST, Université Lille 3, France
- Nie, Jian-Yun, Université de Montréal, Montreal, Quebec, Canada
- Piwowarski, Benjamin, Information Retrieval Group, University of Glasgow, UK
- Poupon, Anne, Equipe Biologie et Bioinformatique des Systèmes de Signalisation Physiologie du Comportement et de la Reproduction, France
- Riad, Mokadem, IRIT (Institut de Recherche en Informatique de Toulouse), France
- Robertson, Stephen, Microsoft Research Laboratory in Cambridge, UK
- Rocca-Serra, Philippe, The European Bioinformatics Institute, EMBL Outstation - Hinxton, Cambridge, UK
- Roux, Magali, LIP6 (UPMC-CNRS) and INIST, France
- Shafei, Bilal, ITS – BBE department, Columbia University, USA and An-Najah National University, Palestine
- Sansone, Susanna-Assunta, The European Bioinformatics Institute, EMBL Outstation - Hinxton, Cambridge, UK
- Savoy, Jacques, Université de Neuchâtel, Switzerland
- Smyth, Barry, Professor, University College Dublin, Ireland
- Stroppa, Nicolas, Yahoo! Labs, France
- Timimi, Ismaïl, IDIST, Université Lille 3, France
- Velcin, Julien, ERIC Lab, University Lyon 2, France
- Vinot, Romain, Yahoo! Labs in Paris, France
- Wassermann, Renata, Computer Science Department, University of São Paulo, Brazil
- Zitouni, Imed, IBM T.J. Watson Research Center, USA