Indexing the World Wide Web: The Journey So Far

Abhishek Das (Google Inc., USA) and Ankit Jain (Google Inc., USA)
DOI: 10.4018/978-1-4666-0330-1.ch001

Abstract

In this chapter, the authors describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. The authors present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. They highlight techniques that improve the relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, the authors delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. The chapter concludes with some thoughts on information organization for newly emerging forms of data.

Introduction

The World Wide Web is considered to be the greatest breakthrough in telecommunications after the telephone, radically altering the availability and accessibility of information. Quoting The New Media Reader from MIT Press (Wardrip-Fruin, 2003):

“The World-Wide Web (W3) was developed to be a pool of human knowledge, and human culture, which would allow collaborators in remote sites to share their ideas and all aspects of a common project.”

The last two decades have witnessed many significant attempts to make this knowledge “discoverable”. These attempts broadly fall into two categories:

  1. Classification of webpages in hierarchical categories (directory structure), championed by the likes of Yahoo! and Open Directory Project;

  2. Full-text index search engines such as Excite, AltaVista, and Google.

The former is an intuitive method of arranging web pages, where subject-matter experts collect and annotate pages for each category, much as books are classified in a library. With the rapid growth of the web, however, the popularity of this method gradually declined. First, the strictly manual editorial process could not cope with the increase in the number of web pages. Second, the user’s idea of which sub-tree(s) to search for a particular topic was expected to match that of the editors responsible for the classification.

We are most familiar with the latter approach today, which presents the user with a keyword search interface and uses a pre-computed web index to algorithmically retrieve and rank web pages that satisfy the query. In fact, this is probably the most widely used method for navigating through cyberspace today, primarily because it can scale as the web grows.

Even though the indexable web is only a small fraction of the web (Selberg, 1999), the earliest search engines had to handle orders of magnitude more documents than previous information retrieval systems. Around 1995, when the number of static web pages was believed to double every few months, AltaVista reported having crawled and indexed approximately 25 million webpages. In 1997, the total estimated number of pages indexed by all the largest search engines was 200 million (Bharat, 1998), which reportedly grew to 800 million by 1998 (Lawrence, 1999). Indices of today’s search engines are several orders of magnitude larger (Gulli, 2005); Google reported around 25 billion web pages in 2005 (Patterson, 2005), while Cuil indexed 120 billion pages in 2008 (Arrington, 2008). Harnessing the power of hundreds, if not thousands, of machines has proven key in addressing this challenge of grand scale.
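
The idea of a pre-computed index that is consulted at query time can be illustrated with a toy inverted index. The sketch below is only an illustration under simplifying assumptions, not the design of any search engine named in this chapter: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the ranking rule (sum of raw term counts) are all hypothetical choices made for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Pre-compute a mapping: term -> {doc_id: number of occurrences of the term}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every query term, best match first.

    The score is simply the sum of term counts, an assumed stand-in for a
    real relevance function. Ties are broken by document id.
    """
    terms = tokenize(query)
    if not terms:
        return []
    # Intersect the posting lists so only documents with all terms remain.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    scored = [(sum(index[t][d] for t in terms), d) for d in candidates]
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical sample collection.
docs = {
    "a": "web search engines index the web",
    "b": "library catalog of books",
    "c": "the web grows",
}
index = build_index(docs)
results = search(index, "web")
```

Because the index is built once ahead of time, answering a query touches only the posting lists of the query terms rather than every document, which is the property that lets this approach scale as the collection grows.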
