Indexing the World Wide Web: The Journey So Far

Abhishek Das (Google Inc., USA) and Ankit Jain (Google Inc., USA)
DOI: 10.4018/978-1-4666-0330-1.ch001

Abstract

In this chapter, the authors describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. The authors present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. They highlight techniques that improve the relevance of results, discuss trade-offs that best utilize machine resources, and cover distributed processing concepts in this context. In particular, the authors delve into the topics of indexing phrases instead of terms, storage in memory versus on disk, and data partitioning. Some thoughts on information organization for newly emerging data forms conclude the chapter.

Introduction

The World Wide Web is considered the greatest breakthrough in telecommunications since the telephone, radically altering the availability of and accessibility to information. Quoting the new media reader from MIT Press (Wardrip-Fruin, 2003):

“The World-Wide Web (W3) was developed to be a pool of human knowledge, and human culture, which would allow collaborators in remote sites to share their ideas and all aspects of a common project.”

The last two decades have witnessed many significant attempts to make this knowledge “discoverable”. These attempts broadly fall into two categories:

  1. Classification of webpages into hierarchical categories (directory structure), championed by the likes of Yahoo! and the Open Directory Project;

  2. Full-text index search engines such as Excite, AltaVista, and Google.

The former is an intuitive method of arranging web pages, in which subject-matter experts collect and annotate pages for each category, much as books are classified in a library. With the rapid growth of the web, however, the popularity of this method gradually declined. First, the strictly manual editorial process could not cope with the increase in the number of web pages. Second, the user was expected to share the editors’ idea of which sub-tree(s) to search for a particular topic, since the editors were responsible for the classification.

We are most familiar with the latter approach today, which presents the user with a keyword search interface and uses a pre-computed web index to algorithmically retrieve and rank web pages that satisfy the query. This is probably the most widely used method for navigating through cyberspace today, primarily because it can scale as the web grows. Even though the indexable web is only a small fraction of the web (Selberg, 1999), the earliest search engines had to handle orders of magnitude more documents than previous information retrieval systems. Around 1995, when the number of static web pages was believed to double every few months, AltaVista reported having crawled and indexed approximately 25 million webpages. In 1997, the total estimated number of pages indexed by all the largest search engines was 200 million (Bharat, 1998), which reportedly grew to 800 million by 1998 (Lawrence, 1999). Indices of today’s search engines are several orders of magnitude larger (Gulli, 2005); Google reported around 25 billion web pages in 2005 (Patterson, 2005), while Cuil indexed 120 billion pages in 2008 (Arrington, 2008). Harnessing together the power of hundreds, if not thousands, of machines has proven key to addressing this challenge of grand scale.
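At the heart of the full-text approach is the inverted index: a mapping from each term to the list of documents (postings) containing it, built ahead of time so queries need only intersect posting lists rather than scan pages. The sketch below is a toy illustration of this idea, not any particular engine's implementation; the corpus, tokenization (lowercase whitespace splitting), and AND-only query semantics are simplifying assumptions.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for crawled web pages.
pages = {
    "page1": "the quick brown fox",
    "page2": "the lazy dog",
    "page3": "quick brown dogs are lazy",
}

def build_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Conjunctive (AND) query: intersect posting lists of all terms."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(pages)
print(search(index, "quick brown"))  # ['page1', 'page3']
```

Real engines extend this core with ranking signals, phrase and positional information, compression, and partitioning across machines, all topics taken up later in the chapter.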
