Rich-Prospect Browsing Interfaces

Rich-Prospect Browsing Interfaces

Stan Ruecker
DOI: 10.4018/978-1-60566-014-1.ch168
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

Everyone who has browsed the Internet is familiar with the problems involved in finding what they want. From the novice to the most sophisticated user, the challenge is the same: how to identify quickly and reliably the precise Web sites or other documents they seek from within an ever-growing collection of several billion possibilities? This is not a new problem. Vannevar Bush, the successful Director of the Office of Scientific Research and Development, which included the Manhattan project, made a famous public call in The Atlantic Monthly in 1945 for the scientific community in peacetime to continue pursuing the style of fruitful collaboration they had experienced during the war (Bush, 1945). Bush advocated this approach to address the central difficulty posed by the proliferation of information beyond what could be managed by any single expert using contemporary methods of document management and retrieval. Bush’s vision is often cited as one of the early visions of the World Wide Web, with professional navigators trailblazing paths through the literature and leaving sets of linked documents behind them for others to follow. Sixty years later, we have the professional indexers behind Google, providing the rest of us with a magic window into the data. We can type a keyword or two, pause for reflection, then hit the “I’m feeling lucky” button and see what happens. Technically, even though it often runs in a browser, this task is “information retrieval.” One of its fundamental tenets is that the user cannot manage the data and needs to be guided and protected through the maze by a variety of information hierarchies, taxonomies, indexes, and keywords. Information retrieval is a complex research domain. The Association for Computing Machinery, arguably the largest professional organization for academic computing scientists, sponsors a periodic contest in information retrieval, where teams compete to see who has the most effective algorithms. The contest organizers choose or create a document collection, such as a set of a hundred thousand newspaper articles in English, and contestants demonstrate their software’s ability to find the most documents most accurately. Two of the measures are precision and recall: both of these are ratios, and they pull in opposite directions. Precision is the ratio of the number of documents that have been correctly identified out of the number of documents returned by the search. Recall is the ratio of the number of documents that have been retrieved out of the total number in the collection that should have been retrieved. It is therefore possible to get 100% on precision—just retrieve one document precisely on topic. However, the corresponding recall score would be a disaster. Similarly, an algorithm can score 100% on recall just by retrieving all the documents in the collection. Again, the related precision score would be abysmal. Fortunately, information retrieval is not the only technology available. For collections that only contain thousands of entries, there is no reason why people should not be allowed to simply browse the entire contents, rather than being limited to carrying out searches. Certainly, retrieval can be part of browsing—the two technologies are not mutually exclusive. However, by embedding retrieval within browsing the user gains a significant number of perceptual advantages and new opportunities for actions.
Chapter Preview
Top

Introduction

Everyone who has browsed the Internet is familiar with the problems involved in finding what they want. From the novice to the most sophisticated user, the challenge is the same: how to identify quickly and reliably the precise Web sites or other documents they seek from within an ever-growing collection of several billion possibilities?

This is not a new problem. Vannevar Bush, the successful Director of the Office of Scientific Research and Development, which included the Manhattan project, made a famous public call in The Atlantic Monthly in 1945 for the scientific community in peacetime to continue pursuing the style of fruitful collaboration they had experienced during the war (Bush, 1945). Bush advocated this approach to address the central difficulty posed by the proliferation of information beyond what could be managed by any single expert using contemporary methods of document management and retrieval.

Bush’s vision is often cited as one of the early visions of the World Wide Web, with professional navigators trailblazing paths through the literature and leaving sets of linked documents behind them for others to follow. Sixty years later, we have the professional indexers behind Google, providing the rest of us with a magic window into the data. We can type a keyword or two, pause for reflection, then hit the “I’m feeling lucky” button and see what happens.

Technically, even though it often runs in a browser, this task is “information retrieval.” One of its fundamental tenets is that the user cannot manage the data and needs to be guided and protected through the maze by a variety of information hierarchies, taxonomies, indexes, and keywords. Information retrieval is a complex research domain. The Association for Computing Machinery, arguably the largest professional organization for academic computing scientists, sponsors a periodic contest in information retrieval, where teams compete to see who has the most effective algorithms. The contest organizers choose or create a document collection, such as a set of a hundred thousand newspaper articles in English, and contestants demonstrate their software’s ability to find the most documents most accurately. Two of the measures are precision and recall: both of these are ratios, and they pull in opposite directions. Precision is the ratio of the number of documents that have been correctly identified out of the number of documents returned by the search. Recall is the ratio of the number of documents that have been retrieved out of the total number in the collection that should have been retrieved. It is therefore possible to get 100% on precision—just retrieve one document precisely on topic. However, the corresponding recall score would be a disaster. Similarly, an algorithm can score 100% on recall just by retrieving all the documents in the collection. Again, the related precision score would be abysmal.

Fortunately, information retrieval is not the only technology available. For collections that only contain thousands of entries, there is no reason why people should not be allowed to simply browse the entire contents, rather than being limited to carrying out searches. Certainly, retrieval can be part of browsing—the two technologies are not mutually exclusive. However, by embedding retrieval within browsing the user gains a significant number of perceptual advantages and new opportunities for actions.

Key Terms in this Chapter

Semantic Web Services: Integrate Web services technology with machine supported data interpretation, using ontologies as a data model, to enable automatic discovery, selection, composition, and Web-based execution of services.

OWL: The Web ontology language is a semantic markup language for publishing and sharing ontologies on the Web. It is primarily aimed at representing information about categories of objects and how objects are interrelated. OWL can also represent information about the objects themselves.

XML: The extensible markup language is a metalanguage, or a language for describing languages. XML enables authors to define their own tags.

Ontology: A formal, explicit specification of a shared conceptualization. It is a formal description of the concepts and relationships that are relevant for a given domain.

Web Service: A software system identified by a URI, whose public interfaces and bindings are defined and described using XML.

Semantic Web: The Semantic Web is an extension of the current Web in which information is given well-defined meaning through the use of metadata and ontologies. It will allow the automatic access to resources using semantic descriptions amenable to be processed by software agents.

Complete Chapter List

Search this Book:
Reset