Conventional Web search engines return long ranked lists of documents that users are forced to sift through to find the relevant ones. The notoriously low precision of Web search engines, coupled with the ranked-list presentation, makes it hard for users to find the information they seek. Developing retrieval techniques that yield both high recall and high precision is desirable. Unfortunately, such techniques would impose additional resource demands on search engines, which are already under severe resource constraints. A more productive approach seems to be to enhance post-processing of the retrieved set. If such value-adding processes allow the user to easily identify relevant documents in a large retrieved set, queries that produce low-precision, high-recall results will become more acceptable. We propose improving the quality of Web search by combining meta-search and self-organizing maps. This combination can help users both locate interesting documents more easily and get an overview of the retrieved document set.
Key Terms in this Chapter
Search Result Ranking: Ranking, in general, is the process of positioning items such as individuals, groups, or businesses on an ordinal scale in relation to others. A list arranged in this way is said to be in rank order. Search engines rank Web pages depending on their relevance to a user’s query. Each major search engine is unique in how it determines page rank. There is a growing business in trying to trick search engines into giving a higher page rank to particular Web pages as a marketing tool. The makers of search engines, of course, strive to make sure that such tricks are ineffective. One way that they do this is by keeping their algorithmic details confidential. They also may play the spy versus spy game of watching for the use of such tricks and refining their ranking algorithms to circumvent the tricks. At the same time, some search companies try to play double agent by selling improved page rank (positioning in search results).
Information Overload: Historically, more information has almost always been a good thing. However, as the ability to collect information grew, the ability to process that information did not keep up. Today, we have large amounts of available information and a high rate of new information being added, but contradictions in the available information, a low signal-to-noise ratio (proportion of useful information found to all information found), and inefficient methods for comparing and processing different kinds of information characterize the situation. The result is the “information overload” of the user, that is, users have too much information to make a decision or remain informed about a topic.
Inverted Index: An inverted index is an index, built over a set of documents, of the words that occur in those documents. The index is accessed by some search method. Each index entry gives a word and a list of the documents in which it occurs, possibly with the locations of the word within each document. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index, which stores the list of words per document, has been built, it is inverted to produce the inverted index. Querying the forward index directly would require iterating sequentially through every document and every word in it to check for a match. The time, memory, and processing resources needed to perform such a query are not always technically realistic. Instead of listing the words per document, as the forward index does, the inverted index lists the documents per word. With the inverted index in place, the query can be resolved by jumping to the word ID (via random access) in the inverted index. Random access is generally regarded as faster than sequential access.
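The forward-then-inverted construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample documents, the whitespace tokenization, and the function names build_forward_index, invert, and lookup are assumptions made for the example and do not come from the chapter.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Forward index: map each document ID to the list of words it contains."""
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def invert(forward_index):
    """Invert a forward index into word -> {doc_id: [positions]}."""
    inverted = defaultdict(dict)
    for doc_id, words in forward_index.items():
        for position, word in enumerate(words):
            inverted[word].setdefault(doc_id, []).append(position)
    return dict(inverted)


def lookup(inverted_index, word):
    """Return the sorted IDs of the documents in which the word occurs."""
    # A single dictionary access replaces the scan over every document.
    return sorted(inverted_index.get(word.lower(), {}))


# Hypothetical three-document collection.
docs = {
    1: "web search engines rank documents",
    2: "self organizing maps cluster documents",
    3: "meta search combines search engines",
}
forward = build_forward_index(docs)
inverted = invert(forward)

print(lookup(inverted, "search"))  # documents 1 and 3
print(inverted["search"])          # word positions within each document
```

The dictionary lookup in lookup plays the role of the random access mentioned above: the cost of answering "where does word X occur" no longer grows with the number of documents that have to be scanned.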
Unsupervised Learning: Consider a system which receives some sequence of inputs x1, x2, x3, …, where xt is the sensory input at time t. This input, called the data, could correspond to an image on the retina, the pixels in a camera, or a sound waveform. It could also correspond to less-obviously sensory data, for example, the words in a news story, or the list of items in a supermarket shopping basket. In unsupervised learning, the system simply receives inputs x1, x2, …, but obtains neither supervised target outputs, nor rewards from its environment. It may seem somewhat mysterious to imagine what the system could possibly learn, given that it does not get any feedback from its environment. However, it is possible to develop a formal framework for unsupervised learning based on the notion that the system’s goal is to build representations of the input that can be used for decision-making, predicting future inputs, efficiently communicating the inputs to another system, and so forth. In a sense, unsupervised learning can be thought of as finding patterns in the data above and beyond what would be considered pure, unstructured noise. Two very simple classic examples of unsupervised learning are clustering and dimensionality reduction.
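As a small illustration of clustering, the sketch below runs plain k-means on six two-dimensional points. The choice of k-means, the sample points, and the fixed starting centroids are assumptions made for this example; the chapter itself relies on self-organizing maps, which are a different technique. Note that the function receives only the inputs and never any target outputs.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical inputs: two visually separate groups, no labels supplied.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

print(labels)     # cluster index assigned to each point
print(centroids)  # final cluster centers
```

The starting centroids are fixed here so that the run is repeatable; practical implementations usually pick them at random or with a seeding heuristic.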
Quality of Web Search: Seen from a user’s perspective, this term is related to the notion of “user satisfaction”. The more satisfied a user is with the search results and the different aspects of searching, the higher the rating of the search system. Assessing the quality of a Web search system and the results that it produces is notoriously difficult. For search results, criteria for determining the good, the bad, and the ugly include: scope and depth of coverage, authority, currency, accuracy and reliability, motive and purpose, ease of use and design issues, and so forth. Web search systems are used by a heterogeneous user population for a wide variety of tasks: from finding a specific Web document that the user has seen before and can easily describe, to obtaining an overview of an unfamiliar topic, to exhaustively examining a large set of documents on a topic, and more. A search system will prove useful only in a subset of these cases.
Recall and Precision: Recall and precision are two retrieval evaluation measures for information retrieval systems. Precision describes the ability of the system to retrieve top-ranked documents that are mostly relevant. Recall describes the ability of the system to find all of the relevant items in the corpus. If I is an example information request (from a test reference collection), R is the set of relevant documents for I (provided by specialists), A is the document answer set for I generated by the system being evaluated, and Ra = R ∩ A is the set of relevant documents in the answer set, then recall = |Ra|/|R| and precision = |Ra|/|A|.
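A short worked example may make the two formulas concrete. The sketch below computes recall = |Ra|/|R| and precision = |Ra|/|A| for a made-up request; the document identifiers and the function name recall_precision are hypothetical, and returning 0.0 for an empty set is a convention chosen for the example, not something the definition prescribes.

```python
def recall_precision(relevant, answer):
    """Return (recall, precision) for a relevant set R and an answer set A."""
    relevant, answer = set(relevant), set(answer)
    ra = relevant & answer  # Ra: relevant documents that were retrieved
    # Assumed convention: report 0.0 when a denominator would be zero.
    recall = len(ra) / len(relevant) if relevant else 0.0
    precision = len(ra) / len(answer) if answer else 0.0
    return recall, precision


# Hypothetical request: four relevant documents, five retrieved.
R = {"d1", "d2", "d3", "d4"}
A = {"d2", "d4", "d7", "d8", "d9"}
recall, precision = recall_precision(R, A)

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Here Ra contains d2 and d4, so recall is 2/4 and precision is 2/5: the system found half of the relevant documents, and fewer than half of what it returned was relevant.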
Browsing: Browsing is the leisurely and casual inspection of a body of information, usually on the World Wide Web, guided by the organization of the collections rather than by clearly defined intentions. Hypertext is an appropriate conceptual model for such organization. Usually, hypertext systems encourage browsing by stimulating the user to follow links. Today, most hypertext systems employ the point-and-click paradigm for user interaction; information is just one click (of the mouse button) away.
Information Customization (IC) Systems: IC systems are systems that customize information to the needs and interests of the user. They function proactively (take the initiative), continuously scan appropriate resources, analyze and compare content, select relevant information, and present it as visualizations or in a pruned format. Building software that can interact with the range and diversity of the online resources is a challenge, and the promise of IC systems is becoming highly attractive. Instead of users investing significant effort to find the right information, the right information should find the users. IC systems attempt to accomplish this by automating many functions of today’s information retrieval systems and providing features to optimally use information.