In the 1960s, automatic indexing methods for texts were developed. These methods already implemented the “bag-of-words” approach, which still prevails. Although automatic indexing is widely used today, many information providers and even Internet services still rely on human information work. In the 1970s, research shifted its interest to partial-match retrieval models and showed their superiority over Boolean retrieval models. Vector-space and later probabilistic retrieval models were developed. However, it took until the 1990s for partial-match models to succeed in the market. The Internet played a great role in this success. All Web search engines were based on partial-match models and provided ranked lists as results rather than unordered sets of documents. Consumers got used to this kind of search system, and all big search engines included partial-match functionality. However, there are many niches in which Boolean methods still dominate, for example, patent retrieval.
The objects handled by information retrieval systems may be pictures, graphics, videos, music objects, structured documents, or combinations thereof. This article is mainly concerned with information retrieval for text documents.
The user is at the center of the information retrieval process. Nevertheless, most research tends to be either user oriented or algorithm and system oriented. User-oriented research tries to pursue a holistic view of the process, whereas system-oriented research is concerned with measuring the effect of system components and tries to resolve efficiency issues.
The information retrieval process is inherently vague. In most systems, documents and queries traditionally contain natural language. The content of these documents needs to be analyzed, which is a hard task for computers. Robust semantic analysis for large text collections or even multimedia objects has yet to be developed. Therefore, text documents are represented by natural-language terms mostly without syntactic or semantic context. This is often referred to as the bag-of-words approach. These keywords or terms can only imperfectly represent an object because their context and relations to other terms are lost.
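The loss of context in a bag-of-words representation can be illustrated with a minimal sketch. The function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for this illustration, not part of any particular retrieval system.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphanumeric tokens, and count them.

    Word order and syntax are discarded; only term counts remain.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two sentences with opposite meanings but identical terms.
doc_a = "The dog bites the man"
doc_b = "The man bites the dog"

# Both sentences map to the same representation.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

Because both sentences yield the same term counts, a system that relies only on this representation cannot distinguish between them.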
As information retrieval needs to deal with vague knowledge, exact processing methods are not appropriate. Vague retrieval models like the probabilistic model are more suitable. As a consequence, the performance of a retrieval system cannot be predicted but must be determined in evaluations. Evaluation plays a key role in information retrieval. Evaluation needs to investigate how well a system supports the user in solving his or her knowledge problem (Baeza-Yates & Ribeiro-Neto, 1999).
Web search engines take the information retrieval process to the Internet. They need to contain the following modules (Arasu, Cho, Garcia-Molina, Paepcke, & Raghavan, 2001).
Key Terms in this Chapter
Precision: Precision is a quality measure for information retrieval evaluation. It gives the percentage of relevant documents within the result set. Precision can be calculated by dividing the number of relevant documents that were found by the total number of documents found.
Indexing: Indexing is the assignment of terms (words) that represent a document. Indexing can be carried out manually or automatically. Automatic indexing typically involves the elimination of stop words as well as stemming.
Stemming: Stemming refers to the mapping of word forms to stems or basic word forms. Word forms may differ from stems due to morphological changes necessary for grammatical reasons. The plural versions of English nouns, for example, are mostly constructed by adding an s to the basic noun. In most European languages, stemming needs to strip suffixes from word forms.
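Suffix stripping for English plurals can be sketched as follows. This is a toy rule set chosen for illustration only; the function name `naive_stem` and its three rules are assumptions, and production systems use far more complete algorithms such as the Porter stemmer.

```python
def naive_stem(word: str) -> str:
    """Strip a few common English plural suffixes (illustrative rules only)."""
    word = word.lower()
    # "queries" -> "query"
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # "classes" -> "class"
    if word.endswith("sses"):
        return word[:-2]
    # "documents" -> "document"; leave short words and "-ss" words unchanged
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


for form in ["documents", "queries", "classes", "class"]:
    print(form, "->", naive_stem(form))
```

Even this small example shows why stemming is error prone: rules that work for regular plurals mishandle irregular forms, so real stemmers combine many rules with exception handling.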
Information Retrieval: Information retrieval is concerned with the representation of knowledge and subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines.
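Inverse Document Frequency (IDF): IDF is a traditional weighting scheme for terms. It can be calculated as the logarithm of the number of documents in the collection divided by the number of documents that contain the term. Terms that occur in few documents thus receive a high weight. IDF is usually combined with the frequency of the term within a document.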
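A minimal sketch of an IDF calculation follows, assuming the common formulation idf(t) = log(N / df(t)), where N is the number of documents in the collection and df(t) is the number of documents containing term t. The function name, the representation of documents as sets of tokens, and the choice to return 0.0 for unseen terms are assumptions made for this illustration; many smoothed variants exist.

```python
import math


def inverse_document_frequency(term: str, documents: list[set[str]]) -> float:
    """Return log(N / df) for a term, or 0.0 if no document contains it."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        # Assumption: an unseen term gets weight 0.0 instead of raising an error.
        return 0.0
    return math.log(n_docs / doc_freq)


# Hypothetical toy collection; each document is a set of its terms.
collection = [
    {"web", "search"},
    {"web", "retrieval"},
    {"patent", "retrieval"},
    {"web"},
]

# "patent" occurs in one document and scores higher than the frequent "web".
print(inverse_document_frequency("patent", collection))
print(inverse_document_frequency("web", collection))
```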
Term Weighting: Weighting determines the importance of a term for a document. Weights are calculated by many different formulas that consider the frequency of each term in a document and in the collection, as well as the length of the document and the average or maximum length of any document in the collection.
Recall: Recall is a quality measure for information retrieval evaluation. It can be calculated by dividing the number of relevant documents that were found by the number of relevant documents in the collection. The second figure can often only be estimated.
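Precision and recall can be computed together from the set of retrieved documents and the set of relevant documents. The sketch below assumes that documents are identified by hypothetical string IDs and that the complete set of relevant documents is known, which in practice is often only an estimate.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents were found; two of them are among the five relevant ones.
found = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(found, relevant_docs)
print(precision, recall)  # 0.5 0.4
```

Returning more documents tends to raise recall but lower precision, which is why the two measures are usually reported together.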
Link Analysis: The links between pages on the Web are a large knowledge source that is exploited by link analysis algorithms for many ends. Many algorithms similar to PageRank determine a quality or authority score based on the number of incoming links of a page and on the scores of the pages from which these links originate. Furthermore, link analysis is applied to identify thematically similar pages, Web communities, and other social structures.
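The idea of deriving an authority score from incoming links can be sketched with a simplified PageRank-style power iteration. This is an illustrative sketch under stated assumptions, not the implementation used by any search engine: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the tiny example graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute simplified PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base score independent of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to its link targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page Web: a and b both link to c, and c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this example, page c receives the highest score because two pages link to it, and page a ranks above page b because its single incoming link comes from the highly scored page c.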