This chapter surveys the technologies involved in a Web search engine, with an emphasis on performance analysis. The aspects of a general-purpose search engine covered in this survey include system architectures, information retrieval theories as the basis of Web search, indexing and ranking of Web documents, relevance feedback and machine learning, personalization, and performance measurements. The objectives of the chapter are to review the theories and technologies pertaining to Web search, and to help readers understand how Web search engines work and how to use them more effectively and efficiently.
Key Terms in this Chapter
Document Frequency: The number of documents containing a particular term.
Inverted Index: An indexing system in which each term points to the documents in which that term occurs.
Term Frequency: The number of times that a term appears in a document.
Relevance Feedback: A mechanism in which, after an IR system generates a set of results for a given query, the user sends feedback of some form to the IR system to improve search accuracy.
Estimated Search Length (ESL): The average number of irrelevant documents that one has to examine in order to retrieve a given number of relevant documents.
Cosine Similarity: A measure used to evaluate the relevance between a query and a document in the vector space model; this measure is based on the cosine of the angle between the two vectors that represent the query and the document.
Rank: The order in which the retrieved documents are presented; the closer a document is to the beginning of the list, the more favored it is.
Averaged Search Length (ASL): The expected position of a relevant document in the ordered list of all documents.
Information Retrieval: A branch of science that deals with the representation, storage, and organization of, and access to, information, with the prime aim of retrieving information for a given set of queries.
Vector Space Model: A model in which each document is represented as a vector of weights, one weight for each of the terms found in the documents.
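Several of the terms above can be made concrete with a small sketch. The Python code below is illustrative only and is not taken from the chapter: the toy corpus, the function names, the whitespace tokenizer, and the use of raw term frequencies as vector weights are all assumptions (real engines typically use more refined weighting). The two search-length helpers score a single ranked list with no ties, which is a simplification of the averaged measures defined above.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the documents and their ids are invented for illustration.
DOCS = {
    "d1": "web search engine ranking",
    "d2": "web crawler and web index",
    "d3": "machine learning for ranking",
}

def term_frequencies(text):
    """Term frequency: how many times each term appears in one text."""
    return Counter(text.lower().split())

def build_inverted_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in term_frequencies(text).items():
            index[term][doc_id] = tf
    return index

def document_frequency(index, term):
    """Document frequency: number of documents containing the term."""
    return len(index.get(term, {}))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Rank: order documents by decreasing cosine similarity to the query."""
    q = term_frequencies(query)
    scores = {d: cosine_similarity(q, term_frequencies(t)) for d, t in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def search_length(ranked_ids, relevant, wanted):
    """Irrelevant documents examined before `wanted` relevant ones are found."""
    found = irrelevant = 0
    for doc_id in ranked_ids:
        if doc_id in relevant:
            found += 1
            if found == wanted:
                return irrelevant
        else:
            irrelevant += 1
    return None  # the list holds fewer than `wanted` relevant documents

def mean_relevant_position(ranked_ids, relevant):
    """Mean 1-based position of the relevant documents in one ranked list."""
    positions = [i for i, d in enumerate(ranked_ids, start=1) if d in relevant]
    return sum(positions) / len(positions) if positions else None

index = build_inverted_index(DOCS)
ranking = [doc_id for doc_id, _ in rank("web ranking", DOCS)]
```

Here `index["web"]` maps to the two documents that contain "web", along with the term frequency in each, and `ranking` lists the document ids from most to least similar to the query "web ranking". Passing `ranking` and a set of relevant ids to `search_length` or `mean_relevant_position` gives a per-query value that could then be averaged over many queries.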