Over 60% of the online population are non-English speakers, and the number of non-English speakers is probably growing faster than the number of English speakers. Most search engines were originally engineered for English. They do not take full account of inflectional semantics or of features such as diacritics and capitalization. The main conclusion from the literature is that searching with non-English and non-Latin based queries results in lower success and requires additional user effort to achieve acceptable recall and precision. In this chapter, a Greek query log is morphologically and grammatically analyzed; a number of queries are then submitted to search engines, and their relevance is evaluated with the aid of real users. A Greek meta-searcher redirecting normalized queries to Google.gr is also presented and evaluated. An increase in relevance is reported when stopwords are eliminated and queries are normalized based on their morphology.
According to recent statistics, 64.2% of the online population are non-English users (Global Internet Statistics, 2003). As the Web population continues to grow, more non-English users will come online. Recent studies showed that non-English queries and unclassifiable queries have nearly tripled since 1997 (Spink et al., 2002). Even though several Web search engines exist, most of their features and virtues cater to the English language only. For example, the query “Bookshop New York” in Google retrieves Web pages mentioning the semantically related words “book”, “books” and “bookstore” as well. This is easy to see because the matching terms are shown in bold. In contrast, the queries “Librairie Paris” in French, “Libreria Roma” in Italian, “Librería Madrid” in Spanish and “Βιβλιοπωλείο Αθήνα” in Greek retrieve only pages which include the query terms exactly as they are typed in the query. Another more convincing example results from the query “stemming site:http://video.yahoo.com] operate practically only on Latin named resources and Web pages.
To effectively support the information needs of non-English and non-Latin Web searchers, we first need to understand how users interact with search engines and to study their queries thoroughly. Then the relevance of queries following specific patterns should be evaluated. Finally, in order to improve Web searching in a specific natural language, new tools and techniques should be proposed that take into account the linguistic features and restrictions of that language.
Key Terms in this Chapter
Query: A user query is the expression of the user's information need, usually in natural language. Some retrieval systems allow the use of Boolean connectives between the query terms.
Information Retrieval: Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within hypertext collections such as the Internet or intranets.
Document: A unit of retrieval. It might be a paragraph, a section, an article, a chapter, a Web page or a whole book.
Search Engine: Search engines are advanced searching systems operating on hypertext collections. Search engines attempt to locate Web pages, images, video and sounds relevant to a user query. They additionally offer a number of specialized services such as book search, blog search, maps, e-shopping, etc.
Text Based Image Retrieval/Concept Based Image Retrieval: In text based (concept based) image retrieval, images are annotated with a textual description and their retrieval is based on matching the user’s textual query to the annotation of the image.
Index: Index refers to a database containing the most important terms of each document which has been statistically analyzed by a retrieval system. Index terms or keywords contained in the index of each search engine are matched to the user query terms so as to retrieve the most relevant documents. Traditional retrieval systems keep only the terms carrying significant information in their indexes. Search engines store all the terms contained in Web pages to support “exact matching” and “all the words” queries.
Query Expansion: A process of adding new terms to a given query in an attempt to provide better contextualization and hopefully retrieve documents which are more useful to the user.
Lemmatization: Lemmatization involves the reduction of words to their respective lemmas. For example, the lemma for the words “computation” and “computer” is the word “compute”. Lemmatizers operate on single and compound terms and on phrases while stemmers take as input single words only.
Precision/Relevance: Precision is an information retrieval performance measure that quantifies the fraction of retrieved documents which are known to be relevant.
Inflection: Inflection is variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast which is obligatory for the stem’s word class in some given grammatical context.
Stopwords: Stopwords are common words whose discriminatory power is too low to distinguish between documents. Usual candidates for the stopword list are articles, prepositions and conjunctions, although specific nouns, verbs or other grammatical types could be of low importance in terms of information retrieval in specific domains.
Query Term/Keyword: Query terms (keywords) are the words contained in a user query. Boolean operators or wildcards are not considered as query terms. They are operators used to link query terms.
Stemming: Stemming is the process of reducing a word to its stem or root form. For the purposes of IR, morphological variants of words have similar semantic interpretations and can be considered as equivalent. For example, the word “computation” might be stemmed to “comput”. Stemming is either based on linguistic dictionaries or on algorithms.
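Several of the terms above (query terms, stopwords, diacritics and precision) can be illustrated with a short Python sketch. It is not the meta-searcher described in this chapter: the function names and the tiny stopword list are hypothetical and chosen only for illustration, and a real system would use a curated Greek stopword list and a proper stemmer or lemmatizer.

```python
import unicodedata
from typing import Iterable, List, Set

# Hypothetical, deliberately tiny Greek stopword list (articles, a
# conjunction, prepositions), stored without diacritics. Illustration only.
STOPWORDS: Set[str] = {"ο", "η", "το", "και", "στην", "για"}


def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g. the Greek tonos) from the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_query(query: str) -> List[str]:
    """Lowercase the query, strip diacritics and drop stopwords."""
    terms = strip_diacritics(query.lower()).split()
    return [term for term in terms if term not in STOPWORDS]


def precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved documents that are known to be relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0  # assumption: precision of an empty result list is 0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


if __name__ == "__main__":
    print(normalize_query("Βιβλιοπωλείο στην Αθήνα"))
    print(precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

Under these assumptions, the query “Βιβλιοπωλείο στην Αθήνα” is reduced to the two unaccented, lowercase terms “βιβλιοπωλειο” and “αθηνα”, and a result list in which two of four retrieved documents are relevant has a precision of 0.5.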