Retrieving Non-Latin Information in a Latin Web: The Case of Greek

Fotis Lazarinis

doi:10.4018/978-1-59904-990-8.ch031

Source Title: Handbook of Research on Text and Web Mining Technologies

Abstract

Over 60% of the online population are non-English speakers, and the number of non-English speakers is probably growing faster than the number of English speakers. Most search engines were originally engineered for English. They do not take full account of inflectional semantics, nor of features such as diacritics or the use of capitals. The main conclusion from the literature is that searching with non-English and non-Latin queries is less successful and requires additional user effort to achieve acceptable recall and precision. In this chapter a Greek query log is analyzed morphologically and grammatically, a number of queries are submitted to search engines, and the relevance of the results is evaluated with the aid of real users. A Greek meta-searcher that redirects normalized queries to Google.gr is also presented and evaluated. An increase in relevance is reported when stopwords are eliminated and queries are normalized based on their morphology.

Chapter Preview

Introduction

According to recent statistics, 64.2% of the online population are non-English users (Global Internet Statistics, 2003). As the Web population continues to grow, more non-English users will come online. Recent studies showed that non-English queries and unclassifiable queries have nearly tripled since 1997 (Spink et al., 2002). Even though several Web search engines exist, most of their features cater for the English language only. For example, the query “Bookshop New York” in Google retrieves Web pages mentioning the semantically related words “book”, “books” and “bookstore” as well. This is easy to see because the matching terms are shown in bold. In contrast, the queries “Librairie Paris” in French, “Libreria Roma” in Italian, “Librería Madrid” in Spanish and “Βιβλιοπωλείο Αθήνα” in Greek retrieve only pages which include the query terms exactly as they are typed in the query. Another more convincing example results from the query “stemming site:http://video.yahoo.com] operate practically only on Latin named resources and Web pages.

To effectively support the information needs of non-English and non-Latin Web searchers, we first need to understand how users interact with search engines and to study their queries thoroughly. Then the relevance of queries following specific patterns should be evaluated. Finally, to improve Web searching in a specific natural language, new tools and techniques should be proposed that take into account the linguistic features and restrictions of that language.
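As an illustration of the kind of language-aware processing discussed here, the following is a minimal sketch of query normalization that ignores diacritics and capitalization. It is not the chapter's meta-searcher; the function name `normalize_query` is hypothetical, and the approach (Unicode decomposition followed by case folding) is an assumption about one reasonable way to do it.

```python
import unicodedata


def normalize_query(query: str) -> str:
    """Return an accent- and case-insensitive form of a query (illustrative only)."""
    # Decompose accented letters into base letter + combining mark (NFD).
    decomposed = unicodedata.normalize("NFD", query)
    # Drop combining marks (Unicode category Mn), e.g. the Greek tonos.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() lowercases and also maps the final sigma "ς" to "σ".
    return unicodedata.normalize("NFC", stripped.casefold())


if __name__ == "__main__":
    for q in ("Βιβλιοπωλείο Αθήνα", "ΒΙΒΛΙΟΠΩΛΕΙΟ ΑΘΗΝΑ", "βιβλιοπωλειο αθηνα"):
        print(normalize_query(q))
```

With this sketch, the accented, capitalized and unaccented spellings of the same query all map to one string, so they can be matched against text normalized in the same way.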

Key Terms in this Chapter

Query: A user query is the expression of the user's information need, usually in natural language. Some retrieval systems allow the use of Boolean connectives between the query terms.

Information Retrieval: Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within hypertext collections such as the Internet or intranets.

Document: A unit of retrieval. It might be a paragraph, a section, an article, a chapter, a Web page or a whole book.

Search Engine: Search engines are advanced searching systems operating on hypertext collections. Search engines attempt to locate Web pages, images, video and sounds relevant to a user query. They additionally offer a number of specialized services such as book search, blog search, maps, e-shopping, etc.

Text Based Image Retrieval/Concept Based Image Retrieval: In text based (concept based) image retrieval, images are annotated with a textual description and their retrieval is based on matching the user’s textual query to the annotation of the image.

Index: Index refers to a database containing the most important terms of each document which has been statistically analyzed by a retrieval system. Index terms or keywords contained in the index of each search engine are matched to the user query terms so as to retrieve the most relevant documents. Traditional retrieval systems keep only the terms carrying significant information in their indexes. Search engines store all the terms contained in Web pages to support “exact matching” and “all the words” queries.

Query Expansion: A process of adding new terms to a given query in an attempt to provide better contextualization and hopefully retrieve documents which are more useful to the user.

Lemmatization: Lemmatization involves the reduction of words to their respective lemmas. For example, the lemma for the words “computation” and “computer” is the word “compute”. Lemmatizers operate on single and compound terms and on phrases while stemmers take as input single words only.

Precision/Relevance: Precision is an information retrieval performance measure that quantifies the fraction of retrieved documents which are known to be relevant.
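The definition above can be written as a short calculation. This is a minimal sketch; the function name `precision` and the document identifiers are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are known to be relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        # No documents retrieved: define precision as 0.0 to avoid dividing by zero.
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


if __name__ == "__main__":
    # Four documents retrieved, two of them judged relevant.
    print(precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))
```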

Inflection: Inflection is variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast which is obligatory for the stem’s word class in some given grammatical context.

Stopwords: Stopwords are common words whose discriminatory power is too low to distinguish between documents. Usual candidates for the stopword list are articles, prepositions and conjunctions, although specific nouns, verbs or other grammatical types could be of low importance in terms of information retrieval in specific domains.
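Stopword elimination can be sketched as a simple filter over the query terms. The stopword list below is a small hypothetical sample of Greek articles, prepositions and conjunctions chosen for illustration; it is not the list used in the chapter.

```python
# Hypothetical sample list; a real system would use a fuller, curated list.
GREEK_STOPWORDS = frozenset({
    "ο", "η", "το", "οι", "τα", "του", "της", "των",
    "και", "σε", "με", "για", "από", "στην", "στον", "στο",
})


def remove_stopwords(query, stopwords=GREEK_STOPWORDS):
    """Return the query terms that are not in the stopword list."""
    # Compare in lowercase so that "Το" and "το" are treated alike.
    return [term for term in query.split() if term.lower() not in stopwords]


if __name__ == "__main__":
    print(remove_stopwords("βιβλιοπωλείο στην Αθήνα"))
```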

Query Term/Keyword: Query terms (keywords) are the words contained in a user query. Boolean operators or wildcards are not considered as query terms. They are operators used to link query terms.

Stemming: Stemming is the process of reducing a word to its stem or root form. For the purposes of IR, morphological variants of words have similar semantic interpretations and can be considered as equivalent. For example, the word “computation” might be stemmed to “comput”. Stemming is either based on linguistic dictionaries or on algorithms.
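To make the algorithmic variant concrete, here is a deliberately tiny suffix-stripping sketch for English. It is a toy, not the Porter algorithm or any stemmer used in the chapter; the suffix list and the name `toy_stem` are assumptions made for illustration.

```python
# Hypothetical suffix list, just large enough to cover the example words.
SUFFIXES = ("ation", "ing", "er", "e", "s")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    # No suffix applies (or the stem would be too short): leave the word as is.
    return word


if __name__ == "__main__":
    for w in ("computation", "computer", "compute", "computing"):
        print(w, "->", toy_stem(w))
```

In this sketch the four morphological variants reduce to the same stem, which is what lets a retrieval system treat them as equivalent.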
