Document Search Images in Text Collections for Restricted Domains on Websites

Pavel Makagonov, Celia B. Reyes E., Grigori Sidorov

Source Title: Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications

DOI: 10.4018/978-1-60960-881-1.ch009

Abstract

The main idea of the authors’ research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a website. As a case study, they use the digital library of the website on Mixtec culture that they maintain. The authors propose using the concept of the text document search image (TDSI). To create TDSIs, they analyze word frequencies in the documents and distinguish between Zipf’s distribution, which is typical of meaningful words, and distributions approximated by an ellipse, which are typical of auxiliary words. The authors also describe some analogies of these distributions in architecture and in urban planning. They describe a toolkit, DDL, that allows for TDSI creation and show its application to the mentioned website and to a corpus of dialogs with a railway office information system.

Chapter Preview

1. Introduction: Retrieval Problems For Digital Library Of Non-Commercial Website

The phenomenon of information overload on the Internet means that access to the knowledge available on a website is a problem not only for search engines but also for the owners of the website. Usually, such a website contains a large text collection for some restricted domain. The owners of such websites are usually not specialists in computational linguistics, and they have neither the tools nor the time for the highly skilled work of applying natural language processing.

Researchers have carried out various studies on search engine optimization. As a result of these efforts, some document processing techniques used by search engines rely on metadata. It is assumed that the metadata is created by the team that manages the website (usually the webmaster). Still, when a website contains a large text collection (roughly, more than 100 texts or more than 1,000,000 words) on a restricted topic, it takes a lot of time to create and publish its content for the semantic web. The problem is aggravated by the fact that there are two types of data for each piece of information: one for humans and the other for computers. Note that the volume of data rules out manual processing by professional linguists.

The main idea of our research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a given website.

We carry out this analysis by applying heuristic algorithms that do not need a large amount of manually prepared training data and that quickly lead to results useful to both humans and search engines.
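As a rough illustration of this kind of quantitative analysis, the sketch below builds a rank-frequency table for a small text collection and estimates the slope of the log-log rank-frequency curve, which the classical Zipf law predicts to be close to -1. It is only a minimal sketch under stated assumptions: the function names, the tokenization rule, and the least-squares fit are illustrative choices and are not taken from the authors' DDL toolkit.

```python
import math
import re
from collections import Counter


def rank_frequency(texts):
    """Return (word, rank, count) triples for a collection, most frequent first.

    Tokenization is an assumption: lowercase runs of Unicode letters.
    Ties in count are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, rank, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) against log(rank).

    A slope near -1 corresponds to the classical Zipf law.
    """
    if len(table) < 2:
        raise ValueError("need at least two ranked words to fit a slope")
    xs = [math.log(rank) for _, rank, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical toy collection; a real one would hold whole documents.
    collection = ["the cat and the dog", "the dog"]
    table = rank_frequency(collection)
    for word, rank, count in table:
        print(rank, word, count)
    print("slope:", zipf_slope(table))
```

On a real collection, the slope would be estimated over many more words, and how well a straight line fits the log-log curve matters as much as the slope itself.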

This analysis is useful at the initial stages of creating a digital library for a website, without an expensive commercial toolkit and without a team of linguists. There are other research possibilities in the field of the semantic web; however, we propose the simplest and cheapest approach, at least at the initial stage.

Our research was motivated by the need to create a digital library for a website devoted to the Mixtec culture, but we consider that our solutions to this type of problem can be used in similar situations. In the last phase of the project to develop a website about the Mixtec culture, a sub-site for the digital (electronic) library was prepared¹.

The website (www.cumix.org.mx) is devoted to conserving and popularizing the culture of the Mixtec ethnic group. The Mixtecs are a national minority in the southern part of Mexico. The digital library contains all available text documents on the history, culture, and modern life of this ethnic group. The website is non-commercial and freely accessible, as is its digital library.

There are several problems related to obtaining texts for the website. The first is free access (without payment and even without registration): authors are reluctant to donate their literary works because in the majority of cases these are their source of income. Another is that authors prefer publishing in journals with a high impact factor rather than on websites. Nevertheless, authors of documents on non-commercial websites have the advantage that their materials are more accessible to users and that searches for the materials they have placed on such sites are more effective. In addition to the materials that have been specially prepared for the site and adapted to its potential users, there is a need to include materials from collections of text documents that belong to the domain of the site. Usually, these documents are represented on the Internet by freely accessible abstracts.

At the same time, conscientious authors of text materials offered for commercial access on the Internet want potential users to understand the content of their materials better, so that a user is confident and well informed when making a purchase decision. In many cases, users feel that the abstract is insufficient to represent the full document. This can be partly explained by the subjectivity of the authors’ approach to writing an abstract.
