Document Search Images in Text Collections for Restricted Domains on Websites


Pavel Makagonov (Mixtec Technological University, Mexico), Celia B. Reyes E. (Mixtec Technological University, Mexico), and Grigori Sidorov (National Polytechnic Institute, Mexico)
DOI: 10.4018/978-1-60960-881-1.ch009


The main idea of the authors’ research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a website. As a case study, they use the digital library of the website on Mixtec culture that they maintain. The authors propose using the concept of the text document search image (TDSI). To create TDSIs, they analyze word frequencies in the documents and distinguish between the Zipf distribution typical of meaningful words and distributions approximated by an ellipse, typical of auxiliary words. The authors also describe some analogies of these distributions in architecture and in urban planning. They describe a toolkit, DDL, that allows for TDSI creation and show its application to the mentioned website and to a corpus of dialogs with a railway information office system.
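The DDL toolkit itself is not reproduced here. As an illustrative sketch only (the function names and the simple least-squares fit are our assumptions, not the authors' implementation), the kind of frequency analysis described above can be approximated by building a rank-frequency profile of a document collection and measuring how closely it follows a straight line in log-log space, which is the signature of a Zipf-like distribution:

```python
from collections import Counter
import math

def rank_frequency(texts):
    """Count word frequencies across a document collection and
    return (rank, frequency) pairs sorted by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_fit_error(pairs):
    """Mean squared deviation from a least-squares line fitted in
    log-log space. A small error suggests a Zipf-like (power-law)
    rank-frequency curve; a large error suggests a different shape,
    such as the ellipse-like curves discussed for auxiliary words."""
    points = [(math.log(r), math.log(f)) for r, f in pairs if f > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / n
```

For example, `rank_frequency(["the cat sat on the mat", "the dog"])` yields `(1, 3)` as its first pair, since "the" is the most frequent word with three occurrences; passing the full pair list to `zipf_fit_error` gives a non-negative goodness-of-fit score that can be compared across word subsets.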
Chapter Preview

1. Introduction: Retrieval Problems for the Digital Library of a Non-Commercial Website

The phenomenon of information overload on the Internet means that access to the knowledge available on a website is a problem not only for search engines but also for the owners of the website. Such a website usually contains a large text collection for some restricted domain. The owners of such websites are usually not specialists in computational linguistics, and they have neither the tools nor the time for the highly skilled job of applying natural language processing.

Various studies have been carried out in the field of search engine optimization. As a result of these efforts, some document-processing techniques used by search engines include working with metadata. It is assumed that metadata creation is done by the website managing team (usually by the webmaster). Still, when a website contains a large text collection (roughly, more than 100 texts or more than 1,000,000 words) on a restricted topic, it takes a lot of time to create and publish its content for the semantic web. The problem is aggravated by the fact that there are two types of data for each piece of information: one for humans and one for computers. Note that the volume of data does not allow manual processing by professional linguists.

The main idea of our research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a given website.

We carry out this analysis by applying heuristic algorithms that do not need a large amount of manual data for learning and that rapidly lead to useful results for both humans and search engines.

This analysis is useful at the initial stages of creating a digital library for a website, without using an expensive commercial toolkit and without involving a team of linguists. There are other research possibilities in the field of the semantic web; however, we propose the simplest and cheapest approach, at least for the initial stage.

Our research was motivated by the necessity of creating a digital library for a website devoted to the Mixtec culture, but we consider that our solutions to this type of problem can be used in similar situations. In the last phase of the project for the development of the website about the Mixtec culture, a sub-site containing the digital (electronic) library was prepared1.

The website is devoted to conserving and popularizing the culture of the Mixtec ethnic group, a national minority of the southern part of Mexico. The digital library contains all available text documents on the history, culture, and modern life of this ethnic group. The website, as well as its digital library, is non-commercial and freely accessible.

There are several problems related to obtaining texts for the website. The first concerns free access (without payment and even without registration): authors are reluctant to donate their literary works because, in the majority of cases, these works are their source of income. Another reason is that authors prefer publishing in journals with a high impact factor rather than on websites. Nevertheless, authors of documents on non-commercial websites benefit from greater accessibility of their materials to users and from more effective searches for the materials they place there. In addition to the materials that have been specially prepared for the site and adapted to its potential users, there is a need to include materials from collections of text documents that belong to the domain of the site. Usually, such documents are represented on the Internet by freely accessible abstracts.

At the same time, conscientious authors of text materials offered for commercial access on the Internet are interested in the potential user better understanding the content of their materials: the user should be confident and well informed at the moment of making a purchase decision. In many cases, users feel that an abstract is insufficient as a representation of the full document. This can partly be explained by the subjectivity of the author's approach to writing an abstract.
