Text Mining Methods for Hierarchical Document Indexing

Han-Joon Kim
Copyright: © 2009 |Pages: 9
DOI: 10.4018/978-1-60566-010-3.ch299

Abstract

We have recently seen tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such large document collections is to categorize documents hierarchically according to topic (Agrawal, Bayardo & Srikant, 2000; Kim & Lee, 2003). Documents indexed according to a hierarchical structure (termed a ‘topic hierarchy’ or ‘taxonomy’) are kept in internal categories as well as in leaf categories, with documents in lower categories being increasingly specific. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of web directories such as Yahoo (http://www.yahoo.com/) and the Open Directory Project (http://www.dmoz.org/), topic hierarchies have grown in importance as a tool for organizing and browsing large volumes of electronic text documents. Currently, the topic hierarchies of most information systems are constructed and maintained manually by human editors. A topic hierarchy must be continuously subdivided to cope with the high rate of increase in the number of electronic documents. For example, the topic hierarchy of the Open Directory Project has now reached about 590,000 categories. However, manually maintaining the hierarchical structure incurs several problems. First, such a manual task is prohibitively costly as well as time-consuming. Large search portals such as Yahoo have so far invested significant time and money in maintaining their taxonomies, but manual effort of this kind cannot keep up with the pace of growth and change in electronic documents.
Moreover, for a dynamic networked resource (e.g., the World Wide Web) that contains highly heterogeneous documents whose content changes frequently, maintaining a ‘good’ hierarchy is fraught with difficulty and is often beyond the capabilities of human experts. Lastly, since human editors’ categorization decisions are not only highly subjective but also vary over time, it is difficult to maintain a reliable and consistent hierarchical structure. These limitations call for information systems that can provide intelligent organization capabilities with topic hierarchies. Related commercial systems include Verity Knowledge Organizer (http://www.verity.com/), Inktomi Directory Engine (http://www.inktomi.com/), and Inxight Categorizer (http://www.inxight.com/), which enable a browsable web directory to be built automatically. However, these systems did not address at all the (semi-)automatic evolution of organizational schemes and classification models. This is one of the reasons why commercial taxonomy-based services tend to be less popular than their manually constructed counterparts, such as Yahoo.
Chapter Preview

Background

Future systems will need to let users easily manipulate the hierarchical structure and the placement of documents within it (Aggarwal, Gates & Yu, 1999; Agrawal, Bayardo & Srikant, 2000). In this regard, this section presents three critical requirements for intelligent taxonomy construction and then describes a taxonomy construction process that uses text-mining techniques.
