Design and Development of a Taxonomy Generator: A Case Example for Greenstone

Yin-Leng Theng (Nanyang Technological University, Singapore), Nyein Chan Lwin Lwin (Nanyang Technological University, Singapore), Jin-Cheon Na (Nanyang Technological University, Singapore), Schubert Foo (Nanyang Technological University, Singapore) and Dion Hoe-Lian Goh (Nanyang Technological University, Singapore)
DOI: 10.4018/978-1-59904-879-6.ch008

This chapter addresses the issues of resource discovery in digital libraries (DLs) and the importance of knowledge organization tools in building DLs. Using the Greenstone digital library (GSDL) software as a case example, we describe a taxonomy generation tool (TGT) prototype, a module for hierarchical classification of contents, designed and built to categorize contents within DLs. TGT was developed as a desktop application in Visual C# on the Microsoft .NET Framework 2.0, using object-oriented programming. TGT implements Z39.19, which provides standard guidelines for constructing, formatting, and managing monolingual controlled vocabularies, including the use of broader terms, narrower terms, and related terms and their semantic relationships, and it uses the simple knowledge organization system (SKOS) for vocabulary specification. An XML schema definition was designed to validate against the rules developed for the XML taxonomy template; as a result, the generated taxonomy template supports controlled vocabulary terms and allows users to select the labels for the taxonomy structure. A pilot user study was then conducted to evaluate the usability and usefulness of TGT and the taxonomy template. In this study, we observed four subjects using TGT, followed by a focus group for comments. Initial feedback was positive, indicating the importance of having a taxonomy structure in GSDL. Recommendations for future work include adding content classification and metadata technologies to TGT.
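To make the idea of a rule-checked XML taxonomy template concrete, the following is a minimal sketch in Python rather than the C# used for TGT. The element and attribute names (taxonomy, term, id, prefLabel, broader, related) are assumptions chosen for illustration and are not TGT's actual schema. The hand-written checks stand in for XML schema validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: the tag and attribute names are illustrative only,
# loosely following SKOS vocabulary (prefLabel, broader, related).
TEMPLATE = """\
<taxonomy>
  <term id="science" prefLabel="Science"/>
  <term id="physics" prefLabel="Physics" broader="science" related="chemistry"/>
  <term id="chemistry" prefLabel="Chemistry" broader="science" related="physics"/>
</taxonomy>
"""


def check_template(xml_text):
    """Return a list of rule violations found in a taxonomy template."""
    root = ET.fromstring(xml_text)
    terms = root.findall("term")
    ids = [t.get("id", "") for t in terms]
    known = set(ids)
    errors = []

    # Rule 1: every term id must be unique.
    for term_id in sorted(known):
        if ids.count(term_id) > 1:
            errors.append(f"duplicate term id: {term_id}")

    for t in terms:
        term_id = t.get("id", "")
        # Rule 2: every term needs a preferred label.
        if not t.get("prefLabel"):
            errors.append(f"{term_id}: missing prefLabel")
        # Rule 3: broader and related references must point at defined terms.
        for rel in ("broader", "related"):
            target = t.get(rel)
            if target is not None and target not in known:
                errors.append(f"{term_id}: unknown {rel} term '{target}'")
        # Rule 4: a term cannot be its own broader term.
        if t.get("broader") == term_id:
            errors.append(f"{term_id}: term is its own broader term")
    return errors


def narrower_terms(xml_text, term_id):
    """Derive the narrower terms of term_id from the broader attributes."""
    root = ET.fromstring(xml_text)
    return sorted(
        t.get("id") for t in root.findall("term") if t.get("broader") == term_id
    )


if __name__ == "__main__":
    print(check_template(TEMPLATE))
    print(narrower_terms(TEMPLATE, "science"))
```

In this sketch only the broader relationship is stored and the narrower terms are derived from it, so the two directions cannot disagree. A real schema might store both.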
Chapter Preview

Overview Of Greenstone

Greenstone is a software suite designed to build and distribute DL collections for publishing on the Internet or on CD-ROM. It is an open-source application released under the terms of the GNU General Public License (GPL) and is particularly easy to install and use (Witten, 2003). Through cooperation with UNESCO and Human Info, the Greenstone project has been able to support user testing, internationalization, and the mounting of courses (Witten & Bainbridge, 2005). Greenstone is an important tool in the context of UNESCO’s goal of preserving and distributing the educational, scientific, and cultural information of developing countries. The core facilities that Greenstone aims to provide are the design and construction of document collections and their distribution on the Web and/or CD-ROM, together with customizable structure based on available metadata, an easy-to-use collection-building interface, multilingual support, and multiplatform operation (Witten & Bainbridge, 2005). Although Greenstone initially focused on helping developing countries, its user base has expanded to 70 countries and the reader’s interface has been translated into 45 languages to date, with downloads rising from a steady 4,500 per month to 6,500 per month over the last 2 years (Witten & Bainbridge, 2007). Greenstone’s popularity comes from a simple, user-friendly interface providing:

Key Terms in this Chapter

Greenstone: Greenstone is produced under the New Zealand Digital Library Project, a research project for text compression at the University of Waikato. It focuses on personalization and construction of the digital collection from end-user perspectives.

Usefulness: The definition of usefulness is debatable, and some authors distinguish it from usability. Although it is impossible to quantify the usefulness of a system, attempts have been made to measure it against system specifications and the extent to which the system covers end users’ tasks, rather than through end-user performance testing.

Usability: ISO 9241-11 defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.” Usability of hypertext/Web is commonly measured using established usability dimensions covering categories of usability defects such as screen design, terminology and system information, system capabilities and user control, navigation, and completing tasks.

Fedora: It was originally implemented as a DARPA- and NSF-funded research project at Cornell University and later funded by the Andrew W. Mellon Foundation. Fedora offers a service-oriented architecture by providing a powerful digital object model which supports multiple views for digital objects.

DSpace: It is jointly implemented by Massachusetts Institute of Technology (MIT) and Hewlett-Packard (HP) laboratories and was released in November 2002. DSpace aims to provide a digital institutional repository system to capture, store, index, preserve, and redistribute an organization’s research data.

Metadata: A set of attributes that describes the content, quality, condition, and other characteristics of a resource.

Digital Libraries: They mean different things to different people. The design of digital libraries is, therefore, dependent on perceptions of the purpose and functionality of digital libraries. To the library science community, the roles of traditional libraries are to: (a) provide access to information in any format that has been evaluated, organized, archived, and preserved; (b) have information professionals who make judgments and interpret users’ needs; and (c) provide services and resources to people (e.g., students, faculty, and others). To the computer science community, digital libraries may refer to a distributed text-based information system, a collection of distributed information services, a distributed space of interlinked information systems, or a networked multimedia information system.

Taxonomy: According to the definition by ANSI/NISO (2005), a taxonomy is a collection of controlled vocabulary terms organized into a hierarchical structure, with each term having one or more parent/child (broader/narrower) relationships to other terms. It gives a systematic, high-level view of contents and provides users with a roadmap for discovering the knowledge available. Taxonomies can appear as lists, trees, hierarchies, polyhierarchies, matrices, facets, or system maps.
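The parent/child structure in this definition can be illustrated with a small Python sketch. The Taxonomy class, its method names, and the example terms are hypothetical and chosen only for illustration. The sketch assumes that a term may have more than one broader term (a polyhierarchy) and that the relationships contain no cycles.

```python
class Taxonomy:
    """Controlled vocabulary terms linked by broader/narrower relationships."""

    def __init__(self):
        self.terms = set()
        self.broader = {}   # term -> set of parent (broader) terms
        self.narrower = {}  # term -> set of child (narrower) terms

    def add_term(self, term, broader=()):
        """Add a term and, optionally, link it to one or more broader terms."""
        self.terms.add(term)
        for parent in broader:
            self.terms.add(parent)
            self.broader.setdefault(term, set()).add(parent)
            self.narrower.setdefault(parent, set()).add(term)

    def ancestors(self, term):
        """Return every term reachable by following broader links upward."""
        seen = set()
        stack = list(self.broader.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.broader.get(current, ()))
        return seen

    def top_terms(self):
        """Return the terms that have no broader term."""
        return sorted(t for t in self.terms if not self.broader.get(t))

    def as_tree(self, term, depth=0):
        """Return an indented tree view, one line per term, rooted at term."""
        lines = ["  " * depth + term]
        for child in sorted(self.narrower.get(term, ())):
            lines.extend(self.as_tree(child, depth + 1))
        return lines


# Example vocabulary (illustrative terms only).
tax = Taxonomy()
tax.add_term("Science")
tax.add_term("Physics", broader=["Science"])
tax.add_term("Chemistry", broader=["Science"])
# A polyhierarchy: this term sits under two broader terms.
tax.add_term("Physical chemistry", broader=["Physics", "Chemistry"])

if __name__ == "__main__":
    print("\n".join(tax.as_tree("Science")))
```

Because "Physical chemistry" has two broader terms, it appears twice in the tree view, once under each parent. That is how a polyhierarchy differs from a strict tree.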
