In this chapter, we present an application of thesauri in the field of text mining, and more specifically in automatic indexing over the set of descriptors defined by a thesaurus. We begin by presenting various definitions and a mathematical thesaurus model, and also describe various examples of real-world thesauri which are used in official institutions. We then explore the problem of thesaurus-based automatic indexing by describing its difficulties and distinguishing features and reviewing previous work in this area. Finally, we propose various lines of future research.
Automated text categorization (Sebastiani, 2002) is a successful subfield of information management and has to date been the subject of more than six hundred research publications during more than forty-five years of work (Sebastiani & Gabrilovich, 2007). Although it is primarily a research field in its own right (many scientists and engineers attempt to build good text classifiers using different methods), as part of text mining it can also support other tasks that extract knowledge from documents, such as document summarization, concept or entity extraction, sentiment analysis, and document clustering. Even in the field of information retrieval (van Rijsbergen, 1979), it is useful for documents to be categorized beforehand into a list of classes so that they may be retrieved more accurately using the associated keywords as meta-information.
Text documents, like natural language in general, suffer from the problems of lexical ambiguity or polysemy (i.e. words or expressions having more than one meaning, which is a special case of homonymy) and synonymy (different terms or expressions for the same concept). Additionally, the most common representation used for a text document is the bag-of-words approach (a model similar to the “first-order word approximation” used by Shannon in 1948), which reduces a document to a list of unrelated terms and usually loses the contextual meaning and structure of certain expressions present in the text.
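The loss of structure caused by the bag-of-words representation can be seen in a minimal sketch. The tokenizer and the function name `bag_of_words` are our own illustrative assumptions, not part of any particular indexing system.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Reduce a text to term counts, discarding word order (illustrative helper)."""
    # Assumption: a crude tokenizer that keeps only runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

# Both sentences contain the same terms with the same frequencies,
# so their bags of words are identical although their meanings differ.
print(a == b)
```

Since only term frequencies survive, any information carried by word order or multi-word expressions is no longer available to later processing steps.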
One tool that has been intrinsically designed to avoid ambiguities of any kind is a thesaurus. This is a set of terms with orthogonal meanings together with a set of hierarchical relationships between them. A thesaurus can be very useful in different areas of text mining (Berry, 2003) by removing ambiguity and identifying a document’s context. In this chapter, we present some thesaurus basics, a formal characterization, and several examples of real-world thesauri. We will then focus on the problem of automatic indexing on a domain of categories defined on a thesaurus.
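As a rough illustration of this idea, a thesaurus can be sketched as a mapping from each descriptor to its broader descriptor. All descriptors below are hypothetical, and the table of non-preferred terms (synonyms redirected to a descriptor) is an added assumption that shows how synonymy might be resolved; it is not the formal model of this chapter.

```python
# Hypothetical mini-thesaurus: every term below is purely illustrative.
# Each descriptor points to its broader descriptor (None marks a top term).
BROADER = {
    "river fishing": "fishing",
    "sea fishing": "fishing",
    "fishing": "primary sector",
    "primary sector": None,
}

# Assumption: non-preferred terms (synonyms) redirect to a single descriptor.
USE = {
    "angling": "river fishing",
    "fishery": "fishing",
}


def to_descriptor(term):
    """Map a term to its descriptor, or return None if it is unknown."""
    term = term.lower()
    if term in BROADER:
        return term
    return USE.get(term)


def broader_chain(descriptor):
    """List the broader descriptors from the given one up to the top term."""
    chain = []
    parent = BROADER[descriptor]
    while parent is not None:
        chain.append(parent)
        parent = BROADER[parent]
    return chain


print(to_descriptor("Angling"))
print(broader_chain("river fishing"))
```

Redirecting synonyms to one descriptor addresses synonymy, while the chain of broader descriptors gives the context in which a descriptor should be read.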
Key Terms in this Chapter
Document Indexing: Is the act of describing a document by index terms to indicate, with that metadata, what the document is about or to summarize its content. The index terms are often selected from some form of controlled vocabulary, e.g. a thesaurus.
Text Categorization: Is the task of assigning an electronic document to one or more categories, based on its contents. If only one category is assigned, we refer to that as single-label categorization. If several categories can be assigned to the document, we are dealing with multi-label categorization.
Vector Space Model: Is an algebraic model for representing documents (not only text) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
Descriptor: A word or a list of words that represents a concept without ambiguity. It can be used to retrieve documents in an information system, for instance a catalog or a search engine.
Thesaurus: A thesaurus is a list of all the important descriptors in a given domain of knowledge and, for each descriptor, the set of descriptors related to it.
Unsupervised Classification: Is a machine learning technique where a model is fit to observations. In this case, unlike in supervised classification, there is no a priori known output. In unsupervised classification, a data set of input objects is partitioned into different groups or clusters, so that the objects in each group share some common trait, e.g. proximity according to some defined distance measure.
Supervised Classification: Is a machine learning technique for creating a function from training data. The training data consist of pairs of input objects (typically vectors), and desired outputs (categories). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a “reasonable” way.
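Several of the terms above (vector space model, supervised classification, and multi-label categorization) can be illustrated together in a short sketch. It is a toy example under stated assumptions: the training texts, the descriptor labels, the helper names, and the threshold of 0.5 are all hypothetical, and the summed-profile classifier is chosen only for brevity rather than as a method proposed in this chapter.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical training data: pairs of (document text, set of descriptors).
TRAINING = [
    ("fishing quotas for the north sea fleet", {"fishing"}),
    ("subsidies for fishing boats and fleet renewal", {"fishing", "subsidies"}),
    ("farm subsidies and rural development aid", {"agriculture", "subsidies"}),
]


def build_profiles(training):
    """Sum the vectors of all training documents labeled with each descriptor."""
    profiles = {}
    for text, labels in training:
        vec = vectorize(text)
        for label in labels:
            profiles.setdefault(label, Counter()).update(vec)
    return profiles


def assign(text, profiles, threshold=0.5):
    """Return every descriptor whose profile is similar enough to the text."""
    vec = vectorize(text)
    return {
        label
        for label, profile in profiles.items()
        if cosine(vec, profile) >= threshold
    }


profiles = build_profiles(TRAINING)
print(assign("new quotas for the fishing fleet", profiles))
print(assign("subsidies for fishing boats", profiles))
```

Because a document may reach the threshold for several descriptors, the result is a set, which corresponds to multi-label categorization. A realistic system would at least remove stop words and weight terms (for example with tf-idf) before comparing vectors.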