SOM-Based Clustering of Textual Documents Using WordNet

Abdelmalek Amine, Zakaria Elberrichi, Michel Simonet, Ladjel Bellatreche, Mimoun Malki
Copyright: © 2009 | Pages: 12
DOI: 10.4018/978-1-59904-990-8.ch012

Abstract

The classification of textual documents has been the subject of many studies. Technologies such as the Web and digital libraries have driven exponential growth in the volume of available documents. Classifying textual documents matters because it lets users skim large corpora quickly and better understand their contents. Most classification approaches rely on supervised learning, which is better suited to small corpora and to settings where human experts are available to define the best classes for the training phase, which is not always feasible. Unsupervised classification, or “clustering”, methods reveal latent (hidden) classes automatically with minimal human intervention. Many such methods exist; Kohonen's SOM (Self-Organizing Map) is one unsupervised classification algorithm that gathers similar objects into groups without a priori knowledge. This chapter introduces the concept of unsupervised classification of textual documents and proposes an experiment that combines a conceptual approach to text representation with Kohonen's method for clustering.
Chapter Preview

State Of The Art

To apply any method to textual documents, we first need to represent the documents (Sebastiani, 2002), because no current learning method can directly process unstructured data (texts). Next, a similarity measure must be chosen, and finally a clustering algorithm that operates on the chosen descriptors and metric.
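
The first two choices described above (a document representation and a similarity measure) can be illustrated with a minimal sketch. It assumes a plain bag-of-words term-frequency representation and cosine similarity; this preview does not say which representation or measure the chapter actually uses, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, invented for illustration only.
docs = [
    "Oil prices rose as crude exports fell",
    "Crude oil exports fell and prices rose",
    "The central bank cut interest rates",
]
vectors = [term_frequencies(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # two texts sharing most words
sim_02 = cosine_similarity(vectors[0], vectors[2])  # two texts sharing no words
```

A clustering algorithm would then group documents using these vectors and this measure (or a distance derived from it). A conceptual representation would replace the raw word counts with counts of concepts, which this sketch does not attempt.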

Key Terms in this Chapter

Self-Organizing Maps of Kohonen: Is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map.

Reuters-21578: (corpus) Is a set of financial news dispatches published in English by the Reuters agency during 1987 and freely available on the Web. This corpus is an update, carried out in 1996, of the Reuters-22173 corpus. The texts are written in a journalistic style. A distinguishing characteristic of Reuters-21578 is that a document may be labeled with several classes. This corpus is often used as a basis for comparing document classification tools.

Text Clustering: Is the classification of texts into different groups, or more precisely, the partitioning of a set of texts into subsets (clusters), so that the texts in each subset (ideally) share some common trait, often proximity according to some defined distance measure.

WordNet: Is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. WordNet was created and is being maintained at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George A. Miller. Development began in 1985.

Map: The map seeks to preserve the topological properties of the input space.
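
To make the self-organizing map definition above more concrete, here is a minimal sketch of SOM training in plain Python. Several details are assumptions chosen for illustration and are not taken from the chapter: Euclidean distance for finding the best matching unit, a Gaussian neighborhood on the grid, linearly decaying learning rate and radius, random initialization, and the toy two-dimensional data. The names `train_som` and `best_matching_unit` are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinates of the unit whose weights are closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialized weight vector per map unit (assumed init scheme).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)  # never shrinks below 0.5
        for x in data:
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                # Units near the winner on the grid are pulled toward x too,
                # which is what lets the map preserve neighborhood structure.
                grid_dist2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy input: two tight groups of 2-D points (hypothetical data).
data = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [0.9, 1.0], [1.0, 0.9], [0.95, 0.95],
]
som = train_som(data, rows=1, cols=2)
assignments = [best_matching_unit(som, x) for x in data]
```

Each entry of `assignments` is the map unit a sample falls on; samples that share a unit form a cluster. On well-separated data like this one would expect the two groups to land on different units, but the sketch itself does not guarantee it. For documents, each input vector would be a document representation rather than a 2-D point.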
