SOM-Based Clustering of Multilingual Documents Using an Ontology

Minh Hai Pham (Swiss Federal Institute of Technology, Switzerland), Delphine Bernhard (Laboratoire TIMC-IMAG, France), Gayo Diallo (Laboratoire TIMC-IMAG, France), Radja Messai (Laboratoire TIMC-IMAG, France) and Michel Simonet (Laboratoire TIMC-IMAG, France)
DOI: 10.4018/978-1-59904-618-1.ch004

Clustering similar documents is a difficult task in text data mining. The difficulty stems largely from the way documents are translated into numerical vectors. In this paper, we present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it relies not on the words shared by documents but on concepts taken from an ontology. Our goal is to cluster medical documents into thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before the SOM algorithm is applied, documents go through several pre-processing steps. First, textual data are extracted from the documents, which can be in either PDF or HTML format. Documents are then indexed using two kinds of indexing units: stems and concepts. After indexing, each document is represented numerically by a vector whose dimensions correspond to indexing units and whose components store the weight of each indexing unit within that document. These vectors are given as input to a SOM, which arranges the corresponding documents on a two-dimensional map. We compare the results for two indexing schemes, stem-based indexing and conceptual indexing, and show that using an ontology for document clustering has several advantages. First, it becomes possible to cluster documents written in several languages, since concepts are language-independent. This is especially helpful in the medical domain, where research articles are written in different languages. Second, the use of concepts helps reduce the size of the vectors, which in turn reduces processing time.
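
The pipeline described in the abstract (conceptual indexing, weighted vectors, SOM placement on a two-dimensional map) can be illustrated with a small sketch. This is not the authors' implementation. The term-to-concept lexicon, the concept identifiers, the frequency-based weighting, the map size and the training schedule are all assumptions chosen for illustration; the chapter itself would need to be consulted for the actual ontology and weighting scheme.

```python
import math
import random

# Hypothetical toy lexicon (an assumption, not the chapter's ontology):
# English and French terms are mapped to language-independent concept ids.
TERM_TO_CONCEPT = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "vein": "C_VEIN", "veine": "C_VEIN",
    "lung": "C_LUNG", "poumon": "C_LUNG",
    "asthma": "C_ASTHMA", "asthme": "C_ASTHMA",
}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))


def concept_vector(text):
    """Return a length-normalised vector of concept frequencies for a text."""
    counts = [0.0] * len(CONCEPTS)
    for token in text.lower().split():
        concept = TERM_TO_CONCEPT.get(token)
        if concept is not None:
            counts[CONCEPTS.index(concept)] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def best_matching_unit(weights, vec):
    """Return the (row, col) of the map unit whose weights are closest to vec."""
    return min(
        weights,
        key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], vec)),
    )


def train_som(vectors, rows=2, cols=2, epochs=200, seed=0):
    """Train a small SOM with a linearly decaying learning rate and radius."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 towards 0
        lr = 0.5 * frac                      # learning rate
        radius = max(rows, cols) / 2.0 * frac  # neighbourhood radius
        for vec in vectors:
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (vec[i] - w[i])
    return weights


# Usage: two English and two French toy documents.
docs = ["heart vein", "coeur veine", "lung asthma", "poumon asthme"]
vectors = [concept_vector(d) for d in docs]
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(doc, "->", best_matching_unit(som, vec))
```

Because the English and French documents on the same topic map to the same concepts, they receive identical vectors and therefore land on the same map unit, which is the language-independence property the abstract points to. The vector length equals the number of concepts rather than the number of distinct word stems, which is why conceptual indexing tends to give smaller vectors.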

Complete Chapter List

Table of Contents
Hector Oscar Nigro, Sandra Elizabeth Gonzalez Cisaro, Daniel Hugo Xodo
Chapter 1
Sofia Stamou, Alexandros Ntoulas, Dimitris Christodoulakis
In this paper we study how we can organize the continuously proliferating Web content into topical categories, also known as Web directories. In...
TODE: An Ontology-Based Model for the Dynamic Population of Web Directories
Chapter 2
Xuan Zhou, James Geller
This chapter introduces Raising as an operation which is used as a pre-processing step for Data Mining. In the Web Marketing Project, people’s...
Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology
Chapter 3
Brigitte Trousse, Marie-Aude Aufaure, Bénédicte Le Grand, Yves Lechevallier, Florent Masseglia
This chapter proposes an original approach for ontology management in the context of Web-based information systems. Our approach relies on the usage...
Web Usage Mining for Ontology Management
Chapter 4
Minh Hai Pham, Delphine Bernhard, Gayo Diallo, Radja Messai, Michel Simonet
Clustering similar documents is a difficult task for text data mining. Difficulties stem especially from the way documents are translated into...
SOM-Based Clustering of Multilingual Documents Using an Ontology
Chapter 5
Ana Isabel Canhoto
The use of automated systems to collect, process and analyse vast amounts of data is now integral to the operations of many corporations and...
Ontology-Based Interpretation and Validation of Mined Knowledge: Normative and Cognitive Factors in Data Mining
Chapter 6
Amandeep S. Sidhu, Tharam S. Dillon, Elizabeth Chang
Traditional approaches to integrate protein data generally involved keyword searches, which immediately excludes unannotated or poorly annotated...
Data Integration Through Protein Ontology
Chapter 7
Josiane Mothe, Nathalie Hernandez
This chapter introduces a method re-using a thesaurus built for a given domain, in order to create new resources of a higher semantic level in the...
TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology
Chapter 8
Stanley Loh, Daniel Lichtnow, Thyago Borges, Gustavo Piltcher
This chapter investigates different aspects in the construction of a domain ontology for a content-based recommender system. The recommender systems...
Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts
Chapter 9
Vania Bogorny, Paulo Martins Engel, Luis Otavio Alavares
This chapter introduces the problem of mining frequent geographic patterns and spatial association rules from geographic databases. In the...
Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies
Chapter 10
Peter Brezany, Ivan Janciak, A Min Tjoa
This chapter introduces an ontology-based framework for automated construction of complex interactive data mining workflows as a means of improving...
Ontology-Based Construction of Grid Data Mining Workflows
Chapter 11
Shastri L. Nimmagadda, Heinz Dreher
Several issues of database organization of petroleum industries have been highlighted. Complex geo-spatial heterogeneous data structures complicate...
Ontology-Based Data Warehousing and Mining Approaches in Petroleum Industries
Chapter 12
Evangelos Kotsifakos, Gerasimos Marketos, Yannis Theodoridis
Pattern Base Management Systems (PBMS) have been introduced as an effective way to manage the high volume of patterns available nowadays. PBMS...
A Framework for Integrating Ontologies and Pattern-Bases
About the Contributors