SOM-Based Clustering of Multilingual Documents Using an Ontology
Minh Hai Pham (Swiss Federal Institute of Technology, Switzerland), Delphine Bernhard (Laboratoire TIMC-IMAG, France), Gayo Diallo (Laboratoire TIMC-IMAG, France), Radja Messai (Laboratoire TIMC-IMAG, France) and Michel Simonet (Laboratoire TIMC-IMAG, France)
Copyright: © 2008
Clustering similar documents is a difficult task in text data mining. The difficulty stems largely from the way documents are translated into numerical vectors. In this paper, we present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it relies not on the words shared by documents but on concepts taken from an ontology. Our goal is to cluster medical documents into thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before the SOM algorithm is applied, documents go through several pre-processing steps. First, textual data are extracted from the documents, which can be in either PDF or HTML format. Documents are then indexed using two kinds of indexing units: stems and concepts. After indexing, each document is represented numerically by a vector whose dimensions correspond to indexing units and whose components store the weight of each indexing unit within that document. These vectors are given as inputs to a SOM, which arranges the corresponding documents on a two-dimensional map. We compare the results for two indexing schemes: stem-based indexing and conceptual indexing. We show that using an ontology for document clustering has several advantages. First, documents written in several languages can be clustered together, since concepts are language-independent. This is especially helpful in the medical domain, where research articles are written in different languages. Second, the use of concepts reduces the size of the vectors, which in turn reduces processing time.
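
The pipeline described in the abstract (concept indexing, vector construction, SOM placement) can be illustrated with a minimal sketch. It is not the authors' implementation. The term-to-concept table, the concept identifiers, the relative-frequency weighting, the 3x3 map size, and the learning-rate and radius schedules are all assumptions chosen for illustration; the sample texts are invented and accents are omitted for simplicity.

```python
import math
import random

# Hypothetical toy "ontology": terms in two languages map to shared concept IDs.
TERM_TO_CONCEPT = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "artery": "C_ARTERY", "artere": "C_ARTERY",
    "lung": "C_LUNG", "poumon": "C_LUNG",
    "asthma": "C_ASTHMA", "asthme": "C_ASTHMA",
}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))


def concept_vector(text):
    """Return relative concept frequencies (an assumed weighting scheme)."""
    counts = dict.fromkeys(CONCEPTS, 0)
    for token in text.lower().split():
        concept = TERM_TO_CONCEPT.get(token)
        if concept is not None:
            counts[concept] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for empty documents
    return [counts[c] / total for c in CONCEPTS]


def best_matching_unit(weights, vector):
    """Return the (row, col) of the map node whose weights are closest to the vector."""
    def sq_dist(node):
        return sum((w - v) ** 2 for w, v in zip(weights[node], vector))
    return min(weights, key=sq_dist)


def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                          # decaying learning rate
        radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5  # shrinking neighbourhood
        for vector in rng.sample(vectors, len(vectors)):   # shuffled presentation order
            br, bc = best_matching_unit(weights, vector)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-grid_d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


# Invented sample documents: two topics, each in English and French.
docs = {
    "en_cardio": "heart artery heart",
    "fr_cardio": "coeur artere coeur",
    "en_pulmo": "lung asthma lung",
    "fr_pulmo": "poumon asthme poumon",
}
vectors = {name: concept_vector(text) for name, text in docs.items()}
som = train_som(list(vectors.values()))
positions = {name: best_matching_unit(som, vec) for name, vec in vectors.items()}
print(positions)
```

Because the English and French versions of each document resolve to the same concepts, their vectors are identical and they are assigned to the same map node, which is the language-independence property the abstract describes. The vectors here have one dimension per concept (four) rather than one per distinct surface term (eight), a small-scale picture of the reduction in vector size.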