Introduction
Document clustering, unlike document classification, is an unsupervised learning process, meaning that no prior information about the documents is available, including the number of document groups (usually called k). It organizes textual documents into meaningful groups that represent the topics in a document collection. As a result, the documents within a cluster are similar to one another, while documents from different clusters are dissimilar.
Document clustering was originally studied to enhance the performance of information retrieval (IR) because similar documents tend to be relevant to the same user queries (Wang et al., 2002; Zamir & Etzioni, 1998). Document clustering has been used to facilitate nearest-neighbor search (Buckley & Lewit, 1985), to support an interactive document browsing paradigm (Cutting et al., 1992; Gruber, 1993; Koller & Sahami, 1997), and to construct hierarchical topic structures (van Rijsbergen, 1979). Thus, document clustering plays an increasingly important role for the IR and text mining communities, since the most natural form for storing information is text and the amount of text information has increased exponentially.
In the biomedical domain, document clustering technologies have been used to facilitate the practice of evidence-based medicine. This is because document clustering enhances biomedical literature searching (e.g., MEDLINE searching) in several ways, and literature searching is one of the core skills required for the practice of evidence-based medicine (Evidence-based Medicine Working Group, 1992). For example, Pratt and her colleagues (Pratt et al., 1999; Pratt & Fagan, 2000) and Lin and Demner-Fushman (2007) introduced semantic document clustering approaches that automatically cluster biomedical literature (MEDLINE) search results into document groups, giving users a better understanding of those results.
Current information technologies allow us to acquire, store, archive, and retrieve documents electronically. For this reason, document clustering has received considerable attention: it helps users discover hidden similarities and key concepts in documents. One of the most serious problems that makes clustering text difficult is that the size of text collections in digital libraries is increasing rapidly. To handle growing document collections, a clustering algorithm must not only solve the incremental problem but also remain efficient on large datasets.
Most document clustering algorithms require some form of data pre-processing, including stop-word removal and feature selection. Through pre-processing, unimportant features are eliminated and the original dimension is reduced to a more manageable size. However, pre-processing has two problems. First, although it reduces the original dimension size, the reduced feature space is still sparse, a symptom of “the curse of dimensionality”. As a result, clustering results are often of low quality. Second, reducing the dimensionality through pre-processing may fail to preserve the original topological structure of the input data.
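The pre-processing step described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the stop-word list, the `min_df` document-frequency threshold used as the feature-selection rule, and the helper names `tokenize`, `preprocess`, and `sparsity` are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "and", "to", "for"}


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    return [tok.strip(".,;:()\"'") for tok in text.lower().split()]


def preprocess(docs, min_df=2):
    """Remove stop words, then keep only terms found in at least min_df documents.

    Document-frequency thresholding stands in for feature selection here;
    the source text does not say which selection method is used.
    """
    token_lists = [
        [tok for tok in tokenize(doc) if tok and tok not in STOP_WORDS]
        for doc in docs
    ]
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    # Term-frequency vectors over the reduced vocabulary.
    vectors = [[tokens.count(term) for term in vocab] for tokens in token_lists]
    return vocab, vectors


def sparsity(vectors):
    """Return the fraction of zero entries in the document-term matrix."""
    cells = [value for row in vectors for value in row]
    if not cells:
        return 0.0
    return sum(1 for value in cells if value == 0) / len(cells)


docs = [
    "Clustering of biomedical literature in MEDLINE",
    "Retrieval of biomedical documents for evidence-based medicine",
    "Clustering documents is an unsupervised process",
]
vocab, vectors = preprocess(docs)
print(vocab)
print(vectors)
print(sparsity(vectors))
```

Even in this toy collection, a third of the entries in the reduced document-term matrix are zero. In a realistic collection the vocabulary that survives selection is far larger than the number of distinct terms in any single document, so the matrix stays sparse after pre-processing.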