In this chapter the authors discuss an application of an immune-based algorithm to the extraction and visualization of cluster structure in large collections of documents. In particular, a hierarchical, topic-sensitive approach is proposed; it appears to be a robust solution, in terms of both time and space complexity, to the problem of scalability of the document map generation process. The approach relies upon extraction of a hierarchy of concepts, that is, almost homogeneous groups of documents described by unique sets of terms. To represent the content of each context, a modified version of the aiNet algorithm is employed; it was chosen because of its flexibility in representing natural clusters existing in a training set. To speed up the learning phase, a smart method of initializing the immune memory is proposed, and further modifications of the entire algorithm are introduced. A careful evaluation of the effectiveness of the novel text clustering procedure is presented in the section reporting experiments.
Information retrieval is a field devoted to developing tools that provide fast and efficient access to unstructured information in various corporate, scientific and governmental domains; see, e.g., (Manning, Raghavan, & Schütze, 2008). Recent attempts to explain and model information-seeking behavior in humans compare it to the food-searching activity performed by animals. A fusion of optimal foraging theory (developed within ecological biology) with theories of human cognition resulted in the so-called information foraging theory, proposed to understand how strategies and technologies for information seeking are (and should be) adapted to a user’s information needs (Pirolli, 2007). One of the most intriguing observations made within this theory is that, when seeking information, humans follow a so-called information scent. If the scent is sufficiently strong, a user will continue along that trail, but if the scent is weak, he/she goes back until another satisfactory trace appears. This process, called the three-click rule (Barker, 2005), is usually repeated until the user is satisfied. If so, Web pages should be equipped with a sufficiently strong information scent. This is a lesson for home page designers. Another problem is how to present the content of home pages, and in general the content of Web resources, to users with different information needs. WebBook and WebForager are examples of how to implement the theories of information foraging in an interactive visualization (Card, Robertson & York, 1996). Both systems try to visualize the content of the WWW in a smart way by using the concepts of clustering (i.e. grouping) and visualization of huge collections of data.
The idea of grouping documents has its roots in the so-called Cluster Hypothesis (Rijsbergen, 1979), according to which relevant documents tend to be highly similar to each other (and therefore tend to appear in the same clusters). Thus, it is possible to reduce the number of documents that need to be compared to a given query, as it suffices to match the query against cluster representatives first. In the case of document collections pertaining to different themes, one can imagine a hierarchical clustering, in which we identify rough categories first and then refine these categories into sub-categories, sub-sub-categories, and so on. However, such an approach offers only a technical improvement in searching for relevant documents, as we obtain something like a “nested index” representing the content of the whole collection.
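The two-stage matching suggested by the Cluster Hypothesis can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the chapter: the whitespace tokenizer, the term-frequency vectors, the cosine measure, the use of a summed vector as the cluster representative, and the names `tf_vector`, `centroid` and `two_stage_search` are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from a naive whitespace tokenizer (an assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Cluster representative: the sum of the member vectors.

    Cosine similarity ignores vector length, so the sum ranks like the mean.
    """
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def two_stage_search(query, clusters):
    """Pick the best cluster by its representative, then rank only its documents."""
    q = tf_vector(query)
    reps = {name: centroid([tf_vector(d) for d in docs])
            for name, docs in clusters.items()}
    best = max(reps, key=lambda name: cosine(q, reps[name]))
    ranked = sorted(clusters[best],
                    key=lambda d: cosine(q, tf_vector(d)),
                    reverse=True)
    return best, ranked


# Toy collection, invented for illustration only.
clusters = {
    "immune": ["immune network antibody clustering", "antibody binds antigen"],
    "maps": ["self organizing map of documents", "document map visualization"],
}
best, ranked = two_stage_search("antibody clustering", clusters)
```

Here the query is compared with two representatives rather than with all four documents, and only the documents of the winning cluster are ranked. The saving grows with the collection, at the price of missing relevant documents that fall outside the selected cluster.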
A more radical solution can be obtained by using so-called document maps (Becks, 2001), where a graphical representation conveys information about the relationships among individual documents or groups of documents. This way, apart from presenting clusters, we gain an additional “dimension”: visualization of a certain similarity among the documents. A well-known representative of this formalism is WEBSOM, a system for organizing collections of text documents onto meaningful maps for exploration and search (Kohonen et al., 2000). The system uses Kohonen’s (2001) SOM (Self-Organizing Map) algorithm, which automatically organizes the documents onto a two-dimensional grid so that related documents appear close to each other. Although it allows analyzing collections of up to one million documents, its main drawbacks are high time and space complexity, which raises questions about the scaling and updating of document maps.
The problem of document map creation is closely related to Web mining (Chakrabarti, 2002), whose nature can be characterized as extracting nontrivial, previously unknown, and potentially useful information from a given set of Web sites. Document maps are developed in the framework of Information Visualization, or IV for brevity, a process that “aims at an interactive visualization of abstract non-spatial phenomena such as bibliographic data sets, web access patterns, etc.” (Börner, Chen & Boyack, 2003). The principal idea in IV relies upon displaying inter-document similarity by representing the entire document collection as 2-dimensional points on a plane in such a way that the relative distance between the points represents the similarity between the corresponding documents.
Key Terms in this Chapter
Topic: A label (or label vector) describing common characteristics of a subset of documents (e.g. suitability of texts for different age groups). Equivalently, it refers to the thematic homogeneity of the subset of documents.
Adaptive Clustering: A clustering approach that is able to dynamically modify the document representation and the similarity measure on the basis of local contexts discovered in the document collection.
Information Visualization: A discipline devoted to the problems of human-oriented representation and exploration of large data sets. The tools developed here offer graphical means supporting quick and efficient solutions to these problems.
Competitive Learning: A paradigm used in structured (networked) environments, based on the psychological assumption of Hebb (1949) that neurons stimulated by similar effectors have a stronger functional relationship. In neural clustering models (e.g. SOM, GNG) this assumption means that during the training process not only the weights of a single neuron are modified, but also the weights of its neighboring neurons.
Context: A sufficiently large (for the sake of statistics) set of documents with sufficiently uniform topics. Each document can belong to several different contexts, depending on the topics it covers.
Vector Space Model: One of the classical representations of document content. The documents are points (or vectors rooted in the coordinate origin) in a high-dimensional space whose coordinate axes are the terms, with the point (vector) coordinates reflecting the frequencies of different terms (linearly or in a more complex manner) in a given document.
Fuzzy Clustering: Clustering which splits data into overlapping clusters, where each object can belong, to some degree, to more than one cluster. Thus each cluster is treated as a fuzzy subset of objects, and the membership function defining this subset represents the degree of membership of each item in the subset. A major algorithm based on the fuzzy paradigm is Fuzzy K-Means (Bezdek & Pal, 1992).
Immune-Based Clustering: Clustering that exploits the immune principle of producing antibodies that bind antigens. Here the antigens correspond to the input data and the antibodies are workable characteristics of groups of data. In the so-called idiotypic network paradigm, antibodies bind not only antigens but also similar antibodies, creating a structure of clusters. It is an example of a self-organizing evolutionary algorithm. Contrary to many existing clustering algorithms, it does not require the number of classes to be given in advance, and it easily adapts to problems of incremental learning.
Document Map: A 2D map representing relationships among the documents from a given collection. Usually, the closer the items on the map surface, the more (semantically) similar their contents.
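The degrees of membership mentioned under Fuzzy Clustering can be made concrete with a small sketch. It uses the standard fuzzy c-means membership formula, in which the membership of a point in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m−1)), where d_i is the distance of the point to the center of cluster i. The Euclidean distance, the fuzzifier value m = 2, the toy centers and the function name `fuzzy_memberships` are assumptions made for illustration; the chapter does not prescribe them.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership degrees of `point` in each cluster (fuzzy c-means formula).

    m > 1 is the fuzzifier; m = 2 is an assumed, commonly used default.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster
    # (shared equally if several centers coincide with the point).
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


# Toy example: two cluster centers on a line, one point between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers)
```

With these toy values the point lies three times closer to the first center, so most of its membership goes to the first cluster while a small share remains with the second; the memberships always sum to one.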