Applying the Immunological Network Concept to Clustering Document Collections

Applying the Immunological Network Concept to Clustering Document Collections

Krzysztof Ciesielski (Polish Academy of Sciences, Poland), Mieczyslaw A. Klopotek (Polish Academy of Sciences, Poland) and Slawomir T. Wierzchon (Polish Academy of Sciences, Poland & University of Gdansk, Poland)
DOI: 10.4018/978-1-60566-310-4.ch008
OnDemand PDF Download:


In this chapter the authors discuss an application of an immune-based algorithm for extraction and visualization of clusters structure in large collection of documents. Particularly a hierarchical, topic-sensitive approach is proposed; it appears to be a robust solution, both in terms of time and space complexity, to the problem of scalability of document map generation process. The approach relies upon extraction of a hierarchy of concepts, that is almost homogenous groups of documents described by unique sets of terms. To represent the content of each context a modified version the aiNet algorithm is employed; it was chosen because of its flexibility in representing natural clusters existing in a training set. To fasten the learning phase, a smart method of initialization of the immune memory was proposed as well as further modifications of the entire algorithm were introduced. Careful evaluation of the effectiveness of the novel text clustering procedure is presented in section reporting experiments.
Chapter Preview

1. Introduction

Information retrieval is a topic devoted to developing tools providing fast and efficient access to unstructured information in various corporate, scientific and governmental domains, consult e.g. (Manning, Raghavan, & Schütze, 2008). Recent attempts to explain and to model information seeking behavior in humans compare it to food searching activity performed by the animals. A fusion of the optimal foraging theory (developed in the frames of ecological biology) with theories of human cognition resulted in so-called information foraging theory proposed to understand how strategies and technologies for information seeking are (and should be) adapted to a user’s information needs (Pirolli, 2007). One of the most intriguing observations done within this theory is that seeking information, humans follow so-called information scent. If the scent is sufficiently strong, a user will continue to go on that trail, but if the scent is weak, he/she goes back until another satisfactory trace will appear. This process, called three-click rule (Barker, 2005), is repeated usually until the user will be satisfied. If so, Web pages should be equipped with sufficiently strong information scent. This is a lesson for home page designers. Another problem is how to present the content of the home pages, and in general the content of Web resources, to the users with different information needs. WebBook and WebForager are examples of how to implement in an interactive visualization the theories of information foraging (Card, Robertson & York, 1996). Both these systems try to visualize the content of the WWW in a smart way by using the concepts of clustering (i.e. grouping) and visualization of huge collection of data.

The idea of grouping documents takes its roots in so-called Cluster Hypothesis (Rijsbergen, 1979) according to which relevant documents tend to be highly similar to each other (and therefore tend to appear in the same clusters). Thus, it is possible to reduce the number of documents that need to be compared to a given query, as it suffices to match the query against cluster representatives first. In case of documents collections pertaining different themes one can imagine a hierarchical clustering, according to which we identify rough categories first and next we refine these categories into sub-categories, sub-sub-categories, and so-on. However such an approach offers only technical improvement in searching relevant documents, as we obtain something like “nested index” representing the content of the whole collection.

A more radical solution can be gained by using so-called document maps, (Becks, 2001), where a graphical representation allows to convey information about the relationships of individual documents or group of documents. This way, apart of clusters presentation, we gain additional “dimension”: visualization of a certain similarity among the documents. The well-known representative of such formalism is WEBSOM – a system for organizing collection of text documents onto meaningful maps for exploration and search1 (Kohonen et al., 2000). The system uses Kohonen’s (2001) SOM (Self-Organizing Map) algorithm that automatically organizes the documents onto a two-dimensional grid so that related documents appear close to each other. Although it allows analyzing collections of up to one million documents, its main drawback are large time and space complexity what raises questions of scaling and updating of document maps.

The problem of document map creation is closely related to Web mining activity (Chakrabarti, 2002). Its nature can be characterized as extracting nontrivial, previously unknown, and potentially useful information from a given set of Web sites. Document maps are developed in the framework of Information Visualization, or IV for brevity – a process that “aims at an interactive visualization of abstract non-spatial phenomena such as bibliographic data sets, web access patterns, etc.”, (Börner, Chen & Boyack, 2003). The principal idea in IV relies upon displaying inter-document similarity by representing the entire documents collection as 2-dimensional points on a plane in such a way that the relative distance between the points represents similarity between the corresponding documents2.

Key Terms in this Chapter

Top ic: A label (or label vector) describing common characteristics of a subset of documents (e.g. suitability of texts for different age groups). Equivalently it refers to thematic homogeneity of the subset of documents.

Adaptive Clustering: Clustering approach which is able to dynamically modify document representation and similarity measure, on the basis of local contexts discovered in the document collection

Information Visualization: A discipline devoted to the problems of human oriented representation and exploration of large data sets. The tools developed here offer graphical means supporting quick and efficient solutions to these problems.

Competitive Learning: A paradigm used in structured (networked) environments, based on the psychological assumption by Hebb, 1949, that neurons that are stimulated by similar effectors, have stronger functional relationship. In neural clustering models (e.g. SOM, GNG) this assumption means that during training process not only a single neuron weights are modified, but also weights of its neighboring neurons

Context: Sufficiently large (for statistics sake) set of documents with sufficiently uniform topics. Each document can belong to several different contexts, depending on topics it covers

Vector Space Model: One of classical representations of document content. The documents are points (or vectors rooted in coordinate origin) in this high-dimensional space (spanned by terms being coordinate axes), with the point (vector) coordinates reflecting frequencies of different terms (linearly or in a more complex manner) in a given document

Fuzzy Clustering: Clustering which splits data into overlapping clusters, where each object can belong, in some degree, to more than one cluster. Thus each cluster is treated as a fuzzy subset of objects and the membership function defining this subset represents the degrees of membership of each item to the subset. Major algorithm based on fuzzy paradigm is Fuzzy K-Means (Bezdek & Pal, 1992)

Immune-Based Clustering: It exploits immune-based principles of producing antibodies binding antigens. Here the antigens correspond to the input data and antibodies are workable characteristics of the groups of data. In so-called idiotypic network paradigm, antibodies bind not only anti-genes, but also similar antibodies, creating a structure of clusters. It is an example of self-organizing evolutionary algorithm. Contrary to many existing clustering algorithms it does not require prior number of classes and it easily adapts to the problems of incremental learning

Document Map: A 2D map representing relationships among the documents from a given collection. Usually the closer the items on the map surface, the more (semantically) similar their contents

Complete Chapter List

Search this Book:
Editorial Advisory Board
Table of Contents
Lipo Wang
Hongwei Mo
Chapter 1
Fabio Freschi, Carlos A. Coello Coello, Maurizio Repetto
This chapter aims to review the state of the art in algorithms of multiobjective optimization with artificial immune systems (MOAIS). As it will be... Sample PDF
Multiobjective Optimization and Artificial Immune Systems: A Review
Chapter 2
Jun Chen, Mahdi Mahfouf
The primary objective of this chapter is to introduce Artificial Immune Systems (AIS) as a relatively new bio-inspired optimization technique and to... Sample PDF
Artificial Immune Systems as a Bio-Inspired Optimization Technique and Its Engineering Applications
Chapter 3
Licheng Jiao, Maoguo Gong, Wenping Ma
Many immue-inspired algorithms are based on the abstractions of one or several immunology theories, such as clonal selection, negative selection... Sample PDF
An Artificial Immune Dynamical System for Optimization
Chapter 4
Malgorzata Lucinska, Slawomir T. Wierzchon
Multi-agent systems (MAS), consist of a number of autonomous agents, which interact with one-another. To make such interactions successful, they... Sample PDF
An Immune Inspired Algorithm for Learning Strategies in a Pursuit-Evasion Game
Chapter 5
Luis Fernando Niño Vasquez, Fredy Fernando Muñoz Mopan, Camilo Eduardo Prieto Salazar, José Guillermo Guarnizo Marín
Artificial Immune Systems (AIS) have been widely used in different fields such as robotics, computer science, and multi-agent systems with high... Sample PDF
Applications of Artificial Immune Systems in Agents
Chapter 6
Xingquan Zuo
Inspired from the robust control principle, a robust scheduling method is proposed to solve uncertain scheduling problems. The uncertain scheduling... Sample PDF
An Immune Algorithm Based Robust Scheduling Methods
Chapter 7
Fabio Freschi, Maurizio Repetto
The increasing cost of energy and the introduction of micro-generation facilities and the changes in energy production systems require new... Sample PDF
Artificial Immune System in the Management of Complex Small Scale Cogeneration Systems
Chapter 8
Krzysztof Ciesielski, Mieczyslaw A. Klopotek, Slawomir T. Wierzchon
In this chapter the authors discuss an application of an immune-based algorithm for extraction and visualization of clusters structure in large... Sample PDF
Applying the Immunological Network Concept to Clustering Document Collections
Chapter 9
Xiangrong Zhang, Fang Liu
The problem of feature selection is fundamental in various tasks like classification, data mining, image processing, conceptual learning, and so on.... Sample PDF
Feature Selection Based on Clonal Selection Algorithm: Evaluation and Application
Chapter 10
Yong-Sheng Ding, Xiang-Feng Zhang, Li-Hong Ren
Future Internet should be capable of extensibility, survivability, mobility, and adaptability to the changes of different users and network... Sample PDF
Immune Based Bio-Network Architecture and its Simulation Platform for Future Internet
Chapter 11
Tao Gong
Static Web immune system is an important applicatiion of artificial immune system, and it is also a good platform to develop new immune computing... Sample PDF
A Static Web Immune System and Its Robustness Analysis
Chapter 12
Alexander O. Tarakanov
Based on mathematical models of immunocomputing, this chapter describes an approach to spatio-temporal forecast (STF) by intelligent signal... Sample PDF
Immunocomputing for Spatio-Temporal Forecast
Chapter 13
Fu Dongmei
In engineering application, the characteristics of the control system are entirely determined by the system controller once the controlled object... Sample PDF
Research of Immune Controllers
Chapter 14
Xiaojun Bi
In fact, image segmentation can be regarded as a constrained optimization problem, and a series of optimization strategies can be used to complete... Sample PDF
Immune Programming Applications in Image Segmentation
Chapter 15
Xin Wang, Wenjian Luo, Zhifang Li, Xufa Wang
A hardware immune system for the error detection of MC8051 IP core is designed in this chapter. The binary string to be detected by the hardware... Sample PDF
A Hardware Immune System for MC8051 IP Core
Chapter 16
Mark Burgin, Eugene Eberbach
There are different models of evolutionary computations: genetic algorithms, genetic programming, etc. This chapter presents mathematical... Sample PDF
On Foundations of Evolutionary Computation: An Evolutionary Automata Approach
Chapter 17
Terrence P. Fries
Path planning is an essential component in the control software for an autonomous mobile robot. Evolutionary strategies are employed to determine... Sample PDF
Evolutionary Path Planning for Robot Navigation Under Varying Terrain Conditions
Chapter 18
Konstantinos Konstantinidis, Georgios Ch. Sirakoulis, Ioannis Andreadis
The aim of this chapter is to provide the reader with a Content Based Image Retrieval (CBIR) system which incorporates AI through ant colony... Sample PDF
Ant Colony Optimization for Use in Content Based Image Retrieval
Chapter 19
Miroslav Bursa, Lenka Lhotska
The chapter concentrates on the use of swarm intelligence in data mining. It focuses on the problem of medical data clustering. Clustering is a... Sample PDF
Ant Colonies and Data Mining
Chapter 20
Bo-Suk Yang
This chapter describes a hybrid artificial life optimization algorithm (ALRT) based on emergent colonization to compute the solutions of global... Sample PDF
Artificial Life Optimization Algorithm and Applications
Chapter 21
Martin Macaš, Lenka Lhotská
A novel binary optimization technique is introduced called Social Impact Theory based Optimizer (SITO), which is based on social psychology model of... Sample PDF
Optimizing Society: The Social Impact Theory Based Optimizer
Chapter 22
James F. Peters, Shabnam Shahfar
The problem considered in this chapter is how to use the observed behavior of organisms as a basis for machine learning. The proposed approach for... Sample PDF
Ethology-Based Approximate Adaptive Learning: A Near Set Approach
Chapter 23
Dingju Zhu
Parallel computing is more and more important for science and engineering, but it is not used so widely as serial computing. People are used to... Sample PDF
Nature Inspired Parallel Computing
Chapter 24
Tang Mo, Wang Kejun, Zhang Jianmin, Zheng Liying
An understanding of the human brain’s local function has improved in recent years. But the cognition of human brain’s working process as a whole is... Sample PDF
Fuzzy Chaotic Neural Networks
About the Contributors