Some Efficient and Fast Approaches to Document Clustering

P. Viswanth (Indian Institute of Technology Guwahati, India)
Copyright: © 2009 | Pages: 8
DOI: 10.4018/978-1-59904-990-8.ch011


Clustering is the process of finding the natural groupings present in a dataset. Various clustering methods have been proposed to work with various types of data. Both the quality of the solution and the time taken to derive it are important when dealing with large datasets such as a typical document database. Recently, hybrid and ensemble-based clustering methods have been shown to yield better results than conventional methods. The chapter proposes two clustering methods: one based on a hybrid scheme and the other on an ensemble scheme. Both are experimentally verified and shown to yield better results in less time.
Chapter Preview


Clustering is the process of finding groups, called clusters, in a given data set such that the data items within a cluster are similar to each other, whereas those in different clusters are dissimilar. Many clustering methods are applied across various fields, using a variety of similarity measures (Jain, Murty & Flynn, 1999). Even though the problem seems simple and is relatively old, it remains an active research area, and it has been shown that no clustering method can satisfy a certain set of simple properties simultaneously (Kleinberg, 2002). A clustering method that works well in one field need not work well in another.

Key Terms in this Chapter

Partition of a Dataset: A collection of subsets of the dataset such that every pair of distinct subsets is disjoint and the union of the collection equals the dataset.

Prototype: A representative pattern from the given dataset.

Consensus Function: A mapping from a set of solutions (which might be intermediate solutions) to a single final solution. A solution can be a clustering of a dataset, a classification decision by a classifier, etc.
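
As a minimal illustration of the consensus function definition, the sketch below maps several classification decisions to one final decision by majority vote. Majority voting is only one possible consensus function and is an assumption made here for illustration, as is the name `majority_vote`; it is not necessarily the scheme used in the chapter.

```python
from collections import Counter


def majority_vote(decisions):
    """Map a list of class labels (one per classifier) to a single label.

    Hypothetical helper: ties are resolved in favour of the label
    that appears first in the input list.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    counts = Counter(decisions)
    # most_common(1) returns the label with the highest count
    label, _count = counts.most_common(1)[0]
    return label


if __name__ == "__main__":
    # Three classifiers vote on the class of one document.
    print(majority_vote(["sports", "politics", "sports"]))
```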

Clustering of a Dataset: A collection of subsets of the dataset such that their union equals the dataset. Unlike a partition, the subsets need not be disjoint.
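
The definitions of a partition and a clustering of a dataset can be checked mechanically. The following sketch, with the hypothetical helper names `is_clustering` and `is_partition`, assumes the data items are hashable so that they can be stored in Python sets.

```python
def is_clustering(subsets, dataset):
    """True if the union of the subsets equals the dataset."""
    union = set()
    for s in subsets:
        union |= set(s)
    return union == set(dataset)


def is_partition(subsets, dataset):
    """True if distinct subsets are pairwise disjoint and cover the dataset."""
    # Treat the collection as a set of subsets, so repeated copies
    # of the same subset count only once.
    distinct = {frozenset(s) for s in subsets}
    # If the subsets cover the dataset and their sizes add up to the
    # dataset size, no item can belong to two distinct subsets.
    total = sum(len(s) for s in distinct)
    return is_clustering(distinct, dataset) and total == len(set(dataset))


if __name__ == "__main__":
    data = {1, 2, 3, 4}
    print(is_partition([{1, 2}, {3, 4}], data))      # disjoint cover
    print(is_partition([{1, 2}, {2, 3, 4}], data))   # item 2 is shared
    print(is_clustering([{1, 2}, {2, 3, 4}], data))  # overlap is allowed
```

Under these definitions every partition is a clustering, while a clustering with overlapping subsets is not a partition.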

Document: A sequence of words. When each word is treated as a feature and its frequency as the feature value, a document can be represented as a vector of frequencies, which can be seen as a pattern (see the definition of pattern below).

Pattern: An object, either physical or abstract, that can be represented using a set of feature values. Normally, a pattern is seen as a point in a feature space.
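
To make the document and pattern definitions concrete, the sketch below turns a document into a vector of word frequencies over a fixed vocabulary, so that the document becomes a point in a feature space. The function name `term_frequency_vector`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the chapter's own preprocessing is not described in this preview.

```python
from collections import Counter


def term_frequency_vector(document, vocabulary):
    """Represent a document as a list of word frequencies.

    Each word in `vocabulary` is one feature; the feature value is
    the number of times that word occurs in the document.
    """
    # Assumed tokenization: lowercase, split on whitespace.
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    vocabulary = ["cluster", "data", "method"]
    print(term_frequency_vector("Data cluster data", vocabulary))
```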
