N-Clustering of Text Documents Using Graph Mining Techniques

Bapuji Rao
DOI: 10.4018/978-1-7998-3479-3.ch057
This chapter describes the clustering of text documents, using graph mining techniques, based on n input words applied to a collection of m documents. The author proposes an algorithm that first selects the documents with the “.txt” extension from the m documents, which may have various extensions. The n words are then input against the selected “.txt” documents, and the algorithm clusters the documents according to those words. To do so, it builds a document-word frequency matrix in memory and converts it into the un-oriented document-word incidence matrix by replacing every non-zero entry with 1. Using this incidence matrix, the algorithm creates n clusters of text documents in which the number of input words present ranges from 1 to n, respectively. Finally, the clusters are organized both word-wise and by the number of input words present, from 1 to n.
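The steps outlined above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the toy file names and contents, the simple lowercase tokenizer, and the reading that cluster k holds the documents containing exactly k of the n input words are all assumptions made for illustration.

```python
import re
from pathlib import Path


def select_txt(paths):
    """Keep only the documents whose extension is .txt."""
    return [p for p in paths if Path(p).suffix.lower() == ".txt"]


def frequency_matrix(texts, words):
    """Rows are documents, columns are input words, cells are word counts."""
    matrix = []
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        matrix.append([tokens.count(w.lower()) for w in words])
    return matrix


def incidence_matrix(freq):
    """Replace every non-zero frequency with 1 and leave zeros as 0."""
    return [[1 if cell else 0 for cell in row] for row in freq]


def clusters_by_word_count(names, incidence):
    """Assumption: cluster k holds documents containing exactly k input words."""
    n = len(incidence[0]) if incidence else 0
    clusters = {k: [] for k in range(1, n + 1)}
    for name, row in zip(names, incidence):
        k = sum(row)
        if k:  # documents with none of the words join no cluster
            clusters[k].append(name)
    return clusters


# Toy in-memory collection (hypothetical file names and contents).
docs = {
    "a.txt": "graph mining of a graph",
    "b.txt": "text clustering",
    "c.pdf": "graph text",
    "d.txt": "graph text mining matrix",
}
words = ["graph", "text", "mining"]

names = select_txt(docs)                              # drops c.pdf
freq = frequency_matrix([docs[n] for n in names], words)
inc = incidence_matrix(freq)
clusters = clusters_by_word_count(names, inc)
print(clusters)
```

In this toy run, "b.txt" contains one of the three words, "a.txt" two, and "d.txt" all three, so each lands in a different cluster.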
Literature Survey

The Scatter-Gather method proposed by Cutting, Karger, Pedersen, and Tukey (1992) organizes documents hierarchically into coherent categories. This clustered organization of the document collection supports systematic browsing.

Key Terms in this Chapter

Un-Oriented Documents-Words Incidence Matrix: A matrix obtained from the documents-words frequency matrix by replacing every non-zero entry with 1, so that a 1 indicates the presence of a word in a document.

Bi-Partite Graph: A graph with two types of nodes, viz. word nodes and document nodes. An edge between a word node and a document node indicates that the word occurs in that document.

Cluster: A set of related documents that contain a particular word.

Documents-Words Frequency Matrix: A matrix whose entries are the frequencies of the words in the documents.
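The key terms can be related to one another with a small Python sketch. The names `bipartite_edges` and `word_wise_clusters`, the document labels, and the frequency values are hypothetical and chosen only for illustration. The sketch assumes the bi-partite graph is stored as a list of (document, word) edges.

```python
def bipartite_edges(doc_names, words, incidence):
    """One (document, word) edge for every 1 in the incidence matrix."""
    return [
        (doc, word)
        for doc, row in zip(doc_names, incidence)
        for word, cell in zip(words, row)
        if cell == 1
    ]


def word_wise_clusters(words, edges):
    """Cluster per word: the documents joined to that word node by an edge."""
    clusters = {w: [] for w in words}
    for doc, word in edges:
        clusters[word].append(doc)
    return clusters


# Hypothetical documents-words frequency matrix (rows: documents, columns: words).
doc_names = ["d1", "d2", "d3"]
words = ["graph", "text"]
frequency = [
    [3, 0],
    [1, 2],
    [0, 0],
]

# Un-oriented incidence matrix: every non-zero frequency becomes 1.
incidence = [[1 if f > 0 else 0 for f in row] for row in frequency]

edges = bipartite_edges(doc_names, words, incidence)
word_clusters = word_wise_clusters(words, edges)
print(edges)
print(word_clusters)
```

Here "d3" contains neither word, so it has no edges and belongs to no cluster.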

