Does 3D Cellular Automata Using WordNet Can Improve Text Clustering?

Abdelmalek Amine, Reda Mohamed Hamou, Michel Simonet
Copyright: © 2014 | Pages: 13
DOI: 10.4018/IJDLS.2014070102

Abstract

With the abundance of text documents available through the Web and digital libraries, text clustering still plays an important role in information retrieval and in applications such as search result grouping and categorization, topic extraction, and content filtering. Several clustering methods have been applied to textual documents, with results that depend on the application and on the representation of the text. In this paper, an experiment with a 3D Cellular Automata approach using WordNet for text representation is proposed and compared with several text clustering algorithms to see whether this approach provides better results.

Text Document Representation

In a text clustering system, the similarity between documents depends strongly on how the documents are represented. The representation thus imposes a model of information extraction. As early as 1958, Luhn, one of the pioneers of research on information retrieval systems, established (Luhn, 1958) the fundamental assumption underlying work on the extraction and selection of information: “the textual contents of a document discriminates the type and the value of information which it conveys”. Nearly all current systems are based on this principle.

To implement any clustering method, texts must be transformed in an efficient and meaningful way so that they can be analyzed.

The vector space model is the most widely used approach for representing textual documents. Each document $d_j$ is transformed into a vector $d_j = (w_{1j}, w_{2j}, \ldots, w_{|T|j})$, where $T$ is the set of all terms that appear at least once in the corpus ($|T|$ is the size of the vocabulary) and $w_{kj}$ is the weight (frequency or importance) of the term $t_k$ in the document $d_j$.
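
As a minimal sketch of this representation, the Python snippet below builds the vocabulary T and one vector per document, taking the raw term count as the weight. The function name `build_vectors`, the whitespace tokenization, and the two toy documents are illustrative assumptions, not part of the original article.

```python
from collections import Counter


def build_vectors(docs):
    """Map each tokenized document to a vector of term counts over the vocabulary T."""
    # T: every term that appears at least once in the corpus, in a fixed (sorted) order
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumption: the weight w_kj is simply the raw count of term t_k in document d_j
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical toy corpus, tokenized on whitespace
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
vocabulary, vectors = build_vectors(docs)
```

Every vector has length |T|, so documents of different lengths become directly comparable once a weighting scheme and a similarity measure are chosen.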

There are various methods for calculating the weight $w_{kj}$, since for each term it is possible to compute not only its frequency in the corpus but also the number of documents that contain it. Most approaches (Sebastiani, 2002) rely on a vector representation of texts using the TF×IDF measure. The term frequency TF of a term $t_k$ in a document $d_j$ is the number of occurrences of $t_k$ in $d_j$. The inverse document frequency IDF of $t_k$ is derived from the number of documents in the corpus that contain $t_k$: the fewer documents contain the term, the higher its IDF. The two are combined (by product) in order to assign a stronger weight to terms that appear often in a document and rarely in the complete corpus:

$$\mathrm{TF{\times}IDF}(t_k, d_j) = \mathrm{Occ}(t_k, d_j) \times \log\frac{\mathrm{Nb\_doc}}{\mathrm{Nb\_doc}(t_k)}$$
where Occ($t_k$, $d_j$) is the number of occurrences of the term $t_k$ in the document $d_j$, Nb_doc is the total number of documents in the corpus, and Nb_doc($t_k$) is the number of documents in the corpus in which the term $t_k$ appears at least once. Another weighting measure, TFC, is similar to TF×IDF but corrects for text length through cosine normalization, to avoid giving more credit to the longest documents.
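
The two weightings can be sketched in Python as follows. This is only an illustration: it assumes the common logarithmic form Occ × log(Nb_doc / Nb_doc(t_k)) for TF×IDF and division by the Euclidean norm of the document's TF×IDF weights for TFC, which may differ in detail from the authors' definitions. The function names `tfidf_weights` and `tfc_weights` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: TFxIDF weight} dict per tokenized document."""
    nb_doc = len(docs)
    # Nb_doc(t_k): number of documents in which t_k appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        occ = Counter(doc)  # Occ(t_k, d_j)
        # Assumed form: Occ(t_k, d_j) * log(Nb_doc / Nb_doc(t_k))
        weights.append({t: n * math.log(nb_doc / doc_freq[t]) for t, n in occ.items()})
    return weights


def tfc_weights(docs):
    """Cosine-normalize each document's TFxIDF weights (assumed form of TFC)."""
    normalized = []
    for w in tfidf_weights(docs):
        norm = math.sqrt(sum(v * v for v in w.values()))
        # If every term of a document occurs in all documents, the norm is zero
        normalized.append({t: (v / norm if norm else 0.0) for t, v in w.items()})
    return normalized
```

Under these assumptions, a document and a copy of it repeated twice receive identical TFC weights, which is the length correction the text describes.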
