Text Document Representation
In a text clustering system, the similarity between documents depends strongly on the method chosen to represent those documents. This representation therefore imposes a model of information extraction. In 1958, Luhn, one of the pioneers of research on Information Retrieval Systems, laid down in (Luhn, 1958) the fundamental assumption underlying work on the extraction and selection of information: “the textual content of a document discriminates the type and the value of the information it conveys”. Nearly all current systems rely on this principle.
To apply any clustering method, texts must first be transformed into an efficient and meaningful representation so that they can be analyzed.
The vector space model is the most widely used approach to represent textual documents. Each document dj is transformed into a vector:

dj = (w1j, w2j, ..., w|T|j)

where T is the set of terms that appear at least once in the corpus (|T| is the size of the vocabulary), and wkj represents the weight (frequency or importance) of the term tk in the document dj.
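As a minimal sketch of this representation, the following Python function (the name `build_vectors` and the raw-frequency weighting are illustrative choices, not from the original) turns a list of tokenized documents into frequency vectors over the corpus vocabulary T:

```python
from collections import Counter

def build_vectors(docs):
    """Represent each tokenized document as a vector over the vocabulary T.

    docs: list of token lists. Returns the sorted vocabulary and one
    raw-frequency vector per document (w_kj = frequency of t_k in d_j).
    """
    # T: every term that appears at least once in the corpus
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = [["text", "clustering", "text"], ["vector", "clustering"]]
T, vecs = build_vectors(docs)
# |T| = 3; each document becomes a vector of length 3
```

Here the weights are raw frequencies; the TF×IDF and TFC measures discussed below refine this weighting.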
There are various methods to compute the weight wkj, knowing that, for each term, it is possible to count not only its frequency but also the number of documents that contain it. Most approaches (Sebastiani, 2002) are based on a vector representation of texts using the TF×IDF measure. The frequency TF of a term tk in a document is its number of occurrences in that document. The inverse document frequency IDF of a term tk decreases as the number of documents containing tk grows. These two notions are combined (by product) in order to assign a stronger weight to terms that appear often in a document and rarely in the rest of the corpus:
TF×IDF(tk, dj) = Occ(tk, dj) × log(Nb_doc / Nb_doc(tk))

where Occ(tk, dj) is the number of occurrences of the term tk in the document dj, Nb_doc is the total number of documents in the corpus, and Nb_doc(tk) is the number of documents of the corpus in which the term tk appears at least once.

There is another weighting measure, called TFC, similar to TF×IDF, which corrects for the lengths of the texts by a cosine normalization, to avoid giving more credit to the longest documents:

TFC(tk, dj) = TF×IDF(tk, dj) / sqrt( Σ s=1..|T| (TF×IDF(ts, dj))² )
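A compact sketch of both weightings, under the definitions given above (the function name `tfidf_vectors` is illustrative; the formulas are the standard Occ × log(Nb_doc / Nb_doc(tk)) product followed by division by the Euclidean norm):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TFxIDF weights, then apply TFC cosine normalization.

    docs: list of tokenized documents. Returns the sorted vocabulary
    and one normalized weight vector per document.
    """
    vocabulary = sorted({t for d in docs for t in d})
    nb_doc = len(docs)
    # Nb_doc(t_k): number of documents containing t_k at least once
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    weighted = []
    for doc in docs:
        occ = Counter(doc)
        # TFxIDF: Occ(t_k, d_j) * log(Nb_doc / Nb_doc(t_k))
        w = [occ[t] * math.log(nb_doc / df[t]) for t in vocabulary]
        # TFC: divide by the Euclidean norm so longer documents
        # gain no advantage from sheer length
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        weighted.append([x / norm for x in w])
    return vocabulary, weighted
```

Note that with this plain logarithm, a term occurring in every document gets weight zero (log 1 = 0), which is precisely the intended discounting of terms that do not discriminate between documents.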