Researchers have used benchmark data, such as the Reuters-21578 newswire test collection (Weiss, S. M., Indurkhya, N., Zhang, T. and Damerau, F. 2010), to measure advances in automated text classification. We tested our system on a sample of this corpus.
Modules of Execution
1. Document entry
2. Stop word removal
3. Stemming
4. Keyword generation
5. Document classification
Doc_id: DOC1
Doc_content:
“The hard problem of the Text Classification usually has various aspects and Potential solutions. Keyword extraction and maximal frequent item set can be used as attributes for mining association rules or as a basis for measuring the similarity of new documents with existing association rules. The issue of keyword extraction from text collection is an emerging research field. It also promotes maximal frequent item set generation.”
We use whitespace as the delimiter to tokenize a document string. A tokenized document contains only language-specific alphabetic characters in lower case; all unnecessary characters, such as “,”, are removed from the token list. Table 1 shows that the tokenization process not only splits the text into words but also converts every token to lowercase. All tokenized words then undergo stop word removal.
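As an illustration, this tokenization step can be sketched in Python as follows; the function name and the exact character filter are our assumptions for illustration, not the system's actual implementation:

import re

def tokenize(document):
    # Lowercase the string and split it on whitespace, then strip every
    # character that is not a lowercase Latin letter, so that unnecessary
    # characters such as "," are removed.
    tokens = []
    for raw_token in document.lower().split():
        token = re.sub(r"[^a-z]", "", raw_token)
        if token:  # discard tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

print(tokenize("The hard problem of the Text Classification,"))
# ['the', 'hard', 'problem', 'of', 'the', 'text', 'classification']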
Many stop words exist in the above document. To purge them, a list of predefined stop words must first be developed. The program then identifies and removes all stop words in the document based on this predefined list. Table 2 displays the list of stop words.
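A minimal sketch of this step, again in Python; the stop word set below is a small illustrative subset, not the full predefined list shown in Table 2:

# Illustrative subset of a predefined stop-word list; the actual
# list used by the system (Table 2) is larger.
STOP_WORDS = {"the", "of", "and", "as", "or", "a", "an",
              "is", "it", "can", "be", "also", "with", "from"}

def remove_stop_words(tokens):
    # Keep only the tokens that are absent from the predefined list.
    return [token for token in tokens if token not in STOP_WORDS]

tokens = ["the", "hard", "problem", "of", "the", "text", "classification"]
print(remove_stop_words(tokens))
# ['hard', 'problem', 'text', 'classification']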