Implementation and Testing Details of Document Classification

DOI: 10.4018/978-1-7998-3772-5.ch010

Abstract

It is trivial to achieve a recall of 100% by returning all documents in response to any query. Recall alone is therefore not sufficient; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision. The analysis was performed on 30 documents to ensure the stability of the precision and recall values. It is observed that the precision for long documents is lower than for documents of moderate length, in the sense that some unimportant keywords get extracted. This may be attributed to such keywords occurring frequently despite playing no important role in the sentence.
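As a minimal illustration of the two measures, the sketch below computes precision and recall for a single query; the document-ID sets are hypothetical examples, not values from this analysis.

```python
# Minimal sketch of precision and recall for one query.
# The ID sets below are hypothetical examples, not data
# from the chapter's 30-document analysis.
retrieved = {"d01", "d02", "d03", "d04", "d05"}  # documents returned
relevant = {"d02", "d04", "d06"}                 # ground-truth relevant docs

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of returned docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were returned

print(f"precision = {precision:.2f}")  # 0.40
print(f"recall    = {recall:.2f}")     # 0.67
```

Returning every document would drive recall to 1.0 while precision collapses, which is exactly why the two measures are reported together.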

System Testing

Reuters Data Set

Researchers have used benchmark data, such as the Reuters-21578 newswire test collection (Weiss, Indurkhya, Zhang, & Damerau, 2010), to measure advances in automated text classification. We tested our system using a sample of this collection.

  • Modules of Execution
    1. Document entry
    2. Stop word removal
    3. Stemming
    4. Keyword generation
    5. Document classification

A rough sketch of how these five modules chain together is given after the sample document below.

  • Document Entry
  • Doc_id: DOC1
  • Doc_content:

“The hard problem of the Text Classification usually has various aspects and Potential solutions. Keyword extraction and maximal frequent item set can be used as attributes for mining association rules or as a basis for measuring the similarity of new documents with existing association rules. The issue of keyword extraction from text collection is an emerging research field. It also promotes maximal frequent item set generation.”
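The outline below runs a toy version of each of the five modules end to end. Every helper here is a deliberately trivial placeholder so the sketch executes; it is not the chapter's real implementation.

```python
# High-level skeleton of the five execution modules. Each helper is a
# simplified placeholder, not the chapter's actual code.

def tokenize(text):           # modules 1-2: whitespace split + lowercase
    return text.lower().split()

def remove_stop_words(toks):  # module 2: drop predefined stop words (cf. Table 2)
    stop = {"the", "of", "and", "is", "an", "it", "also"}
    return [t for t in toks if t not in stop]

def stem(tok):                # module 3: toy suffix stripping
    return tok[:-1] if tok.endswith("s") else tok

def generate_keywords(toks):  # module 4: here, just the distinct stems
    return sorted(set(toks))

def classify(keywords):       # module 5: placeholder class decision
    return "text mining" if "keyword" in keywords else "other"

doc_id = "DOC1"
doc = "The issue of keyword extraction is an emerging research field"
keywords = generate_keywords([stem(t) for t in remove_stop_words(tokenize(doc))])
print(doc_id, classify(keywords), keywords)
# DOC1 text mining ['emerging', 'extraction', 'field', 'issue', 'keyword', 'research']
```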

Stop Word Removal (Tokenize and Remove Stop Words)

We use whitespace as the delimiter to tokenize a document string. A tokenized document contains only language-specific alphabetic characters in lower case; all unnecessary characters, such as “,”, are removed from the list. Table 1 shows that the tokenization process not only splits the words but also converts every token to lowercase. All tokenized words then undergo the stop word removal process.

Many stop words exist in the above document. To purge them, a list of predefined stop words must first be developed. The program then identifies and removes all stop words in the document based on this predefined list. Table 2 displays the removed stop words.
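A minimal sketch of these two steps, assuming whitespace splitting, a regular expression to strip non-alphabetic characters, and the Table 2 list hard-coded as the predefined stop words:

```python
import re

# Predefined stop word list, reproduced from Table 2 below.
STOP_WORDS = {"the", "of", "usually", "has", "various", "and", "can", "be",
              "a", "used", "as", "or", "for", "with", "is", "an", "from",
              "it", "also"}

def tokenize(text):
    """Split on whitespace, lowercase, and keep alphabetic characters only."""
    tokens = []
    for raw in text.split():                       # whitespace as delimiter
        word = re.sub(r"[^a-z]", "", raw.lower())  # drop ',', '.', '"' etc.
        if word:
            tokens.append(word)
    return tokens

def remove_stop_words(tokens):
    """Purge every token that appears in the predefined stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The issue of keyword extraction from text collection is an emerging research field."
print(remove_stop_words(tokenize(sentence)))
# ['issue', 'keyword', 'extraction', 'text', 'collection',
#  'emerging', 'research', 'field']
```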

Table 1.
Words after tokenization
hard, problem, text, classification, aspects, potential, solution, keyword, extraction, maximal, frequent, item, set, used, attributes, mining, association, rules, basis, measuring, similarity, new, documents, existing, association, rules, issue, keyword, extraction, text, collection, emerging, research, field, promotes, maximal, frequent, item, set, generation
Table 2.
Removed set of stop words
the, of, usually, has, various, and, can, be, a, used, as, or, for, with, is, an, from, it, also
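The next module in the pipeline, stemming, is applied to the tokens that survive stop word removal. The chapter does not state which stemming algorithm its system uses, so the snippet below applies NLTK's Porter stemmer purely as one plausible choice for illustration.

```python
# Illustrative stemming of a few Table 1 tokens with NLTK's Porter
# stemmer. This is an assumption for demonstration; the chapter does
# not name the stemming algorithm its system uses.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for token in ["aspects", "measuring", "documents", "mining", "rules"]:
    print(token, "->", stemmer.stem(token))
# aspects -> aspect, measuring -> measur, documents -> document, ...
```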
