Implementation and Testing Details of Document Classification

DOI: 10.4018/978-1-7998-3772-5.ch010

Abstract

It is trivial to achieve a recall of 100% by returning all documents in response to any query. Recall alone is therefore not sufficient; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision. The analysis was performed on 30 documents to ensure the stability of the precision and recall values. It is observed that the precision for long documents is lower than for documents of moderate length, in the sense that some unimportant keywords get extracted. This may be attributed to such keywords occurring frequently despite playing no important role in the sentence.
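As a minimal illustration of the two measures, the sketch below computes precision and recall for a single query; the document-ID sets are hypothetical examples, not values from this analysis.

```python
# Minimal sketch of precision and recall for one query.
# The ID sets below are hypothetical examples, not data
# from the chapter's 30-document analysis.
retrieved = {"d01", "d02", "d03", "d04", "d05"}  # documents returned
relevant = {"d02", "d04", "d06"}                 # ground-truth relevant docs

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of returned docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were returned

print(f"precision = {precision:.2f}")  # 0.40
print(f"recall    = {recall:.2f}")     # 0.67
```

Returning every document would drive recall to 1.0 while precision collapses, which is exactly why the two measures are reported together.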

System Testing

Reuters Data Set

Researchers have used benchmark data, such as the Reuters-21578 newswire test collection (Weiss, Indurkhya, Zhang, & Damerau, 2010), to measure advances in automated text classification. We tested our system using a sample of this collection.

  • Modules of Execution
    1. Document entry
    2. Stop word removal
    3. Stemming
    4. Keyword generation
    5. Document classification

A rough sketch of how these five modules chain together is given after the sample document below.

  • Document Entry
  • Doc_id: DOC1
  • Doc_content:

“The hard problem of the Text Classification usually has various aspects and Potential solutions. Keyword extraction and maximal frequent item set can be used as attributes for mining association rules or as a basis for measuring the similarity of new documents with existing association rules. The issue of keyword extraction from text collection is an emerging research field. It also promotes maximal frequent item set generation.”
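The outline below runs a toy version of each of the five modules end to end. Every helper here is a deliberately trivial placeholder so the sketch executes; it is not the chapter's real implementation.

```python
# High-level skeleton of the five execution modules. Each helper is a
# simplified placeholder, not the chapter's actual code.

def tokenize(text):           # modules 1-2: whitespace split + lowercase
    return text.lower().split()

def remove_stop_words(toks):  # module 2: drop predefined stop words (cf. Table 2)
    stop = {"the", "of", "and", "is", "an", "it", "also"}
    return [t for t in toks if t not in stop]

def stem(tok):                # module 3: toy suffix stripping
    return tok[:-1] if tok.endswith("s") else tok

def generate_keywords(toks):  # module 4: here, just the distinct stems
    return sorted(set(toks))

def classify(keywords):       # module 5: placeholder class decision
    return "text mining" if "keyword" in keywords else "other"

doc_id = "DOC1"
doc = "The issue of keyword extraction is an emerging research field"
keywords = generate_keywords([stem(t) for t in remove_stop_words(tokenize(doc))])
print(doc_id, classify(keywords), keywords)
# DOC1 text mining ['emerging', 'extraction', 'field', 'issue', 'keyword', 'research']
```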

Stop Word Removal (Tokenize and Remove Stop Words)

We use whitespace as the delimiter to tokenize a document string. A tokenized document contains only language-specific alphabetic characters in lower case; all unnecessary characters, such as “,”, are removed from the list. Table 1 shows that the tokenization process not only splits the words but also converts every token to lowercase. All tokenized words then undergo the stop word removal process.

Many stop words exist in the above document. To purge them, a list of predefined stop words must first be developed. The program then identifies and removes all stop words in the document based on this predefined list. Table 2 displays the removed stop words.
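A minimal sketch of these two steps, assuming whitespace splitting, a regular expression to strip non-alphabetic characters, and the Table 2 list hard-coded as the predefined stop words:

```python
import re

# Predefined stop word list, reproduced from Table 2 below.
STOP_WORDS = {"the", "of", "usually", "has", "various", "and", "can", "be",
              "a", "used", "as", "or", "for", "with", "is", "an", "from",
              "it", "also"}

def tokenize(text):
    """Split on whitespace, lowercase, and keep alphabetic characters only."""
    tokens = []
    for raw in text.split():                       # whitespace as delimiter
        word = re.sub(r"[^a-z]", "", raw.lower())  # drop ',', '.', '"' etc.
        if word:
            tokens.append(word)
    return tokens

def remove_stop_words(tokens):
    """Purge every token that appears in the predefined stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The issue of keyword extraction from text collection is an emerging research field."
print(remove_stop_words(tokenize(sentence)))
# ['issue', 'keyword', 'extraction', 'text', 'collection',
#  'emerging', 'research', 'field']
```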

Table 1.
Words after tokenization
hard, problem, text, classification, aspects, potential, solution, keyword, extraction, maximal, frequent, item, set, used, attributes, mining, association, rules, basis, measuring, similarity, new, documents, existing, association, rules, issue, keyword, extraction, text, collection, emerging, research, field, promotes, maximal, frequent, item, set, generation
Table 2.
Removed set of stop words
the, of, usually, has, various, and, can, be, a, used, as, or, for, with, is, an, from, it, also
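The next module in the pipeline, stemming, is applied to the tokens that survive stop word removal. The chapter does not state which stemming algorithm its system uses, so the snippet below applies NLTK's Porter stemmer purely as one plausible choice for illustration.

```python
# Illustrative stemming of a few Table 1 tokens with NLTK's Porter
# stemmer. This is an assumption for demonstration; the chapter does
# not name the stemming algorithm its system uses.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for token in ["aspects", "measuring", "documents", "mining", "rules"]:
    print(token, "->", stemmer.stem(token))
# aspects -> aspect, measuring -> measur, documents -> document, ...
```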
