Revealing Groups of Semantically Close Textual Documents by Clustering: Problems and Possibilities

František Dařena, Jan Žižka

Source Title: Modern Computational Models of Semantic Discovery in Natural Language

DOI: 10.4018/978-1-4666-8690-8.ch004

Abstract

The chapter introduces clustering as a family of algorithms that can be successfully used to organize text documents into groups without prior knowledge of these groups. The chapter also demonstrates the use of unsupervised clustering to group large amounts of unlabeled textual data (customer reviews written informally in five natural languages) so that the data can be used later for further analysis. Attention is also paid to the process of selecting clustering algorithms and their parameters, to methods of data preprocessing, and to methods of evaluating the results by a human expert with the assistance of computers. The feasibility has been demonstrated by a number of experiments with external evaluation using known labels and with computer-assisted expert validation. It has been found that the same procedures, including clustering, cluster validation, and detection of topics and significant words, can be applied to different natural languages with satisfactory results.

Introduction

People and companies have many opportunities to express their opinions on a wide variety of topics. The media used for such communication include personal web pages and blogs, social networks, discussion boards, e-mail, instant messages, and others. Various subjects can benefit from the high availability of information, which also demands greater involvement, knowledge, and information-processing and decision-making skills. Because huge volumes of data are often freely available on many different subjects, there is a need for approaches that make it possible to use the data for decision making. Since most of the data is available in an unstructured textual form, disciplines focusing on this type of data have gained in significance during the last few years (Miner et al., 2012).

Because of the excessive time and effort that would be needed to reveal the knowledge hidden in the data, the processing often cannot be done manually by humans. Instead, the application of computer-based automated methods is a more desirable choice. This is enabled by the increased computational speed and memory sizes of ordinary computers as well as by the development of new algorithms that are able to address various needs and problems. Unlike a traditional methodology that employs human operators to read the documents, statistical analysis, and data mining techniques based on the non-linguistic structure of the documents (Dini & Mazzini, 2010), intelligent computer-based analysis, called text mining, might arrive at new and unforeseen results.

Text mining is a branch of computer science that uses techniques from data mining, information retrieval, machine learning, statistics, natural language processing, and knowledge management (Berry & Kogan, 2010). The greatest potential of text mining applications is in areas where large quantities of textual data are generated and collected. These areas include, among others, categorization of newspaper articles or web pages, e-mail filtering, organization of a library, customer complaints (or feedback) handling, marketing focus group programs, competitive intelligence, market prediction, extraction of topic trends in text streams, discovering semantic relations between events, and customer satisfaction analysis (Cao et al., 2014; Koteswara Rao & Dey, 2011; Miner et al., 2012; Nassirtoussi, 2014; Weiss et al., 2010). Text mining involves tasks such as text categorization, term extraction, single- or multi-document summarization, clustering, association rule mining, and sentiment analysis (Feldman & Sanger, 2007).

At the end of the last century, machine learning gained in popularity and became a dominant approach to text mining (Sebastiani, 2002). Machine learning is a discipline that focuses on modifying or adapting computer behavior based on past experience (the data, in this case) so that the behavior improves in the future. How this adaptation proceeds depends on whether the correct behavior is specified. If it is, a set of examples with correct answers (actions) is provided; this case is called supervised learning. During the learning process, a computer tries to generalize the knowledge so that it can react correctly to all inputs, even previously unseen ones. When the correct responses are not provided, a computer tries to find patterns based on similarities between the inputs. This approach is known as unsupervised learning (Marsland, 2009). The common goal of both approaches is to achieve accuracy comparable to that achieved by human experts.
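The difference between the two settings can be illustrated with a toy sketch. It is not the chapter's method: the bag-of-words representation, the cosine similarity, the k-means-style loop, the function names, and the four sample reviews are all assumptions chosen for brevity. The supervised part builds one centroid per known label, while the unsupervised part groups the same texts using only their mutual similarity.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of the vectors; scaling does not matter for cosine similarity."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def train_supervised(texts, labels):
    """Supervised: correct answers are given, so build one centroid per label."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(vectorize(text))
    return {label: centroid(vs) for label, vs in groups.items()}

def predict(model, text):
    """Assign a previously unseen text to the label with the most similar centroid."""
    v = vectorize(text)
    return max(model, key=lambda label: cosine(model[label], v))

def cluster_unsupervised(texts, k, iterations=10):
    """Unsupervised: no labels, group texts by similarity (k-means-style loop)."""
    vectors = [vectorize(t) for t in texts]
    # Deterministic farthest-first seeding (an assumption made for reproducibility).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(c, v) for c in centroids)))
    assignment = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid, then recompute centroids.
        assignment = [max(range(k), key=lambda i: cosine(centroids[i], v)) for v in vectors]
        for i in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:
                centroids[i] = centroid(members)
    return assignment

# Hypothetical sample data, invented for illustration only.
texts = ["great room clean room", "clean room great staff",
         "terrible food cold food", "cold food rude staff"]
labels = ["positive", "positive", "negative", "negative"]

model = train_supervised(texts, labels)
print(predict(model, "clean room"))       # positive
print(cluster_unsupervised(texts, k=2))   # [0, 0, 1, 1]
```

The cluster numbers returned by the unsupervised function carry no meaning by themselves; deciding what each group represents is the kind of validation step that requires known labels or a human expert.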
