Mining Text with the Prototype-Matching Method

A. Durfee (Appalachian State University, USA), A. Visa (Tampere University of Technology, Finland), H. Vanharanta (Tampere University of Technology, Finland), S. Schneberger (Appalachian State University, USA) and B. Back (Åbo Akademi University, Finland)
DOI: 10.4018/978-1-60566-128-5.ch020

Text documents are the most common means for exchanging formal knowledge among people. Text is a rich medium that can contain a vast range of information, but text can be difficult to decipher automatically. Many organizations have vast repositories of textual data but few means of automatically mining that text. Text mining methods seek to use an understanding of natural language text to extract information relevant to user needs. This article evaluates a new text mining methodology, prototype-matching for text clustering, developed by the authors’ research group. The methodology was applied in four applications: clustering documents based on their abstracts, analyzing financial data, distinguishing authorship, and evaluating multiple translation similarity. The results are discussed in terms of common business applications and possible future research.
Chapter Preview


It can be argued that computers are now used more for storing and retrieving data than for computing with it. Organizational computer systems are used for maintaining inventory, production, marketing, financial, sales, accounting, personnel, customer, and other types of data. With enterprise systems, vast amounts of corporate data can be stored digitally and made available to employees when and where needed. Data mining software is often used to glean further information from corporate databases.

Much transactional corporate data is numeric, but not all of it. Indeed, it is often stated that about 80% of corporate information is textual or otherwise unstructured (see, for example, Chen, 2001, and Robb, 2004). An entire information systems specialty, knowledge management, involves collecting, storing, organizing, evaluating, and using textual data, such as the vast repositories of written reports common at consulting agencies.

The World Wide Web gives corporate users access to worldwide databases of textual data. Just one of hundreds of online article databases, the Education Resources Information Center (ERIC), has more than 1.2 million citations and 110,000 full-text articles. Another, HighWire Press, has more than 1.3 million full-text articles. Internal and external data sources offer extensive decision support for managers in dynamic, complex, and demanding business environments. But how can managers, decision makers, and knowledge workers find appropriate textual content among billions of words in internal and external document repositories when it is virtually impossible to do so manually? According to a Gartner Group survey, 75% of managers spend more than an hour per day just sorting their e-mails (Marino, 2001).

Compounding the problem, text by its very nature can have multiple meanings and interpretations. The structure of text is not only complex but also not always obvious. Even the author of a text might not know the extent of what might be interpreted from it. These features make text a very rich medium for conveying a wide range of meanings, but also very difficult to manage, analyze, and mine using computers (Nasukawa & Nagano, 2001). Therein lies the conundrum: there is too much internal and external text to mine manually, yet it is difficult for computer software to interpret text correctly, let alone create knowledge from it.

Text mining (TM) seeks a remedy for that problem. TM aims to extract high-level knowledge and useful patterns from low-level textual data. Text mining tools seek to analyze and learn the meaning of implicitly structured information automatically (Dorre, Gerstl, & Seiffert, 1999). There are two broad categories of text mining: text categorization and text clustering.

Text categorization analyzes text using pre-determined structures or words (i.e., keywords). It is a framework-driven approach, usually based on earlier analysis or expectations. Authors, readers, and librarians may introduce and use keywords, indexes, or mark-ups to outline the main ideas, concepts and themes within a text to make textual searches easier for computers (Anderson, 1999; Chieng, 1997; Lahtinen, 2000; Salton, 1989; Weiss, White, Apte, & Damerau, 2000). However, authors and textual information users can assign different keywords to the same text, or even ascribe different meanings to the same keywords—possibly defeating the speed and accuracy of computer-based textual keyword searches. Readers need only consider their own wayward searches using keyword-based online search engines to understand the depth and breadth of the problem.
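
The keyword problem described above can be illustrated with a minimal sketch. It is not the authors' method or any particular product's behavior; the category names, keyword sets, and sample sentences are all hypothetical, and matching is a deliberately naive whole-word lookup. It only shows how a text that discusses a topic in different vocabulary escapes a pre-determined keyword scheme.

```python
import re

# Hypothetical category-to-keyword map, invented for illustration only.
CATEGORY_KEYWORDS = {
    "finance": {"revenue", "profit", "dividend"},
    "personnel": {"hiring", "salary", "employee"},
}

def categorize(text, category_keywords=CATEGORY_KEYWORDS):
    """Return the sorted categories whose keywords occur as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, keywords in category_keywords.items()
        if words & keywords
    )

# Two hypothetical texts: the first uses the expected keywords,
# the second covers similar topics with different vocabulary.
report = "Quarterly revenue rose while profit margins held steady."
memo = "Earnings grew and staff compensation was reviewed."

print(categorize(report))  # ['finance']
print(categorize(memo))    # []
```

The second text concerns both finance and personnel, yet it matches neither category because its author chose "earnings" and "compensation" rather than the listed keywords.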
