Applications of Machine Learning for Linguistic Analysis of Texts

Rosemary Torney, John Yearwood, Peter Vamplew, Andrei V. Kelarev

Source Title: Machine Learning Algorithms for Problem Solving in Computational Applications: Intelligent Techniques

DOI: 10.4018/978-1-4666-1833-6.ch008

Abstract

This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. The method addresses the practicalities of applying clustering operations to a very large set of text documents by combining unsupervised clustering with supervised classification. It relies on creating a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are applied in two substages to combine these independent clusterings into one final consensus clustering, which is then used to train fast classifiers so that they can profile very large collections of text and web data. This approach makes it possible to apply advanced, highly accurate, and sophisticated clustering techniques by combining them with fast supervised classification algorithms. For the multistage method to be effective, it is crucial to determine how well the supervised classification algorithms will perform at the final stage, when they are used to process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors’ experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of them achieve very high precision and recall and can be used in practical implementations of the method.

Chapter Preview

Introduction

The Internet and email have revolutionised both business and personal communication, negating the problems of distance and time zones (Alrawi & Sabry, 2009). Although a small percentage of dubious enterprises have always been prepared to prey on unsuspecting customers, in the real world it is usually possible to trace these unscrupulous establishments. The anonymity of the Internet makes this far more difficult: there is no physical location to return to, and the victim has not seen or heard the perpetrator and so cannot give a description to law enforcement agencies. Criminal elements seem to rely on the anonymity of cyberspace to protect them while they engage in illegal activities such as scams, phishing and predatory behavior (Chaski, 2008). However, they must make contact with their victims, and this is usually achieved through some form of text communication. This is where authorship analysis can be applied to extract details about the identity or profile of the author on the basis of their use of language. It has been found, for example in Abbasi & Chen (2008), Baayen et al. (2002) and Chaski (2005), that authors leave a textual “fingerprint” behind in their choice of language. Stylometry, or authorship analysis, has been used to determine the authenticity of evidence presented for both the prosecution and the defense in US courts, as reported in Chaski (2008).

The development of automated methods for various aspects of linguistic analysis based on machine learning techniques is a major and very actively investigated research topic. To illustrate, we refer to just a few recent articles: Agarwal et al. (2009), Bao et al. (2009), Bian & Tao (2009), Ikeda et al. (2009), Long et al. (2009), Malik & Kender (2008), Momma et al. (2009), Nakajima et al. (2005), Negi et al. (2009), Ni et al. (2007), Park et al. (2009), Roth et al. (2009), Sindhwani et al. (2008). Clustering of documents based on similar linguistic features often forms an early stage in these automated analysis methods.

Several authors have demonstrated that ensemble clustering approaches can be highly useful for solving various problems, as in Aho & Dzeroski (2009), Domeniconi et al. (2009), Lu et al. (2009), Read (2008). Highly sophisticated and effective consensus functions and heuristics for clustering ensembles have been developed, for example, by Ailon, Charikar and Newman (2005), Fern & Brodley (2004, 2004A), Goder & Filkov (2008), Strehl & Ghosh (2002). However, such methods are not practically applicable to the very large numbers of documents that are often encountered in linguistic analysis tasks.

This chapter proposes a novel multistage method for linguistic clustering of very large collections of documents available on the Internet. The method is based on creating a multitude of independent initial clusterings of a randomized sample of texts from the International Corpus of Learner English (ICLE). The ICLE corpus represents a unique collection of essays with detailed authorship information. Two substages of the method apply advanced consensus functions and sophisticated ensemble clustering algorithms to obtain a final consensus clustering of the sample, which is then used to train fast supervised classifiers.
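
The overall shape of such a pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy feature vectors, the largest-gap base clusterer, the co-association (evidence accumulation) consensus step and the nearest-centroid classifier are all assumptions chosen for brevity, standing in for the ICLE sample, the ensemble of clusterings, the consensus functions and the fast classifiers that the chapter refers to but does not specify in this preview.

```python
from itertools import combinations

# Toy "documents": each row is a hypothetical stylometric feature vector
# (average sentence length, function-word rate, type-token ratio).
SAMPLE = [
    (12.0, 0.40, 0.50),
    (13.0, 0.42, 0.52),
    (12.5, 0.41, 0.70),
    (21.0, 0.55, 0.71),
    (22.0, 0.57, 0.72),
    (20.5, 0.56, 0.69),
]


def split_on_largest_gap(values):
    """Base clusterer (assumed): cut one feature in two at its widest gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[r + 1]] - values[order[r]] for r in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = 0 if rank <= cut else 1
    return labels


def co_association(clusterings, n):
    """Fraction of base clusterings that place each pair of items together."""
    return {
        (i, j): sum(c[i] == c[j] for c in clusterings) / len(clusterings)
        for i, j in combinations(range(n), 2)
    }


def consensus_labels(coassoc, n, threshold=0.5):
    """Link pairs that co-occur in more than `threshold` of the clusterings
    and return the connected components as consensus cluster labels."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for (i, j), share in coassoc.items():
        if share > threshold:
            parent[find(j)] = find(i)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def train_nearest_centroid(points, labels):
    """Fast supervised stage (assumed): one centroid per consensus cluster."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def classify(centroids, point):
    """Assign a new document to the cluster with the nearest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(point, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


# Stage 1: several independent base clusterings, here one per feature.
base = [split_on_largest_gap([doc[f] for doc in SAMPLE]) for f in range(3)]
# Stage 2: combine them into a single consensus clustering.
consensus = consensus_labels(co_association(base, len(SAMPLE)), len(SAMPLE))
# Stage 3: train a cheap classifier on the consensus labels and reuse it
# on documents that were never clustered.
model = train_nearest_centroid(SAMPLE, consensus)
print(consensus)
print(classify(model, (21.5, 0.54, 0.70)))
```

In this toy data the third feature alone groups the third document with the wrong cluster, and the consensus step outvotes that single clustering. The expensive ensemble step runs only on the small sample, while the classifier, which is cheap to apply, handles any further documents. Because the features are left unscaled here, the first one dominates the distance computation; a realistic pipeline would normalise them first.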
