Machine Learning Applications in Mega-Text Processing

Marina Sokolova, Stan Szpakowicz

Source Title: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques

DOI: 10.4018/978-1-60566-766-9.ch015

Abstract

This chapter presents applications of machine learning techniques to problems in natural language processing that require work with very large amounts of text. Such problems came into focus after the Internet and other computer-based environments acquired the status of the prime medium for text delivery and exchange. In all cases which the authors discuss, an algorithm has ensured a meaningful result, be it the knowledge of consumer opinions, the protection of personal information or the selection of news reports. The chapter covers elements of opinion mining, news monitoring and privacy protection, and, in parallel, discusses text representation, feature selection, and word category and text classification problems. The applications presented here combine scientific interest and significant economic potential.

Introduction

The chapter presents applications of Machine Learning (ML) to problems that involve processing large amounts of text. Problems best served by ML came into focus after the Internet and other computer-based environments acquired the status of the prime medium for text delivery and exchange. That is when the ability to work with extremely large amounts of text, which ML applications had not previously faced, became a major issue. The resulting set of techniques and practices, which we name mega-text language processing, is meant to deal with a mass of informally written, loosely edited text. A case in point is the analysis of opinions expressed in short informal texts written and put on the Web by the general public (Liu 2006). The sheer volume and variety of suddenly available language data has necessarily invited the use of software capable of handling such a mass of data, learning from it and acquiring new information.

Until now, no clearly delineated subfield of Natural Language Processing (NLP) has dealt with mega-texts: textual data on the Web, computer-mediated text repositories and, in general, texts in electronic format. Text Data Mining, a form of Data Mining, concerns itself with deriving new information from texts, but most often refrains from the study of language. Still, many researchers focus on the study of language in such texts, for example on lexical, grammatical and stylistic issues (Crystal 2006; Liu 2006). That no overarching NLP discipline has emerged can be explained by the fact that electronic texts and old-fashioned texts in books or newspapers share major characteristics. We discuss these characteristics in the handbook chapter “Machine Learning in Natural Language Processing”.

This chapter will show that ML techniques measure up well to the challenges that mega-texts pose. We focus on applications in aid of the study of language. In all the cases we discuss, an algorithm has ensured a meaningful result, be it the knowledge of consumer opinions, the protection of personal information or the selection of news reports. Although we mostly focus in this chapter on text classification problems, we go beyond document topic classification. English, the most popular language of the Web, is the default language of much of the scientific discourse. We state explicitly when a problem deals with a language other than English.

In the chapter we cite standard measures used in NLP (Precision, Recall, F-score). Calculated for classifiers produced by an algorithm, they build on the numbers of correctly classified positive examples (true positives, TP), examples incorrectly classified as positive (false positives, FP), and examples incorrectly classified as negative (false negatives, FN).

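Precision:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

Recall:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

F-score is a weighted harmonic mean of Precision and Recall, where the weight $\beta$ sets the relative importance of Recall ($\beta = 1$ weights the two equally):

$$\text{F-score} = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{3}$$

In some cases authors use the traditional Accuracy, which we cite. It additionally counts the correctly classified negative examples (true negatives, TN):

$$A = \frac{TP + TN}{TP + FP + FN + TN} \tag{4}$$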
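
As an illustration, the four measures can be computed directly from the counts of a binary classifier. The sketch below is not taken from the chapter: it assumes the standard definitions (Precision = TP / (TP + FP), Recall = TP / (TP + FN), F-score as the β-weighted harmonic mean of the two, Accuracy = (TP + TN) / all examples), and the names `ConfusionCounts`, `precision`, `recall`, `f_score` and `accuracy`, as well as the example counts, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier (hypothetical helper type)."""
    tp: int  # positive examples classified as positive
    fp: int  # negative examples classified as positive
    fn: int  # positive examples classified as negative
    tn: int  # negative examples classified as negative


def precision(c: ConfusionCounts) -> float:
    # Share of predicted positives that are truly positive.
    predicted_pos = c.tp + c.fp
    return c.tp / predicted_pos if predicted_pos else 0.0


def recall(c: ConfusionCounts) -> float:
    # Share of truly positive examples that were found.
    actual_pos = c.tp + c.fn
    return c.tp / actual_pos if actual_pos else 0.0


def f_score(c: ConfusionCounts, beta: float = 1.0) -> float:
    # Weighted harmonic mean of Precision and Recall;
    # beta > 1 favours Recall, beta < 1 favours Precision.
    p, r = precision(c), recall(c)
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0


def accuracy(c: ConfusionCounts) -> float:
    # Share of all examples that were classified correctly.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Made-up counts, for illustration only.
counts = ConfusionCounts(tp=40, fp=10, fn=20, tn=30)
print(f"Precision = {precision(counts):.3f}")
print(f"Recall    = {recall(counts):.3f}")
print(f"F-score   = {f_score(counts):.3f}")
print(f"Accuracy  = {accuracy(counts):.3f}")
```

Returning 0.0 when a denominator is zero is a design choice of this sketch; some toolkits raise an error or emit a warning instead.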

Key Terms in this Chapter

Opinion Mining: automatic or semi-automatic search for opinions expressed in texts.

News Monitoring: automated tracking of online news.

Natural Language Processing: theory, design and implementation of systems for the analysis, understanding and generation of written or spoken language.

Mega-Text Language Processing: Natural Language Processing applied to large volumes of Web-based, computer-mediated, and other electronic-format texts.

Privacy Protection in Texts: protection of personal information that could reveal a person’s identity.

Mega-Text: large volumes of Web-based, computer-mediated, and other electronic-format texts.

Text Classification: automatic assignment of a tag, chosen from a set of tags, to a text.
