Improving the K-Means Clustering Algorithm Oriented to Big Data Environments

Joaquín Pérez Ortega, Nelva Nely Almanza Ortega, Andrea Vega Villalobos, Marco A. Aguirre L., Crispín Zavala Díaz, Javier Ortiz Hernandez, Antonio Hernández Gómez
DOI: 10.4018/978-1-7998-4730-4.ch013

Abstract

In recent years, the amount of natural-language text in digital format has increased dramatically. Obtaining useful information from such large volumes of data requires new specialized techniques and efficient algorithms. Text mining consists of extracting meaningful patterns from texts; one of its basic approaches is clustering, and the most widely used clustering algorithm is k-means. This chapter proposes an improvement to the convergence step of the k-means algorithm: the process stops whenever the number of objects that change their assigned cluster in the current iteration is larger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Remarkably, results generally improve as the volume of text increases, particularly for texts in big data environments.

Introduction

In recent years the amount of data has increased dramatically, driven by, among other sources, the Internet, social media, patient records, and book repositories in digital format. To give an idea of this accelerated growth, according to the International Data Corporation (Gantz & Reinsel, 2011), the world generated 1.8 zettabytes of digital information in 2011, and that amount was expected to grow fifty-fold by 2020 (Ingersoll, Morton, & Farris, 2012). Much of the information generated is textual, because text is the most natural way to store it. This volume of text is an invaluable source of information and knowledge that allows us to make better decisions and improve our understanding of the phenomena under study. However, handling and processing large, complex volumes of data with standard tools is generally limited by computing resources. In this sense, our contribution is a strategy for grouping objects (documents, paragraphs, and terms) according to their attributes (dimensions, characteristics, properties) in the context of Big Data.

The problem being addressed is complex and requires integrating knowledge from several disciplines, including natural language processing, text mining, and combinatorial optimization, together with Lloyd's K-Means algorithm.
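For reference, the optimization problem that K-Means addresses is usually stated as minimizing the within-cluster sum of squared errors; this is the standard formulation, which the Background section develops formally:

    \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where \mu_j denotes the centroid (mean) of cluster C_j.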

The rest of the chapter proceeds as follows. In the Background section, the clustering problem is formulated as an optimization problem; the K-Means algorithm is then described and its behavior analyzed. The subsection Literature Review on K-Means briefly reviews the most relevant research on improving K-Means. Highlights of related work on Text Mining applications appear in the Literature Review of K-Means in Text Mining subsection. In the Proposed Algorithm section, the behavior of K-Means is analyzed, and two observations are made:

  • a) a strong correlation between the decreasing value of the objective function and the number of objects that change cluster per iteration, and

  • b) that the number of objects that change cluster starts to oscillate when the value of the objective function changes very little from one iteration to the next.

These observations inspired our proposal for a new convergence criterion. In the Experimental Validation section, we discuss the validation of our proposal through tests on real and synthetic datasets. Possible extensions of this research appear in the Future Research Directions section. Finally, our conclusions are presented in the Conclusions section.
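To make the criterion concrete, the following is a minimal sketch of K-Means with this stopping rule. It is our own illustration, not the authors' published code; the function name kmeans_early_stop and its parameters are assumptions.

    import numpy as np

    def kmeans_early_stop(X, k, max_iter=100, seed=0):
        """K-Means with the proposed stopping rule: halt when the number
        of objects that change cluster in the current iteration exceeds
        the count from the previous iteration (onset of oscillation),
        or when no object changes cluster at all."""
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling k distinct objects from X.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        prev_changes = float("inf")
        for _ in range(max_iter):
            # Assignment step: each object joins its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            changes = int((new_labels != labels).sum())
            if changes == 0 or changes > prev_changes:
                break  # classical convergence, or oscillation detected
            labels, prev_changes = new_labels, changes
            # Update step: recompute each centroid as its cluster's mean.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

Note that the loop also stops when no object changes cluster, which is the classical convergence condition; the new rule only adds the oscillation test (changes > prev_changes).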

Key Terms in this Chapter

Text Mining: Process of analyzing textual sources, with the aim of identifying relevant patterns.

Normalization: Tasks aimed at putting all text on an equal footing, e.g., removing punctuation and converting numbers into words, among others.

Clustering: Partitioning a set of n objects into k ≥ 2 non-empty subsets (called clusters) in such a way that the objects within any cluster have similar attributes and, at the same time, differ from the objects in every other cluster.

Filtering: Set of techniques for removing words from a text.

Tokenization: Task of dividing a sentence, paragraph or a whole text into smaller units such as words or terms (tokens).

Stemming: Methods aimed at reducing derived words to their origin or root form (a pipeline combining these preprocessing terms is sketched after this list).

Lemmatization: Relates a derived word to its canonical form.

Big Data: Although there is no consensus on how much data should be considered Big Data, it can be defined as the point at which the computing capabilities of a given piece of equipment are exceeded.

NLP: The branch of Artificial Intelligence that enables machines to understand and interpret human language.

Data Mining: Process of discovering relevant patterns in a dataset.
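As an illustration of how several of the preprocessing terms above (tokenization, normalization, filtering, stemming) compose in practice, the following is a minimal, self-contained sketch. The stop-word set and the suffix-stripping stemmer are deliberately naive placeholders, not the chapter's method; a real pipeline would use a full stop-word list and an established stemmer such as Porter's.

    import re

    # Illustrative stop-word subset; a real system would use a fuller list.
    STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are"}

    def tokenize(text):
        # Tokenization: split the text into word tokens.
        return text.split()

    def normalize(token):
        # Normalization: lowercase and strip punctuation/symbols.
        return re.sub(r"[^\w]", "", token.lower())

    def stem(token):
        # Stemming: naive suffix stripping (stand-in for, e.g., Porter's stemmer).
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        tokens = (normalize(t) for t in tokenize(text))
        # Filtering: drop stop words and empty tokens, then stem what remains.
        return [stem(t) for t in tokens if t and t not in STOP_WORDS]

    print(preprocess("The clusters are forming in big data environments."))
    # -> ['cluster', 'form', 'big', 'data', 'environment']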
