Improving the K-Means Clustering Algorithm Oriented to Big Data Environments

Joaquín Pérez Ortega, Nelva Nely Almanza Ortega, Andrea Vega Villalobos, Marco A. Aguirre L., Crispín Zavala Díaz, Javier Ortiz Hernandez, Antonio Hernández Gómez
DOI: 10.4018/978-1-7998-4730-4.ch013

Abstract

In recent years, the amount of natural-language text in digital format has increased dramatically. Obtaining useful information from such large volumes of data requires new specialized techniques and efficient algorithms. Text mining consists of extracting meaningful patterns from texts; one of its basic approaches is clustering, and the most widely used clustering algorithm is k-means. This chapter proposes an improvement to the convergence step of the k-means algorithm: the process stops whenever the number of objects that change their assigned cluster in the current iteration is larger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Remarkably, results generally improve as the volume of text increases, particularly for texts in big data environments.

Introduction

In recent years the amount of data has increased dramatically, driven by, among other sources, the Internet, social media, patient records, and book repositories in digital format. To give an idea of this accelerated growth, according to the International Data Corporation (Gantz & Reinsel, 2011), the world generated 1.8 zettabytes of digital information in 2011, and that amount was expected to grow fifty-fold by 2020 (Ingersoll, Morton, & Farris, 2012). Much of the information generated is textual, because text is the most natural way to store it. This volume of text is an invaluable source of information and knowledge that allows us to make better decisions and improve our understanding of the phenomena under study. However, handling and processing large, complex volumes of data with standard tools is generally limited by computing resources. In this sense, our contribution is a strategy for grouping objects (documents, paragraphs, and terms) according to their attributes (dimensions, characteristics, properties) in the context of Big Data.

The problem being addressed is complex and requires integrating knowledge from several disciplines, including natural language processing, text mining, and combinatorial optimization, together with Lloyd's K-Means algorithm.
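For reference, the optimization problem that K-Means addresses is usually stated as minimizing the within-cluster sum of squared errors; this is the standard formulation, which the Background section develops formally:

    \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where \mu_j denotes the centroid (mean) of cluster C_j.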

The rest of the chapter proceeds as follows. In the Background section, the clustering problem is formulated as an optimization problem; the K-Means algorithm is then described and its behavior analyzed. The subsection Literature Review on K-Means briefly reviews the most relevant research on improving K-Means. Highlights of related work on Text Mining applications appear in the Literature Review of K-Means in Text Mining subsection. In the Proposed Algorithm section, the behavior of K-Means is analyzed, and two observations are made:

  • a) a strong correlation between the decreasing value of the objective function and the number of objects that change cluster per iteration, and

  • b) that the number of objects that change cluster starts to oscillate when the value of the objective function changes very little from one iteration to the next.

These observations inspired our proposal for a new convergence criterion. In the Experimental Validation section, we discuss the validation of our proposal through tests on real and synthetic datasets. Possible extensions of this research appear in the Future Research Directions section. Finally, our conclusions are presented in the Conclusions section.
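To make the criterion concrete, the following is a minimal sketch of K-Means with this stopping rule. It is our own illustration, not the authors' published code; the function name kmeans_early_stop and its parameters are assumptions.

    import numpy as np

    def kmeans_early_stop(X, k, max_iter=100, seed=0):
        """K-Means with the proposed stopping rule: halt when the number
        of objects that change cluster in the current iteration exceeds
        the count from the previous iteration (onset of oscillation),
        or when no object changes cluster at all."""
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling k distinct objects from X.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        prev_changes = float("inf")
        for _ in range(max_iter):
            # Assignment step: each object joins its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            changes = int((new_labels != labels).sum())
            if changes == 0 or changes > prev_changes:
                break  # classical convergence, or oscillation detected
            labels, prev_changes = new_labels, changes
            # Update step: recompute each centroid as its cluster's mean.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

Note that the loop also stops when no object changes cluster, which is the classical convergence condition; the new rule only adds the oscillation test (changes > prev_changes).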

Key Terms in this Chapter

Text Mining: Process of analyzing textual sources, with the aim of identifying relevant patterns.

Normalization: Tasks aimed at putting all text on an equal footing, e.g., removing punctuation and converting numbers into words, among others.

Clustering: Partitioning a set of n objects into k ≥ 2 non-empty subsets (called clusters) in such a way that the objects within any cluster have similar attributes and, at the same time, differ from the objects in every other cluster.

Filtering: Set of techniques for removing words from a text.

Tokenization: Task of dividing a sentence, paragraph or a whole text into smaller units such as words or terms (tokens).

Stemming: Methods aimed at reducing derived words to their origin or root form (a pipeline combining these preprocessing terms is sketched after this list).

Lemmatization: Relates a derived word to its canonical form.

Big Data: Although there is no consensus on how much data should be considered Big Data, it can be defined as the point at which the computing capabilities of a given piece of equipment are exceeded.

NLP: The branch of Artificial Intelligence that enables machines to understand and interpret human language.

Data Mining: Process of discovering relevant patterns in a dataset.
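As an illustration of how several of the preprocessing terms above (tokenization, normalization, filtering, stemming) compose in practice, the following is a minimal, self-contained sketch. The stop-word set and the suffix-stripping stemmer are deliberately naive placeholders, not the chapter's method; a real pipeline would use a full stop-word list and an established stemmer such as Porter's.

    import re

    # Illustrative stop-word subset; a real system would use a fuller list.
    STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are"}

    def tokenize(text):
        # Tokenization: split the text into word tokens.
        return text.split()

    def normalize(token):
        # Normalization: lowercase and strip punctuation/symbols.
        return re.sub(r"[^\w]", "", token.lower())

    def stem(token):
        # Stemming: naive suffix stripping (stand-in for, e.g., Porter's stemmer).
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        tokens = (normalize(t) for t in tokenize(text))
        # Filtering: drop stop words and empty tokens, then stem what remains.
        return [stem(t) for t in tokens if t and t not in STOP_WORDS]

    print(preprocess("The clusters are forming in big data environments."))
    # -> ['cluster', 'form', 'big', 'data', 'environment']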
