Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

Ilias K. Savvas, Georgia N. Sofianidou, M-Tahar Kechadi
Copyright: © 2014 | Pages: 24
DOI: 10.4018/978-1-4666-4699-5.ch002


Big data refers to data sets whose size exceeds the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for the distributed processing of large data sets: HDFS is a distributed file system that provides high-throughput access to data for distributed applications, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data-mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the technique's efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
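The map/reduce decomposition of one K-means iteration described above can be sketched in plain Python. This is an illustrative single-machine simulation, not the authors' actual Hadoop code: the map phase emits (cluster index, point) pairs by assigning each point to its nearest centroid, and the reduce phase averages the points in each cluster to produce the new centroids.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    """Map step: emit (centroid_index, point) pairs."""
    return [(nearest(p, centroids), p) for p in points]

def reduce_phase(pairs, centroids):
    """Reduce step: average the points assigned to each centroid.
    An empty cluster keeps its previous centroid."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for idx, p in pairs:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    return [[s / counts[i] for s in sums[i]] if counts[i]
            else list(centroids[i])
            for i in range(k)]
```

In a real Hadoop job, the shuffle stage groups the mapper's pairs by cluster index before the reducers run; iterating this map/reduce pair until the centroids stop moving reproduces Lloyd's algorithm in a distributed setting.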
Chapter Preview

There have been extensive studies on various clustering methods, and k-means clustering in particular has received a great deal of attention. However, there is very little work on applying k-means within MapReduce. Since its early development, k-means clustering (Lloyd, 1982) has been known to have a very high complexity, and significant effort has been spent on tuning the algorithm and improving its performance. While k-means is a very simple and straightforward algorithm, it has two main issues: 1) the choice of the number of clusters and of the initial centroids, and 2) the iterative nature of the algorithm, which heavily impacts its scalability as the size of the dataset increases. Many researchers have come up with various algorithms that:

  • Improve the accuracy of the final clusters;

  • Help in choosing appropriate initial centroids;

  • Reduce the number of iterations;

  • Handle well the outliers.
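One well-known answer to the initial-centroid issue in the list above is k-means++ seeding: pick the first centroid at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. A minimal sketch in Python (not from the chapter; the function name and fixed seed are illustrative):

```python
import math
import random

def kmeans_pp_seed(points, k, rng=None):
    """k-means++ seeding: the first centroid is chosen uniformly;
    each later one with probability proportional to the squared
    distance to the nearest centroid chosen so far."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance of each point to its nearest chosen centroid
        d2 = [min(math.dist(p, c) ** 2 for c in centroids)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:          # weighted roulette-wheel selection
                centroids.append(p)
                break
    return centroids
```

Because already-chosen points have zero weight, well-separated clusters tend to each contribute a seed, which typically reduces both the number of iterations and the risk of poor local optima.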
