Improving Efficiency of K-Means Algorithm for Large Datasets

Ch. Swetha Swapna (Department of Computer Science and Engineering, Jawaharlal Nehru Technological University, Kakinada, India), V. Vijaya Kumar (Department of Computer Science and Engineering, Jawaharlal Nehru Technological University, Hyderabad, India) and J.V.R Murthy (Department of Computer Science and Engineering, Jawaharlal Nehru Technological University, Kakinada, India)
Copyright: © 2016 | Pages: 9
DOI: 10.4018/IJRSDA.2016040101

Abstract

Clustering is the process of grouping objects into different classes based on their similarities. K-means is a widely studied partition-based algorithm. It is reported to work efficiently for small datasets; however, its computation time degrades considerably for large datasets. Researchers have proposed several modifications to address this issue. This paper proposes a novel way of handling large datasets with K-means in a distributed manner to obtain efficiency. The concept of parallel processing is exploited by dividing the dataset into a number of baskets and then applying K-means to each basket in parallel. The proposed BasketK-means provides very competitive performance with considerably less computation time. Simulation results on various real and synthetic datasets presented in the work clearly demonstrate the effectiveness of the proposed approach.
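The basket idea summarized above can be sketched roughly as follows, assuming the baskets are contiguous splits of the data and that the per-basket centroids are merged by a final K-means pass over the collected centroids; the names basket_kmeans and cluster_basket and the merge step are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
# Hypothetical sketch of the basket idea: split the data into baskets,
# cluster each basket in parallel, then combine the per-basket results.
from multiprocessing import Pool

import numpy as np
from sklearn.cluster import KMeans  # standard K-means applied inside each basket


def cluster_basket(args):
    basket, k = args
    return KMeans(n_clusters=k, n_init=10).fit(basket).cluster_centers_


def basket_kmeans(data, k, n_baskets=4):
    baskets = np.array_split(data, n_baskets)       # divide the dataset into baskets
    with Pool(n_baskets) as pool:                   # run K-means on each basket in parallel
        centers = pool.map(cluster_basket, [(b, k) for b in baskets])
    # Assumed merge step: cluster the collected per-basket centroids into k final centroids.
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(centers)).cluster_centers_


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10000, 2))
    print(basket_kmeans(data, k=3))
```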
Article Preview

The most natural and commonly used criterion function in partitional clustering is the squared-error criterion, which tends to work well with isolated and compact clusters (Pang-Ning, Michael & Vipin, 2007; Xu & Wunsch, 2005). The sum of the squared errors between the points and their corresponding centroids is equal to the total intra-cluster variance:

E = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2    (1)

where dist(x, c_i) is the Euclidean distance between two objects in the Euclidean space, c_i is the centroid of the i-th cluster, and x is a data object.
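As a concrete illustration of Eq. (1), the short sketch below computes the total intra-cluster squared error for a given assignment; the names squared_error, points, labels, and centroids are illustrative only.

```python
import numpy as np


def squared_error(points, labels, centroids):
    """Total intra-cluster variance: sum of squared Euclidean distances
    between each point and its assigned centroid, as in Eq. (1)."""
    diffs = points - centroids[labels]   # x - c_i for every object x
    return float(np.sum(diffs ** 2))     # sum of squared distances


# Example: two compact clusters around (0, 0) and (10, 10).
pts = np.array([[0.0, 0.1], [0.1, 0.0], [10.0, 10.1], [10.1, 10.0]])
lbl = np.array([0, 0, 1, 1])
cen = np.array([[0.05, 0.05], [10.05, 10.05]])
print(squared_error(pts, lbl, cen))  # small value -> compact clusters
```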

A standard K-means algorithm was first proposed by Stuart Lloyd in 1957. The term K-means was first used by James MacQueen in 1967 (Jain, Murty & Flynn, 1999). K-means is a partition-based clustering algorithm; it groups objects in a continuous n-dimensional space, using the centroid as the mean of a group of objects (Nazeer & Sebastian, 2009; Ramakrishna, JVR, Prasad & Suresh, 2014). Let P = {P_i}, i = 1, …, n, be the set of data points to be clustered into K clusters without any prior knowledge of the input objects. The predefined number of groups is indicated by K, which is provided as an input parameter. Each object is assigned to a cluster based on its proximity to the cluster mean; the means are then recomputed, and the process of assigning objects to clusters resumes (Ramakrishna, JVR, Prasad & Suresh, 2014).
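The assign-and-recompute loop described above can be sketched directly in NumPy; the random initialization and the tolerance-based stopping rule are common choices assumed here, not necessarily those of the cited works.

```python
import numpy as np


def kmeans(P, K, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = P[rng.choice(len(P), K, replace=False)]  # pick K initial means from the data
    for _ in range(max_iter):
        # Assignment step: proximity of every object to every cluster mean.
        dists = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each mean from the objects assigned to it.
        new_centroids = np.array([
            P[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:  # converged
            break
        centroids = new_centroids
    return centroids, labels
```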
