Introduction
Data mining is among the most critical tasks in the era of big data. Cluster analysis is one of the most basic tasks of data mining (Kao & Cheng, 2006); it divides a set of data objects into multiple groups. Data objects in the same group are closely similar to one another, whereas dissimilar objects belong to different groups (Ding et al., 2016; Yang et al., 2004). Data objects are grouped, or clustered, by analyzing the similarity and dissimilarity between the data in the data set (Hidayat, Fatichah, & Ginardi, 2016; Jabbar, Ku-Mahamud, & Sagban, 2018). Cluster analysis is also called unsupervised learning because the class labels, and even the number of classes, of the data objects are unknown before the data are analyzed (Gonzalez-Pardo, Jung, & Camacho, 2017; Han, Pei, & Kamber, 2011). Although cluster analysis and classification are not the same task, cluster analysis can serve as a preliminary step for classification (Baig, Shahzad, & Khan, 2013). That is, when it is not known which labels a set of data objects can be divided into, cluster analysis can first be used to place similar data objects in the same groups; class labels are then attached to those groups according to certain principles. If the data are sufficient, the class labels generated in this way can be used for data classification, and the data set can serve as the training set for the classification task.
Over the past two decades, swarm intelligence has attracted a great deal of interest among researchers because of its dynamic and flexible capabilities and its efficiency in solving real-world nonlinear problems, and many swarm intelligence-based algorithms have been introduced for optimization in various areas of computer science (Anand Nayyar & Nayyar, 2018). The ant colony optimization algorithm is a swarm intelligence algorithm developed from ideas in natural genetics and the natural evolution of the biological world (Gonzalez-Pardo et al., 2017). As part of swarm intelligence, it solves complex combinatorial optimization problems by mimicking cooperative behavior among ants (Anand Nayyar, 2018). The algorithm has strong global search ability and does not depend on the form of the objective function, so it has been applied to the clustering problem (Menéndez, Otero, & Camacho, 2016; Monmarché, Slimane, & Venturini, 1999). It is also particularly good at solving discrete, stochastic, and dynamic problems (A Nayyar & Singh, 2016) and routing issues in sensor networks (Anand Nayyar & Singh, 2014). The basic ant colony clustering algorithm (ACOC) aims to assign N data objects to K groups by minimizing the squared Euclidean distances between the data objects and the centers of their corresponding groups (Zhang Jianhua Jiang He, 2006). ACOC uses artificial ants (agents) to construct paths. Each artificial ant starts with an empty string of length N, in which each element represents a data object in the data set and the value of that element represents the group to which the corresponding data object is assigned (Gao, Wang, Cheng, Inazumi, & Tang, 2016; Pei Zhenkui Li Hua, 2008).
To improve the convergence rate, the principle of direct allocation is adopted in the initial stage of the ACOC algorithm, placing the ants on data points at random and generating a random global memory (Wang & Luo, 2019). To further improve the convergence and search ability of ACOC, the variation factor of the genetic algorithm has been combined with the ant colony algorithm. This enables the ant colony algorithm to generate initial data for the genetic algorithm in each iteration, which increases the diversity of the population, expands the search scope of the solution, and helps avoid becoming trapped in a local optimum (Wu, Yan, Zhang, & Shen, 2018). A hybrid ACO-clustering approach to big data preprocessing has also been proposed, which can increase search speed by optimizing the process. Because the method combines ant colony optimization with a clustering algorithm, it also helps reduce preprocessing time and increase analytical accuracy and efficiency (Singh, Singh, & Pant, 2019).
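To make the ACOC solution encoding described above concrete, the following minimal Python sketch represents a solution as a length-N list of group indices and evaluates it by the sum of squared Euclidean distances to the group centers. It is an illustration under stated assumptions, not the implementation from any of the cited works: the function names (`centroids`, `objective`, `construct_solution`) are hypothetical, the objective is assumed to be summed over all data objects, group centers are assumed to be the means of their members, and the ant is assumed to choose each group with probability proportional to a pheromone value alone (heuristic information and pheromone updating are omitted).

```python
import random


def centroids(data, assignment, k):
    """Mean of the points assigned to each of the k groups (None if a group is empty)."""
    dim = len(data[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, group in zip(data, assignment):
        counts[group] += 1
        for d, value in enumerate(point):
            sums[group][d] += value
    return [
        [s / counts[g] for s in sums[g]] if counts[g] else None
        for g in range(k)
    ]


def objective(data, assignment, k):
    """Sum of squared Euclidean distances from each point to its group center."""
    centers = centroids(data, assignment, k)
    total = 0.0
    for point, group in zip(data, assignment):
        center = centers[group]  # never None: this group has at least this point
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


def construct_solution(n, k, pheromone, rng):
    """One ant fills an initially empty length-n string with group indices.

    Assumption: element i is drawn with probability proportional to
    pheromone[i][g] for each group g; no heuristic term is used.
    """
    solution = [None] * n  # the "empty string" of length N
    for i in range(n):
        solution[i] = rng.choices(range(k), weights=pheromone[i])[0]
    return solution


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    k = 2
    print(objective(data, [0, 0, 1, 1], k))  # compact groups: 1.0
    print(objective(data, [0, 1, 0, 1], k))  # mixed groups: 200.0

    rng = random.Random(0)
    pheromone = [[1.0] * k for _ in data]  # uniform initial pheromone (assumed)
    print(construct_solution(len(data), k, pheromone, rng))
```

In this toy data set, the assignment that keeps nearby points together scores far lower than the one that mixes them, which is the signal an ant colony would use to favor better strings over successive iterations.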