Article Preview
Top1. Introduction
Anomaly detection is to detect the instances that deviate from most instances and do not obey the distribution of most instances (Liu, Chen, & Ni, 2014; Xie, Han, Tian, & Parvin, 2011; Yu, Tang, & Han, 2011). Anomaly detection has been widely used in several application domains. For example, health management can be achieved by detecting the status of health data. In cloud computing (Baucke et al., 2014), it is very important to detect abnormal traffic in the cloud with a timely manner. In cloud storage (Bellini et al., 2015; Ming-Jen Huang, 2016), timely detection of abnormal disk read and writes can greatly reduce the potential risk of cloud storage.
There are many anomaly detection algorithms, including distances-based algorithms (Knorr & Ng, 1998; Sugiyama & Borgwardt, 2013), classification-based algorithms (Abe, Zadrozny, & Langford, 2006; Shi & Horvath, 2006), clustering-based algorithms (He, Xu, & Deng, 2003; Niu, Huang, Zhang, & Chen, 2007), density-based algorithms (Breunig, Kriegel, Ng, & Sander, 2000; Salehi, Leckie, Bezdek, Vaithianathan, & Zhang, 2016; Tang & He, 2017; Yan, Cao, Kulhman, & Rundensteiner, 2017), angle-based algorithms (Kriegel, Hubert, & Zimek, 2008; Pham & Pagh, 2012), and subspace-based algorithms (Keller et al., 2012, Muller et al., 2011).
Distances-based algorithms determine abnormal instances by simply finding the instances that are distant with most of other instances. Cluster-based algorithms firstly cluster instances into different cluster by calculating the pair-wise distances. Then, it can determine anomaly instances that are the instances that do not lie in any large clusters. Some density-based algorithms measure the abnormal degree of instances based on the density of instances, such as local density-based algorithm Local Outlier Factor (LOF) (Breunig et al., 2000). And there are some variants of basic LOF algorithms proposed, such as connectivity-based outlier factor (CLOF) (Tang, Chen, Fu, & Cheung, 2002) and a spatial local outlier measure (SLOM) (Sun & Chawla, 2004). There is also an anomaly detection algorithm based on the local kernel density estimation (Tang & He, 2017), which need calculate k nearest neighbors of any instances. Most of above algorithms need calculate the distance between instances to determine whether they are abnormal. There are many ways to calculate the distance, such as Euclidean Distance, Cosine Distance, Edit Distance and so on. Most of above algorithms use Euclidean Distance, so these algorithms can also be influenced by the disadvantage of Euclidean Distance. For example, Euclidean Distance cannot obtain measure precise similar result for high dimensional data, since instances in high dimensional are distributed sparsely (Kriegel et al., 2008; Pham & Pagh, 2012; Yu et al., 2011). What’s more, many of the above algorithms are constrained to low dimensional data and small data size due to the high computational complexity of its original algorithm.