Anomaly Detection Algorithm Based on Subspace Local Density Estimation

Anomaly Detection Algorithm Based on Subspace Local Density Estimation

Chunkai Zhang (Harbin Institute of Technology, ShenZhen, China) and Ao Yin (Harbin Institute of Technology, ShenZhen, China)
Copyright: © 2019 |Pages: 15
DOI: 10.4018/IJWSR.2019070103

Abstract

In this article, the authors propose a novel anomaly detection algorithm based on subspace local density estimation. The key insight of the proposed algorithm is to build multiple trident trees, which can implement the process of building subspace and local density estimation. Each trident tree (T-tree) is constructed recursively by splitting the data outside of 3 sigma into the left or right subtree and splitting the remaining data into the middle subtree. Each node in trident tree records the number of instances that falls on this node, so each trident tree can be used as a local density estimator. The density of each instance is the average of all trident tree evaluation instance densities, and it can be used as the anomaly score of instances. Since each trident tree is constructed according to 3 sigma principle, it can obtain good anomaly detection results without a large tree height. Theoretical analysis and experimental results show that the proposed algorithm is effective and efficient.
Article Preview
Top

1. Introduction

Anomaly detection is to detect the instances that deviate from most instances and do not obey the distribution of most instances (Liu, Chen, & Ni, 2014; Xie, Han, Tian, & Parvin, 2011; Yu, Tang, & Han, 2011). Anomaly detection has been widely used in several application domains. For example, health management can be achieved by detecting the status of health data. In cloud computing (Baucke et al., 2014), it is very important to detect abnormal traffic in the cloud with a timely manner. In cloud storage (Bellini et al., 2015; Ming-Jen Huang, 2016), timely detection of abnormal disk read and writes can greatly reduce the potential risk of cloud storage.

There are many anomaly detection algorithms, including distances-based algorithms (Knorr & Ng, 1998; Sugiyama & Borgwardt, 2013), classification-based algorithms (Abe, Zadrozny, & Langford, 2006; Shi & Horvath, 2006), clustering-based algorithms (He, Xu, & Deng, 2003; Niu, Huang, Zhang, & Chen, 2007), density-based algorithms (Breunig, Kriegel, Ng, & Sander, 2000; Salehi, Leckie, Bezdek, Vaithianathan, & Zhang, 2016; Tang & He, 2017; Yan, Cao, Kulhman, & Rundensteiner, 2017), angle-based algorithms (Kriegel, Hubert, & Zimek, 2008; Pham & Pagh, 2012), and subspace-based algorithms (Keller et al., 2012, Muller et al., 2011).

Distances-based algorithms determine abnormal instances by simply finding the instances that are distant with most of other instances. Cluster-based algorithms firstly cluster instances into different cluster by calculating the pair-wise distances. Then, it can determine anomaly instances that are the instances that do not lie in any large clusters. Some density-based algorithms measure the abnormal degree of instances based on the density of instances, such as local density-based algorithm Local Outlier Factor (LOF) (Breunig et al., 2000). And there are some variants of basic LOF algorithms proposed, such as connectivity-based outlier factor (CLOF) (Tang, Chen, Fu, & Cheung, 2002) and a spatial local outlier measure (SLOM) (Sun & Chawla, 2004). There is also an anomaly detection algorithm based on the local kernel density estimation (Tang & He, 2017), which need calculate k nearest neighbors of any instances. Most of above algorithms need calculate the distance between instances to determine whether they are abnormal. There are many ways to calculate the distance, such as Euclidean Distance, Cosine Distance, Edit Distance and so on. Most of above algorithms use Euclidean Distance, so these algorithms can also be influenced by the disadvantage of Euclidean Distance. For example, Euclidean Distance cannot obtain measure precise similar result for high dimensional data, since instances in high dimensional are distributed sparsely (Kriegel et al., 2008; Pham & Pagh, 2012; Yu et al., 2011). What’s more, many of the above algorithms are constrained to low dimensional data and small data size due to the high computational complexity of its original algorithm.

Complete Article List

Search this Journal:
Reset
Open Access Articles
Volume 17: 4 Issues (2020): 2 Released, 2 Forthcoming
Volume 16: 4 Issues (2019)
Volume 15: 4 Issues (2018)
Volume 14: 4 Issues (2017)
Volume 13: 4 Issues (2016)
Volume 12: 4 Issues (2015)
Volume 11: 4 Issues (2014)
Volume 10: 4 Issues (2013)
Volume 9: 4 Issues (2012)
Volume 8: 4 Issues (2011)
Volume 7: 4 Issues (2010)
Volume 6: 4 Issues (2009)
Volume 5: 4 Issues (2008)
Volume 4: 4 Issues (2007)
Volume 3: 4 Issues (2006)
Volume 2: 4 Issues (2005)
Volume 1: 4 Issues (2004)
View Complete Journal Contents Listing