Hilbert Index-based Outlier Detection Algorithm in Metric Space

Honglong Xu (Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China), Haiwu Rong (School of Mathematics and Big Data, Foshan University, Foshan, China), Rui Mao (Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China), Guoliang Chen (Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China) and Zhiguang Shan (State Information Center, Beijing, China)
Copyright: © 2016 | Pages: 21
DOI: 10.4018/IJGHPC.2016100103


Big data is profoundly changing the lifestyles of people around the world. Driven by the requirements of applications across many industries, research on big data has been growing. Methods that manage and analyze big data to extract valuable information are central to this research. Starting from the variety challenge of big data, this article proposes a universal big data management and analysis framework based on metric space. Within this framework, the Hilbert Index-based Outlier Detection (HIOD) algorithm is proposed. HIOD can handle any datatype that can be abstracted to metric space, and it achieves a higher detection speed. Experimental results indicate that HIOD effectively overcomes the variety challenge of big data and achieves a 2.02x speedup over iORCA on average and up to 5.57x in certain cases. The number of distance calculations is reduced by 47.57% on average and by up to 89.10%.

1. Introduction

As the saying goes, firms can be data-rich but information-poor. While this phenomenon can be observed even for ordinary data, it is especially notable for big data. Large amounts of data are produced by, for instance, people’s social activities and various types of equipment, but very little useful information is obtained from them. If these data cannot be processed efficiently, ever larger quantities will accumulate and waste storage space.

With the development of data mining technology, this problem has been readily solved (Shanshan, Jindian, Pengfei, & Hao, 2016). Data mining technologies, such as clustering, classification, and association analysis, make it easier to extract common patterns from data. However, “one person's noise may be another person's signal” (Kriegel, Kröger, & Zimek, 2010). The uncommon patterns in mass data may be highly valuable.

With the arrival of the big data era, data mining has become much more challenging (Kun-Ming, Sheng-Hui, Li-Wei, & Shu-Hao, 2015). Many industries have begun applying big data technology in order to mine more valuable information from big data (García-Recuero, Esteves, & Veiga, 2014). However, because of the limitations imposed by big data’s complex datatypes, known as the variety challenge (Xiaofeng & Xiang, 2013), industries repeatedly build duplicate big data analysis systems, which wastes money. The variety of datatypes seriously tests data mining ability (Xuejiao, Xiaofeng, & Yang, 2013).

Outlier detection, one of the most important data mining methods, can detect uncommon patterns in mass data (Aggarwal, 2015). The most influential definition is that of Hawkins: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980). Outlier detection is increasingly widely used in many fields (Pimentel, Clifton, Clifton, & Tarassenko, 2014), such as credit card fraud detection (Yu & Wang, 2009), public health (Srimani & Koti, 2012), and network intrusion detection (Othman, Bakar, Ibrahim, Hassim, & Ain, 2013).

Among the various outlier detection methods, distance-based algorithms have excellent universality (Pimentel et al., 2014). When a distance-based outlier definition is combined with a purely distance-based detection algorithm, only distance information is used. This approach is known as metric space outlier detection, or MSOD. MSOD has striking advantages in overcoming the variety challenge of big data. Among MSOD methods, index-based outlier detection has a higher detection speed than the others (Bhaduri, Matthews, & Giannella, 2011).
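To illustrate why relying only on distance information gives this universality, the following is a minimal sketch, not the HIOD algorithm. It assumes one common distance-based outlier degree, the distance to the k-th nearest neighbor, which may differ from the definition used in the full article. The function names (`edit_distance`, `knn_outlier_scores`, `top_n_outliers`) are hypothetical. The same brute-force detector runs unchanged on strings and on numbers because it only ever calls the supplied distance function.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, which is a metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def knn_outlier_scores(data: Sequence[T],
                       dist: Callable[[T, T], float],
                       k: int) -> List[float]:
    """Outlier degree of each object: distance to its k-th nearest neighbor.

    Assumption: this is one common distance-based definition, chosen for
    illustration. Brute force, O(n^2) distance calculations.
    """
    scores = []
    for i, x in enumerate(data):
        dists = sorted(dist(x, y) for j, y in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(data: Sequence[T],
                   dist: Callable[[T, T], float],
                   k: int,
                   n: int) -> List[Tuple[T, float]]:
    """Return the n objects with the largest outlier degree."""
    scores = knn_outlier_scores(data, dist, k)
    ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    return [(data[i], scores[i]) for i in ranked[:n]]


if __name__ == "__main__":
    words = ["cat", "bat", "hat", "rat", "mat", "hippopotamus"]
    print(top_n_outliers(words, edit_distance, k=2, n=1))

    values = [1.0, 1.1, 0.9, 1.2, 10.0]
    print(top_n_outliers(values, lambda a, b: abs(a - b), k=2, n=1))
```

Only the `dist` argument changes between the two calls. Index-based methods aim to return the same outliers as this quadratic baseline while computing far fewer distances.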

However, most existing index-based methods still rely on domain-specific information to some degree. Further, certain index-based methods use a pivot technique but provide no pivot selection method and use only a single pivot, which leads to spatial warping. In addition, the triangle inequality of the distance function is not fully exploited. Finally, the sparse regions of the dataset are ignored, so the cutoff value of the outlier degree increases slowly.
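The role of pivots and the triangle inequality can be sketched as follows. This is a generic illustration of pivot-based pruning in metric space, not one of the pruning rules of HIOD, which are not shown in this preview. For any pivot p, the triangle inequality gives |d(x, p) - d(y, p)| <= d(x, y), so precomputed object-to-pivot distances yield a lower bound on d(x, y) without computing it. The helper names (`pivot_table`, `lower_bound`, `count_within_radius`) and the radius-based neighbor count are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pivot_table(data: Sequence[T],
                pivots: Sequence[T],
                dist: Callable[[T, T], float]) -> List[List[float]]:
    """Precompute the distance from every object to every pivot."""
    return [[dist(x, p) for p in pivots] for x in data]


def lower_bound(row_x: Sequence[float], row_y: Sequence[float]) -> float:
    """Largest |d(x, p) - d(y, p)| over all pivots p.

    By the triangle inequality this never exceeds d(x, y).
    """
    return max(abs(a - b) for a, b in zip(row_x, row_y))


def count_within_radius(data: Sequence[T],
                        table: Sequence[Sequence[float]],
                        dist: Callable[[T, T], float],
                        query: int,
                        radius: float) -> Tuple[int, int]:
    """Count objects within `radius` of data[query].

    Returns (neighbor count, number of distance calculations performed).
    Objects whose lower bound already exceeds the radius are skipped.
    """
    count = 0
    computed = 0
    for j, y in enumerate(data):
        if j == query:
            continue
        if lower_bound(table[query], table[j]) > radius:
            continue  # pruned: d(x, y) is at least the bound, so y is too far
        computed += 1
        if dist(data[query], y) <= radius:
            count += 1
    return count, computed


if __name__ == "__main__":
    def abs_diff(a: float, b: float) -> float:
        return abs(a - b)

    data = [0.0, 0.5, 1.0, 10.0, 20.0]
    table = pivot_table(data, pivots=[0.0, 20.0], dist=abs_diff)
    print(count_within_radius(data, table, abs_diff, query=0, radius=1.5))
```

With more than one pivot, the bound takes the maximum over pivots and is therefore at least as tight as with a single pivot, which is one reason both the number and the placement of pivots matter.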

To solve these problems, we propose a metric space-based big data abstraction framework. Based on this framework, we propose the Hilbert Index-based Outlier Detection (HIOD) algorithm together with a pivot selection method. Specifically, we make the following contributions:

  1. A metric space-based big data abstraction framework.

  2. A pivot selection method that quickly selects pivots from approximately dense regions and takes the distances between different pivots into consideration.

  3. A Hilbert index-based outlier detection algorithm that examines sparse regions first, so that the cutoff value of the outlier degree rises as early as possible.

  4. Three pruning rules that reduce the number of distance calculations.
