1. Introduction
A common saying holds that firms can be data-rich but information-poor. This phenomenon can be observed even for ordinary data, but it is especially notable for big data. People’s social activities and various types of equipment, for instance, produce large amounts of data, yet notably little useful information is obtained from them. If these data cannot be processed efficiently, ever larger quantities will accumulate and waste storage space.
The development of data mining technology has gone a long way toward solving this problem (Shanshan, Jindian, Pengfei, & Hao, 2016). Data mining technologies, such as clustering, classification, and association analysis, make it easier to extract common patterns from data. However, “one person's noise may be another person's signal” (Kriegel, Kröger, & Zimek, 2010): the uncommon patterns in mass data may be highly valuable.
With the arrival of the big data era, data mining has become much more challenging (Kun-Ming, Sheng-Hui, Li-Wei, & Shu-Hao, 2015). Many industries have begun applying big data technology in order to mine more valuable information from big data (García-Recuero, Esteves, & Veiga, 2014). However, big data’s complex datatypes impose limitations, a problem also called the variety challenge (Xiaofeng & Xiang, 2013). As a result, industries frequently duplicate the work of building big data analysis systems, which wastes money. The variety of datatypes seriously tests data mining ability (Xuejiao, Xiaofeng, & Yang, 2013).
Outlier detection, one of the most important data mining methods, can detect uncommon patterns in mass data (Aggarwal, 2015). The most influential definition is that of Hawkins: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980). Outlier detection is increasingly widely used in many fields (Pimentel, Clifton, Clifton, & Tarassenko, 2014), such as credit card fraud detection (Yu & Wang, 2009), public health (Srimani & Koti, 2012), and network intrusion detection (Othman, Bakar, Ibrahim, Hassim, & Ain, 2013).
Among the various outlier detection methods, distance-based algorithms are highly general (Pimentel et al., 2014). When a distance-based outlier definition is combined with a detection algorithm that is itself entirely distance-based, only distance information is used. This approach is known as metric space outlier detection, or MSOD. Because it requires nothing from the data beyond a distance function, MSOD has striking advantages in overcoming the variety challenge of big data. Among MSOD methods, index-based outlier detection has a higher detection speed than the others (Bhaduri, Matthews, & Giannella, 2011).
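To make the idea of using "only distance information" concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes one common distance-based outlier degree, namely the distance to the k-th nearest neighbor, and the names `knn_outlier_degrees` and `edit_distance` are hypothetical helpers introduced for illustration. The same detection function runs unchanged on numbers and on strings, because it never looks at anything except the supplied metric.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def knn_outlier_degrees(
    data: Sequence[T], dist: Callable[[T, T], float], k: int
) -> List[float]:
    """Outlier degree of each object: distance to its k-th nearest neighbor.

    Brute force, O(n^2) distance computations; only `dist` touches the data.
    """
    if not 1 <= k < len(data):
        raise ValueError("k must satisfy 1 <= k < len(data)")
    degrees = []
    for i, x in enumerate(data):
        others = sorted(dist(x, y) for j, y in enumerate(data) if j != i)
        degrees.append(others[k - 1])
    return degrees


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Two datatypes, one detector: only the metric changes.
numbers = [1.0, 1.1, 0.9, 1.2, 8.0]
words = ["cat", "cart", "car", "bat", "elephant"]

num_degrees = knn_outlier_degrees(numbers, lambda a, b: abs(a - b), k=2)
word_degrees = knn_outlier_degrees(words, edit_distance, k=2)

# The object with the largest degree is the strongest outlier candidate.
print(numbers[num_degrees.index(max(num_degrees))])
print(words[word_degrees.index(max(word_degrees))])
```

Swapping in a different datatype only requires a different `dist`, which is the property that makes a metric space formulation attractive for the variety challenge.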
However, most existing index-based methods rely on domain-specific information to some degree. Further, certain index-based methods use a pivot technique but provide no pivot selection method and use only a single pivot, leading to spatial warping. In addition, they do not make full use of the triangle inequality of the distance function. Finally, they ignore the sparse regions of the dataset, so the cutoff value of the outlier degree increases slowly.
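The role of pivots and the triangle inequality can be illustrated with a generic bound. For any pivot p, the triangle inequality gives |d(q, p) - d(o, p)| <= d(q, o), so precomputed pivot distances can rule out an object o without computing d(q, o) at all. The sketch below is an illustration under that assumption only; it is not one of the pruning rules of the proposed algorithm, and `pivot_lower_bound` and `range_query` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pivot_lower_bound(dq: Sequence[float], do: Sequence[float]) -> float:
    """Lower bound on d(q, o) from pivot distances (needs at least one pivot).

    dq[i] = d(q, pivot_i) and do[i] = d(o, pivot_i). Each pivot yields the
    bound |dq[i] - do[i]|; several pivots let us keep the tightest one.
    """
    return max(abs(a - b) for a, b in zip(dq, do))


def range_query(
    query: T,
    data: Sequence[T],
    pivots: Sequence[T],
    dist: Callable[[T, T], float],
    radius: float,
) -> Tuple[List[T], int]:
    """Objects within `radius` of `query`, plus the number of query-to-object
    distances that actually had to be computed."""
    # A real index would build this object-to-pivot table once and reuse it.
    table = [[dist(o, p) for p in pivots] for o in data]
    dq = [dist(query, p) for p in pivots]

    hits, computed = [], 0
    for o, do in zip(data, table):
        if pivot_lower_bound(dq, do) > radius:
            continue  # pruned: d(query, o) is provably larger than radius
        computed += 1
        if dist(query, o) <= radius:
            hits.append(o)
    return hits, computed


data = [0.0, 1.0, 2.0, 10.0, 11.0, 50.0]
hits, computed = range_query(
    1.5, data, pivots=[0.0, 50.0], dist=lambda a, b: abs(a - b), radius=1.0
)
print(hits, computed)
```

With a single pivot the bound can be loose for objects that lie at similar distances from that pivot but far from each other; taking the maximum over several well-separated pivots tightens it.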
To solve these problems, we propose a metric space-based big data abstraction framework. Based on this framework, we propose the Hilbert Index Outlier Detection algorithm together with a pivot selection method. Specifically, we make the following contributions.
1. A metric space-based big data abstraction framework.
2. A pivot selection method that quickly selects pivots from approximately dense regions and takes the distances between different pivots into consideration.
3. A Hilbert-based outlier detection algorithm that detects sparse regions first, so that the cutoff value of the outlier degree is raised as early as possible.
4. Three pruning rules that reduce the number of distance calculations.
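Why raising the cutoff value early matters can be shown with a generic top-n detector. The sketch below is not the Hilbert-based algorithm; it is a plain nested loop that assumes the outlier degree is the distance to the k-th nearest neighbor, and that the cutoff is the n-th largest degree found so far. While an object's neighbors are scanned, the current k-th smallest distance is an upper bound on its final degree, so the scan can stop as soon as that bound falls below the cutoff. The function name `top_n_outliers` and the hand-written processing orders are assumptions made for illustration; how sparse regions are actually located is not shown here.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_n_outliers(
    data: Sequence[T],
    dist: Callable[[T, T], float],
    k: int,
    n: int,
    order: Iterable[int],
) -> Tuple[List[Tuple[float, int]], int]:
    """Top-n (degree, index) pairs and the number of distance computations.

    Assumes 1 <= k < len(data) and that `order` visits every index once.
    """
    top: List[Tuple[float, int]] = []  # min-heap holding the n best so far
    cutoff = 0.0                       # n-th largest degree found so far
    computed = 0
    for i in order:
        nearest: List[float] = []      # negated k smallest distances (max-heap)
        pruned = False
        for j, y in enumerate(data):
            if j == i:
                continue
            d = dist(data[i], y)
            computed += 1
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # -nearest[0] is an upper bound on the final degree of object i.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        degree = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (degree, i))
        else:
            heapq.heappushpop(top, (degree, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True), computed


# A dense run of 20 points plus two isolated points (indices 20 and 21).
data = list(range(20)) + [100, 200]
metric = lambda a, b: abs(a - b)

dense_first = list(range(len(data)))
sparse_first = [21, 20] + list(range(20))

result_dense, dense_cost = top_n_outliers(data, metric, 2, 2, dense_first)
result_sparse, sparse_cost = top_n_outliers(data, metric, 2, 2, sparse_first)
print(result_dense == result_sparse, dense_cost, sparse_cost)
```

Both orders return the same outliers, but visiting the isolated points first sets a high cutoff immediately, so the objects in the dense run are discarded after only a few distance computations each.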