Introduction
Outlier detection is the identification of patterns that differ from the rest of a normal data set. It is usually performed to identify defective data or abnormal behavior, and sometimes to analyze data for security or scientific purposes. Instead of discarding outlying data, researchers sometimes encode its pattern in a data mining model so that the same pattern can be detected easily in the future.
Data arrives at a system either in chunks or as streams. Data in chunks usually takes the form of datagrams, which are similar in size, so outliers in them can be detected and identified easily. When data arrives as a stream, regular analysis becomes difficult because the stream is continuous and an enormous volume of data enters the system. Because streamed data is continuous in nature, it is nearly impractical to store it in memory and analyze it thoroughly for abnormal behavior. This gave rise to stream outlier detection mechanisms that work on a one-pass basis: the streamed data is fed through the detection mechanism so that outliers are detected in a single pass, which removes the need to store the data in memory and analyze it fragment by fragment.
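As a minimal sketch of the one-pass idea, the following Python example keeps only a running mean and variance (Welford's update) and flags a value that lies far from the history seen so far. It is not the detection method of this article. The class name `OnePassDetector`, the `threshold` of three standard deviations, and the `warmup` count are assumptions chosen for illustration.

```python
import math


class OnePassDetector:
    """Flags values far from the running mean without storing the stream."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # assumed cutoff, in standard deviations
        self.warmup = warmup        # assumed number of points seen before flagging
        self.n = 0                  # number of points seen so far
        self.mean = 0.0             # running mean
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Score x against the history so far, then fold it into the statistics."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                is_outlier = True
        # Welford's update: constant memory, one pass over the data.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier


def detect(stream, threshold=3.0, warmup=5):
    """Return the values of an iterable that the detector flags."""
    detector = OnePassDetector(threshold, warmup)
    return [x for x in stream if detector.update(x)]


print(detect([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0, 10.1]))
```

Each value is examined once and then discarded, so memory use does not grow with the length of the stream. In this sketch a flagged value is still folded into the statistics; a detector could equally exclude it, which is a design choice left open here.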
Although density-based outlier detection approaches have proven accurate, they are also known to be computationally demanding. Therefore, when using kernel density estimation to detect outliers in a high-volume, high-speed data stream, we need to speed up the computation to keep up with the input rate of the stream. Graphics Processing Units (GPUs) are designed to handle highly parallel workloads and to execute thousands of concurrent threads. With the introduction of CUDA (Compute Unified Device Architecture) (Nair et al., 2011; Hudlicka, 2011; Zhang et al., 2017), a general-purpose parallel computing architecture and programming model, GPU computing has become increasingly popular in general-purpose data mining applications. To meet the real-time processing requirements of streaming data, we use the parallel processing power of GPUs to accelerate kernel density estimation and produce results in a timely manner. Our experimental results show that, compared with a multi-core implementation, the method ran 20 times faster on real-world datasets.
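To illustrate why kernel density estimation is costly and why it parallelizes well, the following CPU-only Python sketch scores every point of a one-dimensional window with a Gaussian kernel. It is not the GPU implementation described here. The function names, the `bandwidth`, and the `ratio` cutoff relative to the mean density are assumptions made for the example.

```python
import math


def gaussian_kde(x, sample, bandwidth):
    """Kernel density estimate at x from a 1-D sample, using a Gaussian kernel."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)


def kde_outliers(window, bandwidth=0.5, ratio=0.5):
    """Return the points whose density is below `ratio` times the mean density."""
    # One density per point; each is an independent sum over the whole window.
    densities = [gaussian_kde(x, window, bandwidth) for x in window]
    cutoff = ratio * sum(densities) / len(densities)  # assumed cutoff rule
    return [x for x, d in zip(window, densities) if d < cutoff]


print(kde_outliers([1.0, 1.1, 0.9, 1.2, 0.8, 5.0]))
```

Scoring a window of n points takes on the order of n squared kernel evaluations, but the density of each point does not depend on the density of any other point. That independence is what allows the work to be spread across many concurrent GPU threads, for example one thread per point.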
In outlier detection, the nature of the data sometimes changes over a prolonged period of time. It is therefore difficult to mark a behavior as an outlier as soon as it is detected. This gave rise to the mechanism of dividing the data into batches, also called windows. Each window, a portion of the data stream, is analyzed for outlier behavior.
Once the detection algorithm finds a candidate outlier, it checks the series of windows before and after the window containing it before marking the behavior as an outlier, which makes the decision easier. In other words, the data set over a period of time has to be considered before reaching a final decision about abnormal behavior.
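The windowing step can be sketched as follows, under simplifying assumptions: the stream is one-dimensional, windows have a fixed `size`, and a value counts as supported when another value lies within a hypothetical `radius` of it. A value with no support in its own window is only a candidate, and it is confirmed as an outlier only if the windows immediately before and after also offer no support. The function names are illustrative and are not taken from the article.

```python
def split_windows(stream, size):
    """Divide a list of values into consecutive windows of at most `size` items."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def has_neighbor(x, values, radius):
    """True if some value lies within `radius` of x."""
    return any(abs(x - v) <= radius for v in values)


def confirmed_outliers(stream, size, radius):
    """Return candidates that stay unsupported in the adjacent windows."""
    windows = split_windows(stream, size)
    confirmed = []
    for i, window in enumerate(windows):
        for j, x in enumerate(window):
            others = window[:j] + window[j + 1:]
            if has_neighbor(x, others, radius):
                continue  # supported inside its own window, not a candidate
            # Candidate: consult the windows before and after before deciding.
            adjacent = []
            if i > 0:
                adjacent += windows[i - 1]
            if i + 1 < len(windows):
                adjacent += windows[i + 1]
            if not has_neighbor(x, adjacent, radius):
                confirmed.append(x)
    return confirmed


stream = [1.0, 1.1, 0.9, 5.0, 1.0, 1.2, 5.1, 0.9, 1.1, 1.0, 9.0, 0.9]
print(confirmed_outliers(stream, size=4, radius=0.5))
```

In this example the values 5.0 and 5.1 are each isolated within their own window but support one another across adjacent windows, so neither is confirmed, whereas 9.0 has no support anywhere nearby. The sketch holds the whole list for simplicity; on a real stream only the current window and its two neighbors would need to be buffered.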
As shown in Figure 1, there are two windows containing the clusters A, A’, B, B’ and H. The first window contains two clusters, A and A’. Because these clusters are separated from each other and each has a sufficient number of members, neither can be considered an outlier.
The same holds for the second window, which contains B and B’. It would be premature to consider either of them an outlier, because the behavior of the network, or the data generated over it, varies. This gave rise to the more dynamic solution of cumulative results, where the results of both windows are considered together to identify outlier or abnormal behavior.
Figure 1. Cluster and Cumulative results
Once the cumulative result is obtained, the cluster H neither merges with other clusters nor has enough members to be treated as a cluster. This leads to the final conclusion that H is abnormal behavior identified in the streamed data and can be considered an outlier.
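The cumulative decision can be sketched in Python under stated assumptions. Each cluster is summarized by a name, a one-dimensional centroid, and a member count; clusters from different windows are merged when their centroids lie within a hypothetical `merge_radius`; and a cluster is reported as an outlier only if it merged with nothing and has fewer than `min_members` members. The centroids, counts, thresholds, and the placement of H in the second window are invented for illustration and are not values from Figure 1.

```python
def cumulative_clusters(windows, merge_radius):
    """Merge per-window clusters whose centroids lie within `merge_radius`."""
    merged = []  # each group: {"names": [...], "centroid": float, "count": int}
    for window in windows:
        for name, centroid, count in window:
            for group in merged:
                if abs(group["centroid"] - centroid) <= merge_radius:
                    total = group["count"] + count
                    # Weighted average keeps the centroid of the merged group.
                    group["centroid"] = (
                        group["centroid"] * group["count"] + centroid * count
                    ) / total
                    group["count"] = total
                    group["names"].append(name)
                    break
            else:
                # No existing group is close enough: start a new one.
                merged.append({"names": [name], "centroid": centroid, "count": count})
    return merged


def outlier_clusters(windows, merge_radius, min_members):
    """Names of clusters that merged with nothing and have too few members."""
    return [
        group["names"][0]
        for group in cumulative_clusters(windows, merge_radius)
        if len(group["names"]) == 1 and group["count"] < min_members
    ]


# Hypothetical summaries: (name, centroid, member count).
window_1 = [("A", 1.0, 40), ("A'", 5.0, 35)]
window_2 = [("B", 1.2, 38), ("B'", 5.3, 30), ("H", 9.0, 3)]
print(outlier_clusters([window_1, window_2], merge_radius=0.5, min_members=10))
```

With these made-up numbers, A merges with B and A’ merges with B’, while H stays alone with three members and is the only cluster reported.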