Parallel Outlier Detection for Streamed Data Using Non-Parameterized Approach

Harshad Dattatray Markad (Zeal College of Engineering and Research, Pune, India) and S. M. Sangve (Zeal College of Engineering and Research, Pune, India)
Copyright: © 2017 | Pages: 13
DOI: 10.4018/IJSE.2017070102


Outlier detection is used in applications such as fraud detection, network analysis, network traffic monitoring, manufacturing, and environmental software. The data streams these applications generate are continuous and change over time, which makes it difficult to detect outliers in data that is both huge and unbounded. Because streamed data arrives in real time, it is impractical to store it all and analyze it afterwards for abnormal behavior. Limited storage therefore forces the data to be analyzed in real time and processed on a first-come, first-served (FCFS) basis. Results about abnormal behavior must be produced quickly, within a limited time frame, over a potentially infinite set of data streams arriving over the network. Detecting outliers in real time is a challenging task, and this paper addresses it with the processing power of graphics processing units (GPUs). The algorithm used in this paper relies on a kernel function to accomplish the task and produces timely results on high-speed, multi-dimensional data. The method speeds up outlier detection by a factor of 20, and the speedup grows with the number of data attributes and the input data rate.
Article Preview


Outlier detection is the identification of a pattern that differs from the rest of the normal data set. It is usually performed to identify defective data or abnormal behavior. Sometimes it is performed to analyze the data for security purposes or scientific interest. Instead of discarding the outlying data, researchers sometimes capture its pattern using data mining techniques so that the same pattern can be detected easily in the future.

Data arrives at the system in the form of chunks or streams. Data in chunks usually takes the form of datagrams, which are similar in size, so outliers in them can be identified fairly easily. When the data arrives as a stream, it becomes difficult to analyze on a regular basis because the stream is continuous and an enormous volume of data enters the system. Because streamed data is continuous in nature, it is impractical to store it in memory and analyze it thoroughly for abnormal behavior. This gave rise to stream outlier detection mechanisms that work on a one-pass basis. The streamed data is fed through the outlier detection mechanism so that outliers are detected in a single pass, which removes the need to store the data in memory and analyze it fragment by fragment.

Although density-based outlier detection approaches have proven accurate, they are also known to be computationally demanding. Therefore, when using kernel density estimation to detect outliers in a high-volume, high-speed data stream, the computation must be sped up to keep pace with the input rate of the stream. Graphics Processing Units (GPUs) are designed to handle highly parallel workloads and to execute thousands of concurrent threads. With the introduction of CUDA (Compute Unified Device Architecture) (Nair et al., 2011; Hudlicka, 2011; Zhang et al., 2017), a general-purpose parallel computing architecture and programming model, GPU computing has become increasingly popular in general-purpose data mining applications. To meet the real-time processing requirements of streaming data, we use the parallel processing power of GPUs to accelerate kernel density estimation and produce timely results. Our experimental results show that, compared with a multi-core implementation, the method achieves a 20 times higher speed on real-world datasets.
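To make the idea of density-based scoring concrete, the following is a minimal sketch in plain Python, not the GPU implementation described in the article. It assumes a Gaussian product kernel (the article only says "a kernel function"), and the names `gaussian_kde_density`, `flag_outliers`, `bandwidth` and `ratio`, as well as the rule of flagging points whose density falls below a fraction of the mean density, are illustrative assumptions. The density of each point is computed independently of the others, which is the property that makes this kind of computation a natural fit for one-thread-per-point parallelism.

```python
import math


def gaussian_kde_density(point, window, bandwidth):
    """Estimate the density at `point` from the points in `window`.

    Assumption: a Gaussian product kernel with one shared bandwidth.
    """
    dims = len(point)
    # Normalizing constant of a d-dimensional Gaussian kernel.
    norm = (bandwidth * math.sqrt(2.0 * math.pi)) ** dims
    total = 0.0
    for other in window:
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, other))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(window) * norm)


def flag_outliers(window, bandwidth=1.0, ratio=0.5):
    """Return one boolean per point; True marks a low-density point.

    Assumption: a point is flagged when its density is below
    `ratio` times the mean density of the window.
    """
    # Each density depends only on the point and the shared window,
    # so this list could be computed in parallel, one point per thread.
    densities = [gaussian_kde_density(p, window, bandwidth) for p in window]
    mean_density = sum(densities) / len(densities)
    return [d < ratio * mean_density for d in densities]


if __name__ == "__main__":
    window = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(flag_outliers(window))
```

In this toy window, the four points near the origin support each other's density, while the point at (5, 5) is supported only by itself and is the one flagged.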

In outlier detection, the nature of the data sometimes changes over a prolonged period of time. It is therefore difficult to mark a behavior as an outlier as soon as it is detected. This gave rise to the mechanism of dividing the data into batches, also called windows. Each window, a portion of the data stream, is analyzed for outlier behavior.
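Dividing a stream into windows can be sketched as follows. This is an illustrative example only: it assumes fixed-size, non-overlapping windows, which the article does not specify, and the names `tumbling_windows` and `window_size` are hypothetical. The generator reads the stream once and keeps only the current window in memory.

```python
from itertools import islice


def tumbling_windows(stream, window_size):
    """Yield consecutive, non-overlapping windows from `stream`.

    Assumption: fixed-size windows; the last one may be shorter.
    """
    iterator = iter(stream)
    while True:
        # Pull at most `window_size` items; earlier items are not retained.
        window = list(islice(iterator, window_size))
        if not window:
            return
        yield window


if __name__ == "__main__":
    for window in tumbling_windows(range(7), 3):
        print(window)
```

Each yielded window can then be handed to whatever per-window detection step is in use, so the full stream never has to be stored.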

Once the detection algorithm finds a candidate outlier, it checks the series of windows before and after the window containing it before marking the behavior as an outlier, which makes the decision easier. In other words, the data over a period of time has to be considered before reaching a final decision about abnormal behavior.

As shown in Figure 1, there are two windows containing clusters named A, A’, B, B’ and H. The first window contains two clusters, A and A’. Because these clusters are separated from each other and each has a sufficient number of members, neither can be considered an outlier.

The same holds for the second window, which contains B and B’. It would be premature to consider either of them an outlier, because the behavior of the network, or the data generated over it, varies over time. This gave rise to the more dynamic solution of cumulative results, in which the results of both windows are considered together to identify outliers or abnormal behavior.

Figure 1. Cluster and cumulative results


Once the cumulative result is obtained, the cluster named H is neither grouped with any other cluster nor has enough members to be treated as a cluster in its own right. This leads to the conclusion that H represents abnormal behavior identified in the streamed data and can be considered an outlier.
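The cumulative decision can be illustrated with a small sketch. It is one possible reading of the description, not the authors' algorithm: clusters are assumed to be summarized as a name, a centroid and a member count, clusters from different windows are assumed to be grouped when their centroids lie within a hypothetical `merge_radius`, and a group is reported as an outlier when it contains a single cluster with fewer than `min_members` members. The coordinates and counts in the example are invented for illustration.

```python
import math


def merge_window_clusters(clusters, merge_radius):
    """Group per-window clusters whose centroids are close together.

    `clusters` is a list of (name, centroid, count) tuples.
    Assumption: greedy grouping against a count-weighted group centroid.
    """
    groups = []
    for name, centroid, count in clusters:
        for group in groups:
            if math.dist(centroid, group["centroid"]) <= merge_radius:
                total = group["count"] + count
                # Update the group centroid as a count-weighted average.
                group["centroid"] = tuple(
                    (g * group["count"] + c * count) / total
                    for g, c in zip(group["centroid"], centroid)
                )
                group["count"] = total
                group["names"].append(name)
                break
        else:
            # No nearby group was found, so this cluster starts a new one.
            groups.append(
                {"names": [name], "centroid": tuple(centroid), "count": count}
            )
    return groups


def cumulative_outliers(clusters, merge_radius, min_members):
    """Return name lists of groups that stayed alone and stayed small."""
    return [
        group["names"]
        for group in merge_window_clusters(clusters, merge_radius)
        if len(group["names"]) == 1 and group["count"] < min_members
    ]


if __name__ == "__main__":
    # Invented summaries for the clusters named in Figure 1.
    clusters = [
        ("A", (0.0, 0.0), 40),
        ("A'", (10.0, 10.0), 35),
        ("B", (0.5, 0.2), 38),
        ("B'", (10.3, 9.8), 42),
        ("H", (5.0, -6.0), 3),
    ]
    print(cumulative_outliers(clusters, merge_radius=1.0, min_members=10))
```

With these invented values, A groups with B and A’ groups with B’, while H remains alone with only three members and is the only cluster reported.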
