Big Data Analytics Using Local Exceptionality Detection

Martin Atzmueller (University of Kassel, Germany), Dennis Mollenhauer (University of Kassel, Germany) and Andreas Schmidt (University of Kassel, Germany)
Copyright: © 2016 |Pages: 18
DOI: 10.4018/978-1-5225-0293-7.ch007

Abstract

Large-scale data processing is one of the key challenges in many application domains, especially considering ubiquitous and big data. In these contexts, subgroup discovery provides a flexible method for both data analysis and knowledge discovery. Subgroup discovery and pattern mining are important descriptive data mining tasks. They can be applied, for example, to obtain an overview of the relations in the data, for automatic hypothesis generation, and for a number of knowledge discovery applications. This chapter presents the novel SD-MapR algorithmic framework for large-scale local exceptionality detection, implemented using subgroup discovery on the Map/Reduce framework. We describe the basic algorithm in detail and provide an experimental evaluation using several real-world datasets. We tackle two algorithmic variants focusing on simple and more complex target concepts; for the latter, we present an implementation of exceptional model mining on large attributed graphs. The results of our evaluation show the scalability of the presented approach for large datasets.

Introduction

With the exponential growth of the available data, e.g., due to ubiquitous applications and services, large-scale data mining poses many challenges. Efficient and scalable methods need to be developed that, on the one hand, can handle such large data and, on the other hand, support an efficient and scalable analysis approach. In this chapter, we focus on subgroup discovery for local exceptionality detection on large datasets. During data exploration, the data analyst might, for example, be interested in partitions of the data that show some specific exceptional characteristics, and in the respective descriptions of these partitions. An exploratory analysis approach for identifying such a subset of the data with a concise description is given by subgroup discovery (e.g., Klösgen 1996; Wrobel 1997; Atzmueller 2015); here, we also specifically consider the variant of exceptional model mining (Leman 2008; Duivesteijn 2016) as an approach for modeling complex exceptionality criteria. Intuitively, subgroup discovery aims at identifying such an exceptional subgroup of the whole dataset, e.g., one showing a notably different distribution of some target concept, where the subgroup should typically also be as large as possible. Exceptional model mining focuses especially on complex target properties; it considers specific model classes, such as a correlation model between two variables, linear regression, or complex graph properties.

Overall, subgroup discovery is a broadly applicable data mining technique which can be used for descriptive as well as predictive data mining. We can obtain an overview of the relations in the data, for example, for automatic hypothesis generation, for attribute construction, or for obtaining a rule-based classification model. The basic idea is to identify subgroups covering instances of the dataset which show some interesting, i.e., unexpected, deviating, or exceptional, behavior concerning a given target concept. This notion can be flexibly formalized using a quality function. We can estimate, for example, the deviation of the mean of a numeric target concept in the subgroup compared to the whole dataset; more complex functions utilizing graph-structured data consider, e.g., the density of a certain subgraph compared to the expected density of a null model given by a random edge assignment approach.
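To make the idea of a quality function more concrete, the following minimal sketch scores a subgroup by the deviation of its target mean from the mean of the whole dataset, weighted by the subgroup size. The function name `mean_deviation_quality`, the size exponent `a`, the attribute names, and the toy rows are assumptions made for illustration; they are not taken from the chapter, which may use different quality functions.

```python
from statistics import mean


def mean_deviation_quality(rows, target, description, a=0.5):
    """Score a subgroup as n^a * (subgroup mean - overall mean).

    rows        -- list of dicts, one per instance (hypothetical layout)
    target      -- name of the numeric target attribute
    description -- dict of attribute/value pairs that define the subgroup
    a           -- assumed size weight: 0 ignores size, 1 weights it linearly
    """
    # Instances covered by the subgroup description (a conjunction of selectors).
    covered = [
        r for r in rows
        if all(r.get(attr) == value for attr, value in description.items())
    ]
    if not covered:
        return 0.0
    overall_mean = mean(r[target] for r in rows)
    subgroup_mean = mean(r[target] for r in covered)
    return len(covered) ** a * (subgroup_mean - overall_mean)


# Toy data, invented for this example only.
rows = [
    {"region": "north", "device": "phone", "noise": 70.0},
    {"region": "north", "device": "tablet", "noise": 66.0},
    {"region": "south", "device": "phone", "noise": 50.0},
    {"region": "south", "device": "tablet", "noise": 54.0},
]

if __name__ == "__main__":
    print(mean_deviation_quality(rows, "noise", {"region": "north"}, a=1.0))
```

Under these assumptions, a larger exponent `a` favors large subgroups, while a smaller one favors subgroups with a strong deviation even if they cover few instances.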

In this chapter, we present the novel SD-MapR algorithmic framework for large-scale subgroup discovery: Based on the data projection techniques of the FP-Growth (Han et al. 2000) and the Parallel FP-Growth (PFP) (Li et al. 2008) algorithms for large-scale frequent pattern mining, SD-MapR employs the Map/Reduce framework (Dean & Ghemawat 2008) for large-scale data processing. The basic idea of SD-MapR is the construction of projected databases such that the subgroup discovery task can be independently deployed on several computation clusters in a divide-and-conquer manner, inspired by the PFP algorithm. For local exceptionality detection, we propose the efficient subgroup discovery algorithms SD-Map* (Atzmueller & Lemmerich 2009), GP-Growth (Lemmerich et al. 2012), and COMODO (Atzmueller et al. 2015a), which can be applied for instantiating SD-MapR. Specifically, we present adaptations of the SD-Map* and COMODO algorithms for implementing SD-MapR.
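The projection idea can be illustrated with a small, single-process sketch in the style of the PFP map step: selectors are assigned to groups, and each instance is projected once per group so that every group can later be mined independently. This is not the SD-MapR implementation; the names `map_instance` and `shuffle`, the `rank` and `group_of` tables, and the toy instances are all hypothetical, and the local mining step that a reducer would run is omitted.

```python
from collections import defaultdict


def map_instance(selectors, target, rank, group_of):
    """Emit (group id, (projected selectors, target)) pairs for one instance.

    rank     -- assumed global order of selectors (0 = most frequent)
    group_of -- assumed assignment of each selector to a group id
    """
    # Keep only known selectors, ordered by their global rank.
    items = sorted((s for s in set(selectors) if s in rank), key=rank.__getitem__)
    seen = set()
    # Scan from the lowest-ranked selector; each group receives the longest
    # prefix that ends in one of its own selectors, and only once.
    for pos in range(len(items) - 1, -1, -1):
        gid = group_of[items[pos]]
        if gid not in seen:
            seen.add(gid)
            yield gid, (items[: pos + 1], target)


def shuffle(pairs):
    """Group the mapped pairs by group id, as the framework's shuffle phase would."""
    grouped = defaultdict(list)
    for gid, projected in pairs:
        grouped[gid].append(projected)
    return dict(grouped)


# Hypothetical selector order and grouping for a toy dataset.
rank = {"a": 0, "b": 1, "c": 2, "d": 3}
group_of = {"a": 0, "b": 0, "c": 1, "d": 1}
instances = [(["a", "c", "d"], 1), (["b", "a"], 0)]

projected = shuffle(
    pair
    for selectors, target in instances
    for pair in map_instance(selectors, target, rank, group_of)
)

if __name__ == "__main__":
    for gid, database in sorted(projected.items()):
        # A reducer would run a local subgroup discovery algorithm here.
        print(gid, database)
```

In this sketch the target value travels with each projected instance, since a local subgroup discovery step needs it to evaluate a quality function on its projected database.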

The remainder of this chapter is structured as follows: In the next section, we introduce some preliminaries on local exceptionality detection using subgroup discovery and exceptional model mining, the respective state-of-the-art algorithms, and the Map/Reduce framework. After that, we describe the novel SD-MapR algorithmic framework in detail. Next, we provide a comprehensive evaluation of the presented algorithms using ubiquitous data, and show their scalability and performance on large-scale datasets. Finally, we conclude with a summary and point out interesting options for future work.
