Big Data Mining and Analytics

Carson Kai-Sang Leung (The University of Manitoba, Canada)
Copyright: © 2014 | Pages: 10
DOI: 10.4018/978-1-4666-5202-6.ch030
Introduction

Data mining and analytics aim to analyze valuable data (such as shopper market basket data) and to extract implicit, previously unknown, and potentially useful information from them. Due to advances in technology, high volumes of valuable data (such as streams of banking, financial, and marketing data) are generated in various real-life business applications in modern organizations and society. This leads us into the new era of Big Data (Madden, 2012; Mishne, Dalton, Li, Sharma, & Lin, 2013; Suchanek & Weikum, 2013). Intuitively, Big Data are interesting high-velocity, high-value, and/or high-variety data with volumes beyond the ability of commonly used software to capture, manage, and process within a tolerable elapsed time. Hence, new forms of data processing are needed to enable enhanced decision making, insight, knowledge discovery, and process optimization. This need drives and motivates research and practice in business analytics and optimization, which require techniques such as Big Data mining and analytics, business process optimization, and applied business statistics, as well as business intelligence solutions and information systems. Developing systematic or quantitative processes to mine and analyze Big Data allows us to continuously or iteratively explore, investigate, and understand past business performance so as to gain new insight and drive business planning.

Over the past few years, several algorithms have been proposed that use the MapReduce model, which mines the search space with distributed or parallel computing, for different Big Data mining and analytics tasks (Luo, Ding, & Huang, 2012; Shi, 2012; Shim, 2012; Condie, Mineiro, Polyzotis, & Weimer, 2013; Kumar, Niu, & Ré, 2013). One such task is frequent pattern mining, which discovers interesting knowledge in the form of frequently occurring sets of merchandise items or events. In this chapter, we focus mainly on frequent pattern mining from Big Data with MapReduce.
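To make the idea of counting frequent itemsets with "map" and "reduce" functions concrete, the following is a minimal single-machine sketch in Python. It is not any of the cited algorithms: it simply enumerates every k-item subset of each transaction in the map step and sums the counts in the reduce step. The function names (map_phase, shuffle, reduce_phase, mine_frequent_itemsets) and the sample data are hypothetical, and a real deployment would run the map and reduce steps on separate workers of a MapReduce framework.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Map: emit (itemset, 1) for every k-item subset of one transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(itemset, counts, minsup):
    """Reduce: sum the counts and keep the itemset only if it is frequent."""
    support = sum(counts)
    return (itemset, support) if support >= minsup else None


def mine_frequent_itemsets(partitions, k, minsup):
    """Simulate one MapReduce job over data partitions (illustrative only)."""
    emitted = []
    for partition in partitions:          # each partition stands for one mapper
        for transaction in partition:
            emitted.extend(map_phase(transaction, k))
    frequent = {}
    for itemset, counts in shuffle(emitted).items():
        result = reduce_phase(itemset, counts, minsup)
        if result is not None:
            frequent[result[0]] = result[1]
    return frequent


# Hypothetical market basket data split across two partitions.
partitions = [
    [{"bread", "milk"}, {"bread", "butter"}],
    [{"bread", "milk", "butter"}, {"milk"}],
]
frequent_pairs = mine_frequent_itemsets(partitions, k=2, minsup=2)
```

With this sample data and a minimum support of 2, the pairs (bread, milk) and (bread, butter) each reach a support of 2, whereas (butter, milk) appears only once and is discarded by the reduce step.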

Background

Since the introduction of the research problem of frequent pattern mining (Agrawal, Imieliński, & Swami, 1993), numerous algorithms have been proposed (Hipp, Güntzer, & Nakhaeizadeh, 2000; Ullman, 2000; Ceglar & Roddick, 2006). Notable ones include the classical Apriori algorithm (Agrawal & Srikant, 1994) and its variants such as the Partition algorithm (Savasere, Omiecinski, & Navathe, 1995). The Apriori algorithm uses a level-wise breadth-first bottom-up approach with a candidate generate-and-test paradigm to mine frequent patterns from transactional databases of precise data. The Partition algorithm divides the databases into several partitions and applies the Apriori algorithm to each partition to obtain patterns that are locally frequent in the partition. As being locally frequent is a necessary condition for a pattern to be globally frequent, these locally frequent patterns are tested to see if they are globally frequent in the databases. To avoid the candidate generate-and-test paradigm, the tree-based FP-growth algorithm (Han, Pei, & Yin, 2000) was proposed. It uses a depth-first pattern-growth (i.e., divide-and-conquer) approach to mine frequent patterns using a tree structure that captures the contents of the databases. By extracting appropriate tree paths, projected databases containing relevant transactions are formed, from which frequent patterns can be discovered.
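The level-wise candidate generate-and-test paradigm described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the published Apriori implementation: the function name apriori, the use of an absolute support count for minsup, and the sample transactions are all chosen here for illustration.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the support of every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Generate: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Test: scan the database to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical transactional database of precise data.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
patterns = apriori(db, minsup=3)
```

In this example every single item and every pair is frequent, but {a, b, c} occurs in only two transactions and is rejected at level 3. The Partition algorithm could reuse the same function by calling it on each partition with a proportionally scaled threshold and then recounting the union of the locally frequent patterns over the whole database.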

In many real-life applications, the available data are not precise data but uncertain data (Chen & Wang, 2011; Tong, Chen, Cheng, & Yu, 2012; Jiang & Leung, 2013; Leung, Cuzzocrea, & Jiang, 2013; Leung & Tanbeer, 2013). Examples include sensor data and privacy-preserving data. Over the past few years, several algorithms—such as the tree-based UF-growth algorithm (Leung, Mateo, & Brajczuk, 2008)—have been proposed to mine and analyze these uncertain data.
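For uncertain data, each item in a transaction carries an existential probability, and the support of a pattern is replaced by its expected support. A common formulation, assumed here, treats items within a transaction as independent, so the expected support of an itemset is the sum over all transactions of the product of its items' probabilities. The sketch below computes this quantity directly; it does not build the tree structure used by UF-growth, and the function name and sample probabilities are hypothetical.

```python
from math import prod


def expected_support(itemset, transactions):
    """Sum, over all transactions, the product of the items' probabilities.

    Each transaction is a dict mapping an item to its existential
    probability. Items are assumed to be independent within a transaction.
    """
    total = 0.0
    for t in transactions:
        # A transaction contributes only if it contains every item.
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total


# Hypothetical uncertain database, e.g. noisy sensor readings.
uncertain_db = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "b": 0.5, "c": 1.0},
    {"b": 0.8},
]
exp_sup_ab = expected_support({"a", "b"}, uncertain_db)
```

Here the itemset {a, b} contributes 0.9 × 0.5 from the first transaction and 0.6 × 0.5 from the second, for an expected support of about 0.75. A pattern is then considered frequent when its expected support meets the user-specified minimum support threshold.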

Key Terms in this Chapter

Frequent Itemset (or Frequent Pattern): Is an itemset or a pattern whose actual support (or expected support) equals or exceeds the user-specified minimum support threshold.

Itemset: Is a set of items.

Business Intelligence: Is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information.

MapReduce: Is a high-level programming model, which uses the “map” and “reduce” functions, for processing high volumes of data.

Data Mining: Refers to non-trivial extraction of implicit, previously unknown and potentially useful information from data.

Big Data: Are interesting high-velocity, high-value, and/or high-variety data with volumes beyond the ability of commonly used software to capture, manage, and process within a tolerable elapsed time. Big Data necessitate new forms of processing to deliver high veracity (and low vulnerability) and to enable enhanced decision making, insight, knowledge discovery, and process optimization.

Business Analytics: Refers to the development of skills and technologies, as well as applications and practices, for continuous iterative exploration, investigation, and understanding of past business performance to gain new insight and drive business planning. It aims to develop quantitative processes for a business to reach optimal decisions and to perform business knowledge discovery.

Frequent Pattern Mining: Searches and analyzes high volumes of valuable data for implicit, previously unknown, and potentially useful patterns consisting of frequently co-occurring events or objects. It helps discover frequently collocated trade fairs and frequently purchased bundles of merchandise items.