Extracting Functional Dependencies in Large Datasets Using MapReduce Model

K. Amshakala, R. Nedunchezhian, M. Rajalakshmi
Copyright: © 2014 | Pages: 17
DOI: 10.4018/ijiit.2014070102


Over the last few years, data have been generated in large volumes and at an increasing rate, and the need for large-scale data processing systems has grown remarkably. As data sets grow, data quality tends to suffer. Functional dependencies, which represent semantic constraints in data, are important for data quality assessment. Running functional dependency discovery algorithms on a single computer is slow and laborious for large data sets. MapReduce provides an enabling technology for large-scale data processing, and its open-source Hadoop implementation has given researchers a powerful tool for tackling large-data problems in a distributed manner. The objective of this study is to extract functional dependencies between attributes from large datasets using the MapReduce programming model. Attribute entropy is used to measure inter-attribute correlations and is exploited to discover functional dependencies hidden in the data.

1. Introduction

In today’s information era, large-scale data processing is crucial for any successful business operation. When the data being processed are very large, a single machine can no longer handle them (Melo et al., 2013; Sommer et al., 2012), and parallel and distributed computing is required. One approach to working with such huge amounts of data is to rely on a parallel database system. This approach, studied for decades, includes well-known techniques developed and enhanced over time. Parallel database systems feature sophisticated query optimizers and a rich runtime environment that supports efficient query execution, but they run only on expensive high-end servers. When the data volumes to be stored and processed reach a point where clusters of hundreds or thousands of machines are required, parallel database solutions become prohibitively expensive (Abouzeid et al., 2009). Worse still, at such a scale many of the primary assumptions of parallel database systems (e.g., fault tolerance) begin to fail, and the conventional solutions are no longer feasible without considerable extensions. The MapReduce programming model introduced by Google (Dean & Ghemawat, 2008) provides a very effective tool for tackling large-scale data problems in a distributed manner, and the paradigm has received extensive attention in recent years. MapReduce is designed to run on clusters of hundreds to thousands of commodity machines connected via a high-bandwidth network, and it exposes a programming model that abstracts distributed group-by-aggregation operations. Beyond that, MapReduce has changed the way computations are organized at a massive scale. It has enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project).
Hadoop is an open-source framework that provides libraries for distributed computing through a simple map/reduce interface, together with the Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/).
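To make the group-by-aggregation abstraction concrete, the following is a minimal single-process sketch of the map, shuffle and reduce steps in Python. It does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`), the `city` attribute and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, mimicking what a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values grouped under it."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical job: count how often each value of the "city" attribute occurs.
def city_mapper(record):
    yield (record["city"], 1)

def count_reducer(key, values):
    return sum(values)

# Illustrative records only, not data from the article.
records = [{"city": "C1"}, {"city": "C2"}, {"city": "C1"}]
counts = reduce_phase(shuffle(map_phase(records, city_mapper)), count_reducer)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data across the network; the sketch only shows how a computation is expressed as per-record emission followed by per-key aggregation.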

As voluminous data are generated at a faster rate, the quality of the data is compromised to a certain extent. Poor data quality reduces the value of data and can sometimes be harmful. Functional dependencies (FDs) (Silberschatz et al., 2009), which represent semantic constraints between attributes, are important for measuring the amount of inconsistency and redundancy in data and help in assessing data quality. In addition to the functional dependencies identified by the data designers, there are FDs hidden in the data values. Extracting FDs from large datasets requires effective record processing methods. An FD is a pattern hidden in data, and finding it requires an understanding of the structural properties of the data. Existing FD discovery methods such as FD_MINE (Yao & Hamilton, 2008), TANE (Huhtala et al., 1999) and FUN (Novelli & Cicchetti, 2001) are designed to work on single-processor machines, and such approaches do not scale when FD discovery is run on large datasets.
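As a reminder of what is being searched for, an FD X → Y holds in a relation when any two records that agree on the attributes X also agree on the attributes Y. The sketch below checks a single candidate FD by a direct scan in Python; it is a brute-force illustration of the definition, not one of the cited discovery algorithms, and the function name `fd_holds`, the attribute names and the sample rows are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every combination of lhs values maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value;
        # any later, different rhs value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Illustrative rows only, not data from the article.
rows = [
    {"zip": "Z1", "city": "C1", "name": "A"},
    {"zip": "Z1", "city": "C1", "name": "B"},
    {"zip": "Z2", "city": "C2", "name": "A"},
]

zip_determines_city = fd_holds(rows, ["zip"], ["city"])    # holds in this sample
name_determines_city = fd_holds(rows, ["name"], ["city"])  # violated by name "A"
```

Checking one candidate is cheap, but the number of candidate attribute sets grows exponentially with the number of attributes, which is one reason single-machine discovery becomes costly on large datasets.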

Information-theoretic measures have been used in many fields to measure the importance of attributes and the relationships between them based on the structuredness of data (Yao, 2003). Attribute entropy and mutual information (Cover & Thomas, 1999) are the main measures used to estimate attribute importance and the interdependencies between attributes, respectively.

The rationale for using attribute entropy to measure inter-attribute dependencies is that it:

  • Quantifies information content.

  • Provides a domain-independent way to reason about structure in data.

  • Summarizes the probability distribution of an attribute's values in a single number.

  • Indicates the nature of value distribution of attributes, like uniformity, evenness and structuredness.
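The properties listed above can be illustrated with a short Python sketch that computes the Shannon entropy of an attribute set from value frequencies. It also uses a standard information-theoretic identity: X → Y holds exactly when H(X, Y) = H(X), that is, when the conditional entropy H(Y | X) is zero. This is offered as an assumption about how entropy can be tied to FDs, not as the authors' exact procedure; the function names, the tolerance `tol` and the sample rows are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, attrs):
    """Shannon entropy (in bits) of the joint value distribution of attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def fd_holds_by_entropy(rows, lhs, rhs, tol=1e-9):
    """X -> Y holds exactly when H(X, Y) == H(X), i.e. H(Y | X) == 0."""
    return abs(entropy(rows, lhs + rhs) - entropy(rows, lhs)) <= tol

# Illustrative rows only, not data from the article.
rows = [
    {"dept": "D1", "floor": "F1", "grade": "G1"},
    {"dept": "D1", "floor": "F1", "grade": "G2"},
    {"dept": "D2", "floor": "F2", "grade": "G1"},
    {"dept": "D2", "floor": "F2", "grade": "G2"},
]

h_dept = entropy(rows, ["dept"])                                 # two equally likely values
dept_to_floor = fd_holds_by_entropy(rows, ["dept"], ["floor"])   # floor adds no information
grade_to_dept = fd_holds_by_entropy(rows, ["grade"], ["dept"])   # dept is not fixed by grade
```

Because entropy depends only on value counts, it can be computed from per-key frequency totals, which is the kind of aggregation a map/reduce job produces naturally.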
