1. Introduction
In today’s information era, large-scale data processing is crucial for any successful business operation. When the volume of data being processed is very large, it becomes impossible for a single machine to handle (Melo et al., 2013; Sommer et al., 2012), and parallel and distributed computing is required. One approach to handling such large volumes of data is to rely on a parallel database system. This approach, studied for decades, includes well-known techniques that have been developed and refined over time. Parallel database systems feature sophisticated query optimizers and a rich runtime environment that supports efficient query execution; at the same time, they run only on expensive high-end servers. When the data volumes to be stored and processed reach a point where clusters of hundreds or thousands of machines are required, parallel database solutions become prohibitively expensive (Abouzeid et al., 2009). Worse still, at such a scale many of the primary assumptions of parallel database systems (e.g., fault tolerance) begin to fail, and the conventional solutions are no longer feasible without considerable extensions.

The MapReduce programming model introduced by Google (Dean & Ghemawat, 2008) provides a very effective tool for tackling large-scale data problems in a distributed manner, and the paradigm has received considerable attention in recent years. The MapReduce model is designed to run on clusters of hundreds to thousands of commodity machines connected via a high-bandwidth network, and it exposes a programming model that abstracts distributed group-by-aggregation operations. Beyond that, MapReduce has changed the way computations are organized at a massive scale. It has enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo! and which is now an Apache project. Hadoop provides libraries for distributed computing through a simple map/reduce interface and the Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/).
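The essence of the model can be conveyed with a small, purely illustrative sketch (written in Python rather than Hadoop's Java API) that mimics the three stages of a MapReduce job: a map phase emitting intermediate key/value pairs, a framework-managed group-by-key (shuffle), and a reduce phase that aggregates each group. The function names and the word-count job below are our own illustration of the abstraction, not code from Hadoop itself.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user-defined map function to every input record,
    emitting intermediate (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group intermediate values by key (the framework's group-by step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user-defined reduce function to each key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example job: word count, the canonical group-by-aggregation.
def word_count_map(line):
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

lines = ["hadoop stores data in hdfs", "mapreduce processes data in parallel"]
result = reduce_phase(shuffle(map_phase(lines, word_count_map)), word_count_reduce)
print(result)  # e.g. {'data': 2, 'in': 2, ...}
```

In a real Hadoop deployment the same map and reduce functions would be shipped to the nodes that hold the input splits in HDFS, and the shuffle would move intermediate pairs across the network rather than through an in-memory dictionary.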
As voluminous data are generated at ever faster rates, data quality is compromised to a certain extent. Poor data quality reduces the value of data and can sometimes be harmful. Functional dependencies (FDs) (Silberschatz et al., 2009), which represent semantic constraints between attributes, are important for measuring the amount of inconsistency and redundancy in data and thus help in assessing data quality. In addition to the functional dependencies specified by database designers, there are FDs hidden in the data values themselves. An FD is a pattern hidden in data, and discovering it requires an understanding of the structural properties of the data; extracting FDs from large datasets therefore requires effective record processing methods. Existing FD discovery methods such as FD_MINE (Yao & Hamilton, 2008), TANE (Huhtala et al., 1999) and FUN (Novelli & Cicchetti, 2001) are designed to run on single-processor machines, and such approaches do not scale when FD discovery is applied to large datasets.
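For concreteness, a functional dependency X → Y holds in a relation if any two tuples that agree on X also agree on Y; equivalently, each X-value determines at most one Y-value. The check below is a minimal sketch of that definition on a small in-memory table; it is our own illustration and not taken from any of the cited discovery algorithms, which avoid such brute-force checks over the full attribute lattice.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`.

    rows: list of dicts mapping attribute name -> value.
    lhs, rhs: tuples of attribute names.
    The FD holds iff every combination of lhs values maps to exactly
    one combination of rhs values.
    """
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False  # two tuples agree on lhs but differ on rhs
        seen[x] = y
    return True

# Illustrative relation: zip code determines city, but city does not determine zip.
emp = [
    {"zip": "600001", "city": "Chennai", "name": "A"},
    {"zip": "600001", "city": "Chennai", "name": "B"},
    {"zip": "600042", "city": "Chennai", "name": "C"},
]
print(fd_holds(emp, ("zip",), ("city",)))  # True
print(fd_holds(emp, ("city",), ("zip",)))  # False
```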
Information-theoretic measures have been used in many fields to measure the importance of attributes and the relationships between them, based on the structuredness of the data (Yao, 2003). Attribute entropy and mutual information (Cover & Thomas, 1999) are the key measures used to estimate attribute importance and the interdependencies between attributes, respectively.
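These are the standard textbook quantities: attribute entropy H(A) = -Σ_a p(a) log p(a) and mutual information I(A;B) = H(A) + H(B) - H(A,B), estimated here from observed value frequencies. The following sketch (our own illustration, not code from any cited work) computes both for small in-memory columns of a relation.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy H(A) = -sum_a p(a) * log2 p(a) of an attribute,
    estimated from its observed value frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a_values, b_values):
    """Empirical mutual information I(A;B) = H(A) + H(B) - H(A,B)."""
    joint = list(zip(a_values, b_values))
    return entropy(a_values) + entropy(b_values) - entropy(joint)

# Illustrative columns of a relation.
city = ["Chennai", "Chennai", "Delhi", "Mumbai", "Delhi", "Chennai"]
zone = ["South",   "South",   "North", "West",   "North", "South"]

print(round(entropy(city), 3))                   # H(city)
print(round(mutual_information(city, zone), 3))  # I(city; zone) equals H(zone) here, since city determines zone
```

The connection to FDs is direct: when an FD city → zone holds, knowing the city leaves no uncertainty about the zone, so the mutual information attains its maximum value H(zone).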
The rationale for using attribute entropy to measure inter-attribute dependencies is that it:
- Quantifies information content.
- Provides a domain-independent way to reason about structure in data.
- Captures the probability distribution of the values of an attribute in a single value.
- Indicates the nature of the value distribution of attributes, such as uniformity, evenness and structuredness (see the sketch below).
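As a quick illustration of the last point (again a sketch of our own, not drawn from the cited references), an attribute whose k distinct values occur with equal frequency attains the maximum entropy log2(k), while a heavily skewed attribute scores far lower, so a single number already signals how evenly the values are distributed.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy in bits, estimated from value frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

uniform = ["A", "B", "C", "D"] * 25      # 4 distinct values, evenly spread over 100 rows
skewed  = ["A"] * 97 + ["B", "C", "D"]   # same 4 values, heavily skewed

print(round(entropy(uniform), 3))  # 2.0  -> maximum log2(4) for a uniform distribution
print(round(entropy(skewed), 3))   # ~0.24 -> low entropy signals a skewed distribution
```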