Literature Review
The type and amount of data in human society are growing at an astonishing speed, driven by emerging services such as cloud computing, the Internet of Things, and social networks; the era of big data (BD) has arrived. Data has evolved from a simple object of transactions into a fundamental resource, and how to better manage and utilize big data has attracted much attention (Xiaofeng, & Xiang, 2013). The term ‘big data’ has been in use since the 1990s, with some crediting the American computer scientist John Mashey with popularizing it. BD usually refers to data sets whose size exceeds the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. The BD philosophy encompasses unstructured, semi-structured, and structured data; however, the main focus is on unstructured data. BD “size” is a constantly moving target, as data creation is dynamic and non-stop; it can range from a few dozen terabytes to many zettabytes of data. BD requires a set of techniques and technologies, with new forms of integration, to reveal insights from data sets that are diverse, complex, and of massive scale. Thus, BD is where parallel computing tools are needed to handle data, and this represents a distinct and clearly defined shift in the computer science used, via parallel programming theories, and the loss of some of the guarantees and capabilities of Codd's relational model. BD uses mathematical analysis, optimization, inductive statistics, and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors. Figure 1, below, gives the basic framework of BD processing.
Figure 1.
Big data processing basic framework. Source: Adapted from Xiaofeng & Xiang (2013)
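The inductive-statistics step described above, inferring a law such as a regression from a large, noisy, low-information-density data set, can be illustrated with a minimal sketch. This is a hypothetical example, not from the source: it generates synthetic data with a weak linear signal buried in noise and recovers the relationship by ordinary least squares.

```python
import random

random.seed(42)

# Synthetic "low information density" data: a weak linear signal
# (y = 0.5 * x + 3.0) buried in heavy Gaussian noise.
n = 10_000
xs = [random.uniform(0, 100) for _ in range(n)]
ys = [0.5 * x + 3.0 + random.gauss(0, 10) for x in xs]

# Closed-form simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n

slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# With enough data, the estimates converge on the underlying law
# despite the low per-record information content.
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")
```

At big-data scale the same estimation would be distributed across a cluster (e.g., as a map-reduce over partial sums), which is precisely why the parallel computing tools mentioned above are needed.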