Introduction
The area of Big Data (BD) is currently the subject of intense investigation in the academic literature, driven by the growth of data made available on the Web and collected by fixed and mobile sensors. According to Dumbill (2013), “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Another issue that has attracted the attention of scholars and practitioners in recent years is Data Quality (DQ), a multifaceted concept whose definition draws on several dimensions. Data quality has been investigated mainly for data represented in the relational model, traditionally adopted in Database Management Systems (for an extensive survey of DQ in the relational model see Batini & Scannapieco, 2006), notwithstanding the growing relevance of, and concerns about, non-standard data such as text, music, design information and pictures (Rose, 1991). More recently, the variety of data types arising from linguistic and visual information, used and disseminated through social networks, enterprise and public sector information systems, and the Web, has prompted in-depth investigation of how data quality concepts can be extended to such a vast set of data types, encompassing, e.g., semi-structured texts, maps, images, and linked open data. Thus, the information growth resulting from the BD phenomenon has deeply affected the diversity of available data types, the proliferation of data sources, and the consequent expansion of application domains.
Taking the above issues into account, in this paper we investigate how the multifaceted issues making up DQ have evolved from the traditional domain of databases to the domain of BD. The first coordinate we chose for analyzing the evolution of the DQ concept is (i) the data types adopted in BD; in particular, we analyze semi-structured texts, maps, and linked open data. We then consider two further coordinates: (ii) the sources that originate BD, and (iii) the application domains in which BD are used or investigated. As to sources, we focus on sensors and sensor networks; as to application domains, we focus on official statistics.
The article is organized as follows. First, we describe the methodology followed in the paper, which adopts an integrative review perspective for a theoretical purpose. Then we present the conceptual framework for analyzing the evolution of DQ issues from relational databases to the diverse data types, sources and application domains considered in the remainder of the paper. As for DQ issues, we consider dimensions grouped into dimension clusters, adopting the clusters proposed in Batini, Palmonari, and Viscusi (2012). The three BD coordinates, namely data types, sources and application domains, are analyzed in terms of their structural characteristics. Subsequently, the evolution paths dealt with in the paper are introduced. Each path traces the evolution of a dimension cluster from the relational domain to the issues raised by the BD coordinates introduced above (i.e., data types, sources and application domains), and shows how the evolution of a given dimension can be interpreted a posteriori according to the structural characteristics considered. A final general discussion of DQ dimension clusters and BD coordinates concludes the paper.