Data Warehouses and Big Data: How to Cope With Data Quality

Hamid Naceur Benkhaled (EEDIS Laboratory, University of Djillali Liabes, Sidi Bel Abbes, Algeria), Djamel Berrabah (EEDIS Laboratory, University of Djillali Liabes, Sidi Bel Abbes, Algeria) and Faouzi Boufares (LIPN Laboratory, Paris13 University, Paris, France)
Copyright: © 2020 | Pages: 13
DOI: 10.4018/IJOCI.2020070101

Abstract

Before the arrival of the Big Data (BD) era, data warehouse (DW) systems were considered the best decision support systems (DSS). DW systems have always helped organizations around the world to analyze their stored data and use it in making critical decisions. However, analyzing and mining data of poor quality can lead to wrong conclusions. Several data quality (DQ) problems can appear during a data warehouse project, such as missing values, duplicate values, integrity constraint issues and more. As a result, organizations around the world are more aware of the importance of data quality and invest a lot of money in managing data quality in DW systems. On the other hand, with the arrival of BD, new challenges have to be considered, such as the need to collect the most recent data and the ability to make real-time decisions. This article provides a survey of the existing techniques to control the quality of the data stored in DW systems and of the new solutions proposed in the literature to meet the new Big Data requirements.

1. Introduction

To best explore the mountains of data that exist within organizations and across the web, data quality is becoming increasingly important. Indeed, data quality is a major issue in an organization and has a significant impact on the quality of its services and on its profitability. Decision-making based on data of poor quality has a negative influence on the activities of organizations. Anomalies are often detected only at the level of data restitution (such as analyses or visualizations), which is too late.

Decision-makers are advised to integrate data from various sources in order to create new data stores, including databases, data warehouses, data marts, data lakes, and master data. In an era of data deluge, data quality is more important than ever (Figure 1). There are multiple data sources: social networks; the web; open data; dark data (dormant data not yet used, much of it unstructured text). Indeed, nowadays, any type of organization needs to integrate data from various distributed sources which are heterogeneous and of varying quality. In most cases, data descriptions in the sources are poor or nonexistent. As a result, the assembled data may be meaningless and the result obtained may contain many anomalies. The problems that lead to poor quality of the manipulated data could be the following: (i) heterogeneity of the data when integrated; (ii) different levels of data description (little or no description at all) and (iii) lack of semantics (Zaidi et al., 2015).

As mentioned above, data warehouse (DW) systems are among the technologies used to integrate data. Before the arrival of the Big Data (BD) era, data warehouse systems were considered the most powerful decision support systems. DW systems have always helped organizations around the world to exploit their stored data and use it to gain an advantage over their competitors in the market.

Although DW systems have proven their standing over the years, they can sometimes fail to meet the stakeholders' expectations or to support the right decisions. Indeed, many DW projects have been cancelled due to data quality (DQ) problems. DQ problems can appear in different forms, such as missing values, duplicate records (Benkhaled et al., 2019; Ouhab et al., 2017) or referential integrity problems. Poor quality data causes losses estimated at about $600 million annually in the USA alone (information reported by the Data Warehousing Institute). This Institute also mentioned that 15% to 20% of the data stored in most enterprises is of poor quality (Geiger, 2004). Consequently, company leaders can lose their trust in DW systems and look for other solutions, since DQ problems can increase the cost of data warehouse projects.
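The three DQ problems named above (missing values, duplicate records and referential integrity violations) can be illustrated with a minimal sketch. It is not a technique taken from the surveyed literature; the function `profile_quality`, the `sales` and `customers` data and all field names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def profile_quality(fact_rows, dimension_keys, key_field, foreign_key):
    """Count three common data quality problems in a list of dict rows.

    All names here are hypothetical; this is an illustrative sketch only.
    """
    # Missing values: any field that is None or an empty string.
    missing = sum(
        1 for row in fact_rows for value in row.values() if value in (None, "")
    )
    # Duplicate records: extra rows that share the same business key.
    key_counts = Counter(row[key_field] for row in fact_rows)
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)
    # Referential integrity: foreign keys with no matching dimension row.
    orphans = sum(
        1
        for row in fact_rows
        if row[foreign_key] is not None and row[foreign_key] not in dimension_keys
    )
    return {"missing": missing, "duplicates": duplicates, "orphans": orphans}


# Hypothetical dimension keys and fact rows.
customers = {"C1", "C2"}
sales = [
    {"sale_id": 1, "customer_id": "C1", "amount": 10.0},
    {"sale_id": 2, "customer_id": "C3", "amount": 5.0},   # unknown customer
    {"sale_id": 2, "customer_id": "C3", "amount": 5.0},   # repeated sale_id
    {"sale_id": 3, "customer_id": "C2", "amount": None},  # missing amount
]
report = profile_quality(sales, customers, "sale_id", "customer_id")
print(report)
```

In a real DW project such checks would typically run during the ETL process, before the data is loaded, rather than on in-memory lists.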

However, with the arrival of the Big Data era, adapting traditional DW systems to the new Big Data challenges has become one of the main active research fields. Most Big Data applications (such as the Internet of Things) need to perform near-real-time analysis, which was not the case with traditional DW systems (Meehan et al., 2017). This particularly concerns the ETL (extraction, transformation, and loading) process, which is considered the most time-consuming step of the DW life cycle. Previously, DW systems were not impacted by the latency of ETL, since near-real-time decisions were not a necessity (Berkani et al., 2013).

Even with the new requirements of Big Data, some researchers in the DW community still defend DW systems over BD. A DW gives users the possibility of executing many queries on the same stored data, which is not possible with BD because the data is not stored. If a user wants to execute another query, a Data Lake that stores the most important unstructured data should be implemented (Feugey, 2016).
