Missing Data in OLAP Cubes: Challenges and Strategies

Monica Chiarini Tremblay, Alan R. Hevner
Copyright: © 2021 | Pages: 28
DOI: 10.4018/JDM.2021070101

Abstract

Online analytical processing (OLAP) engines display aggregated data to help business analysts compare data, observe trends, and make decisions. Issues of data quality and, in particular, issues with missing data affect the quality of the information. Key decision-makers who rely on these data typically make decisions based on what they assume to be all the available data. The authors investigate three approaches to dealing with missing data: 1) ignoring missing data, 2) showing missing data explicitly (e.g., as unknown data values), and 3) designing mitigation algorithms for missing data (e.g., allocating missing data into known value categories). The authors evaluate these approaches with focus groups and controlled experiments. When decision-makers are informed about missing data using the approaches in the research, the authors find that they often alter their decisions and adjust their decision confidence; individual differences in tolerance for ambiguity and pre-existing omission bias in the decision context influence their decisions.

Introduction

Data warehouse and Online Analytical Processing (OLAP) engines display aggregated data to help business analysts compare data, observe trends, and make decisions (Lee, Pipino, Strong, & Wang, 2004). Despite the maturity of the field, issues of data quality and, in particular, issues with missing data affect the quality of the displayed information (Nord, Nord, & Xu, 2005; Bimonte, Ren, & Koueya, 2020). There are many reasons that data could be missing (Singh & Singh, 2010), and a myriad of approaches are taken during Extract, Transform and Load (ETL) to deal with missing data (Fatima, Nazir, & Khan, 2017). Key decision-makers who rely on these data are generally not informed about the steps taken to deal with missing data issues, and they will typically make decisions based on what they assume to be all the available data (Chee, Yeoh, Gao, & Richards, 2014).

Research is mixed on the best ways to tag data quality issues, such as missing data (Bai, Nunez, & Kalagnanam, 2012; Chengalur-Smith, Ballou, & Pazer, 1999; Fisher, Chengalur-Smith, & Ballou, 2003; Nicolaou & McKnight, 2006; Parssian, 2006; Price & Shanks, 2011). Moreover, little research has been done to explore the quality of decisions made in the presence of such data quality problems (Subramanian & Wang, 2019). In our investigations, we explore the challenges and the strategies for managing missing data in large repositories. We refer to this challenge as the unallocated data problem. Our focus in this research is on data completeness and not data accuracy (e.g., Gelman, 2012; Chua & Storey, 2016).

Our personal experiences in the design and implementation of a health care data warehouse for public health decision making motivate this research. During a field study of health care business analysts, we first noticed problems with decision-making due to missing data in OLAP cubes. We were studying the implementation and use of a new OLAP interface on a data warehouse used by knowledge workers at a regional health planning agency in Florida (citation removed for blind review). This data warehouse integrated fine-grained event data such as vital statistics (birth and death records), hospital discharge data, and free-standing clinic data, along with several more detailed disease registries. Because of increased needs for data to serve a variety of planning and assessment purposes, it was decided that the health planners would benefit enormously from direct access to the data warehouse through OLAP interfaces and analysis tools. This field study offered a rich understanding of the health planners' tasks and their use of OLAP data in these tasks.

Observations from this field study led to the identification of critical missing data problems. For example, we struggled with data received from hospitals. Hospital discharges occurred continuously, but not all hospitals chose to send their data at the same rate: hospitals collect data continuously, but they differ in their batching and transmission strategies. Some hospitals would send incomplete data, filling in information with later transmissions. Sometimes the data were set to null values because of privacy and security issues (such as sensitive information on the location of AIDS cases). In another example, we struggled with data on the status of board certifications of physicians, nurses, and other healthcare providers. Central repositories typically maintain this information, which is monitored by regulating agencies. Often these data were incomplete because of significant lag time in updates or inserts to these central repositories. Thus, certification data for specific providers were unknown.

Unallocated data in OLAP cubes occur when values are missing in the variables that a query uses for joining and grouping, with the result that some data groupings are not allocated to any of the possible cells in a data cube. Using a design science research approach, we address the following research questions:

  1) How can we present information on Unallocated Data in OLAP cubes to a decision-maker?

  2) How does information on Unallocated Data affect a decision-maker’s decision?
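
The unallocated data problem described above, and the three approaches named in the abstract, can be illustrated with a minimal sketch. The records, category names, and function names below are hypothetical and are not taken from the study. In particular, the proportional rule used for the third approach is only one possible allocation rule, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical discharge records: (hospital, county, discharges).
# county is None where the value was suppressed or not yet transmitted.
RECORDS = [
    ("H1", "County A", 120),
    ("H2", "County A", 80),
    ("H3", "County B", 200),
    ("H4", None, 60),
    ("H5", None, 40),
]

UNKNOWN = "Unknown"


def cube_ignore(records):
    """Approach 1: drop rows whose grouping value is missing."""
    totals = defaultdict(float)
    for _, county, n in records:
        if county is not None:
            totals[county] += n
    return dict(totals)


def cube_explicit(records):
    """Approach 2: keep rows with a missing value in an explicit 'Unknown' cell."""
    totals = defaultdict(float)
    for _, county, n in records:
        key = county if county is not None else UNKNOWN
        totals[key] += n
    return dict(totals)


def cube_allocate(records):
    """Approach 3 (assumed rule): spread the unknown total across the known
    categories in proportion to their known totals."""
    cells = cube_explicit(records)
    unknown = cells.pop(UNKNOWN, 0.0)
    known_total = sum(cells.values())
    if known_total == 0:
        # Nothing to allocate against, so the unknown total stays unallocated.
        return {UNKNOWN: unknown} if unknown else {}
    return {k: v + unknown * v / known_total for k, v in cells.items()}


if __name__ == "__main__":
    grand_total = sum(n for _, _, n in RECORDS)
    for name, fn in [
        ("ignore", cube_ignore),
        ("explicit", cube_explicit),
        ("allocate", cube_allocate),
    ]:
        cells = fn(RECORDS)
        print(name, cells, "sum =", sum(cells.values()), "of", grand_total)
```

With these assumed records, ignoring missing values produces cells that sum to 400 of the 500 discharges. The explicit approach shows the remaining 100 in an Unknown cell, and the proportional rule raises each known county to 250. A decision-maker who sees only the first view has no indication that a fifth of the data is absent.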
