Maintaining high data quality is critical to organizational success. Firms, aware of the consequences of poor data quality, have adopted methodologies and policies for measuring, monitoring, and improving it (Redman, 1996; Eckerson, 2002). Today’s quality measurements are typically driven by physical characteristics of the data (e.g., item counts, time tags, or failure rates) and assume an objective quality standard, disregarding the context in which the data is used. The alternative is to derive quality metrics from data content and evaluate them within specific usage contexts. The former approach is termed structure-based (or structural), and the latter content-based (Ballou and Pazer, 2003). In this chapter we propose a novel framework to assess data quality within specific usage contexts and link it to data utility (or utility of data): a measure of the value contribution associated with data within specific usage contexts. Our utility-driven framework addresses the limitations of structural measurements and offers alternative measurements for evaluating completeness, validity, accuracy, and currency, as well as a single measure that aggregates these data quality dimensions.
Data quality is defined as fitness-for-use – the extent to which the data matches the data consumer’s needs (Redman, 1996). However, in real-life settings, a single definition of data quality may fail to support data management needs (Strong et al., 1997; Lee and Strong, 2003). Kulikowski (1971) suggests that data quality should be measured as a multi-dimensional vector that reflects different aspects of quality. Wang and Strong (1996) show that data consumers perceive quality as having multiple dimensions such as accuracy, completeness, and currency. Quality, along each dimension, is often measured as a number between 0 (poor) and 1 (perfect). Pipino et al. (2002) identify three archetypes for quality metrics that adhere to this scale: (a) a ratio of the actually obtained values to the expected values, (b) the minimum or maximum among multiple aggregated scores, and (c) a weighted average of multiple factors. Different measurement methods have been proposed along these archetypes (e.g., Redman, 1996; Pipino et al., 2002). Such measurements can be stored as quality metadata (Shankaranarayanan and Even, 2004), presented by software tools (Wang, 1998; Shankaranarayanan and Cai, 2006), tied to visual representations of data processes (Shankaranarayanan et al., 2003), and used for process optimization (Ballou et al., 1998).
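The three metric archetypes of Pipino et al. (2002) can be illustrated with a short sketch. The function names, the example counts, and the weights below are illustrative assumptions, not prescriptions from the literature; all scores are on the 0-to-1 scale described above.

```python
# Sketch of the three quality-metric archetypes from Pipino et al. (2002).
# All metrics are scaled to [0, 1]: 0 = poor quality, 1 = perfect quality.

def simple_ratio(desired_outcomes: int, total_outcomes: int) -> float:
    """Archetype (a): ratio of actually obtained to expected values,
    e.g., non-null records over total records."""
    return desired_outcomes / total_outcomes if total_outcomes else 0.0

def min_operator(*dimension_scores: float) -> float:
    """Archetype (b): the minimum (a conservative choice) among
    multiple aggregated scores."""
    return min(dimension_scores)

def weighted_average(scores_and_weights: list[tuple[float, float]]) -> float:
    """Archetype (c): weighted average of multiple factors;
    the weights are assumed to sum to 1."""
    return sum(score * weight for score, weight in scores_and_weights)

# Example: completeness measured as a simple ratio (illustrative counts).
completeness = simple_ratio(desired_outcomes=950, total_outcomes=1000)  # 0.95
```

In practice the choice of archetype depends on the dimension being measured and on how conservative the aggregation should be; the min operator, for instance, lets a single weak underlying indicator dominate the overall score.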
Some quality dimensions (e.g., accuracy) are viewed as impartial (Wang and Strong, 1996) - i.e., the perception of quality along these dimensions is based on the data itself, regardless of usage. Others are viewed as contextual: the perception of quality along them depends on the usage context (e.g., relevance). Pipino et al. (2002), however, argue that the same dimension can be measured impartially and/or contextually, depending on the purpose the measurement serves. As both impartial and contextual assessments contribute to the overall perception of data quality, it is important to address both. We posit that within a usage context, the business value of data resources is reflected more by the data content and less by physical characteristics. Hence, we suggest that content-based measurement of quality is more appropriate for contextual assessment. We use utility functions (Ahituv, 1980) to map impartial information characteristics (here, data content and the presence of defects) onto tangible values within specific usages. Utility mapping has been used to examine tradeoffs between quality dimensions and optimize their configuration (Ballou et al., 1998; Ballou and Pazer, 1995, 2003).
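As a sketch of the utility-mapping idea, the example below assigns a monetary utility to a customer record as a function of its content and defects rather than of physical record counts. The attribute names, the usage scenario (a direct-mail campaign), and the dollar values are hypothetical assumptions for illustration only, not part of the framework itself.

```python
# Hypothetical utility mapping: a record's value within one usage context
# (a direct-mail campaign) depends on its content and the presence of defects.
# All attribute names and dollar values below are illustrative assumptions.

def record_utility(record: dict) -> float:
    """Map a record's content to a tangible value (here, in dollars)."""
    if record.get("email") is None:        # missing contact info: record is unusable
        return 0.0
    utility = 5.0                          # assumed base value of a reachable customer
    if record.get("purchase_history"):     # content that enables targeting
        utility += 3.0
    if record.get("address_verified"):     # a defect-free address adds delivery value
        utility += 2.0
    return utility

# The dataset's utility aggregates over its records, so the same physical
# record count can yield very different utility depending on content.
dataset = [
    {"email": "a@x.com", "purchase_history": True, "address_verified": True},
    {"email": None},
]
total_utility = sum(record_utility(r) for r in dataset)  # 10.0
```

Note that both records count equally under a structural item-count metric, while the content-based utility mapping values one at 10.0 and the other at 0.0.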
Key Terms in this Chapter
Contextual Data Quality Assessment: Perception and measurement of data quality that reflects its fitness for use within a specific usage context. Contextual assessment may be affected by usage characteristics such as the task, the organizational domain, the timing of usage, and/or the expertise of the individual user.
Structure-Based (or Structural) Data Quality Assessment: Perception and measurement of data quality that is driven by physical characteristics of the data, such as item counts, time tags, or failure rates. Structure-based assessment typically assumes an absolute and objective quality standard.
Completeness: A data quality dimension that reflects the inclusion of all the anticipated data, and the extent to which the exclusion of certain items affects fitness for use.
Accuracy: A data quality dimension that reflects the conformance of data items to a baseline that is perceived to be correct, and the extent to which conflicts with that baseline affect fitness for use. A baseline could be, for example, the real-world value that a data item reflects, a value in another dataset that was reliably validated, or a targeted calculation result.
Content-Based Data Quality Assessment: Perception and measurement of data quality that accounts for content - the actual values stored. Content-based assessment typically links content to a specific usage and does not assume an absolute and objective quality standard.
Validity: A data quality dimension that reflects the conformance of data items to their corresponding value domains, and the extent to which the non-conformance of certain items affects fitness for use. For example, a data item is invalid if it is defined to be an integer but contains a non-integer value, is linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed.
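The three validity violations listed in this definition - type non-conformance, a value outside its finite domain, and a disallowed NULL - can be expressed as a short check. The rule-specification format below is an illustrative assumption.

```python
# Sketch of the three validity rules from the definition above:
# type conformance, membership in a finite value domain, and NULL handling.
# The parameter names and rule format are illustrative assumptions.

def is_valid(value, expected_type=None, domain=None, nullable=True) -> bool:
    if value is None:
        return nullable                    # NULL where NULL may not be allowed
    if expected_type is not None and not isinstance(value, expected_type):
        return False                       # e.g., non-integer in an integer field
    if domain is not None and value not in domain:
        return False                       # value outside its finite value domain
    return True

# Examples mirroring the definition:
is_valid("12", expected_type=int)                    # False: non-integer value
is_valid("XL", domain={"S", "M", "L"})               # False: not in the value set
is_valid(None, nullable=False)                       # False: NULL not allowed
is_valid(7, expected_type=int, domain=range(0, 10))  # True
```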
Data Utility: A measure of the business value attributed to data within specific usage contexts. Utility is typically, but not necessarily, measured in monetary units.
Impartial Data Quality Assessment: Perception and measurement of data quality that is based on the data itself, regardless of how that data is used.
Currency: A data quality dimension that reflects the degree to which data items are recent and up to date, and the extent to which the non-recency of certain items affects fitness for use.
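Currency is often operationalized as a decreasing function of a data item's age relative to its volatility, in the spirit of the timeliness measure of Ballou et al. (1998). The linear decay and the shelf-life value below are illustrative assumptions.

```python
from datetime import date

# Age-based currency sketch: the score decays linearly from 1 (fresh) to 0
# once the data's age reaches its assumed "shelf life" (volatility).
# The shelf-life value used in the example is an illustrative assumption.

def currency_score(last_updated: date, as_of: date, shelf_life_days: float) -> float:
    age_days = (as_of - last_updated).days
    return max(0.0, 1.0 - age_days / shelf_life_days)

# A customer address (assumed shelf life of ~2 years) updated a year ago
# scores roughly 0.5; the same address after 3 years would score 0.
score = currency_score(date(2023, 6, 1), date(2024, 6, 1), shelf_life_days=730)
```

A content-based refinement would let the shelf life vary with the data itself (e.g., addresses of frequent movers age faster), tying the currency measurement to the usage context as the framework proposes.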