Statistical Data Editing

Statistical Data Editing

Claudio Conversano (University of Cagliari, Italy) and Roberta Siciliano (University of Naples, Federico II, Italy)
Copyright: © 2009 |Pages: 6
DOI: 10.4018/978-1-60566-010-3.ch280
OnDemand PDF Download:
$37.50

Abstract

Statistical Data Editing (SDE) is the process of checking and correcting data for errors. Winkler (1999) defines it the set of methods used to edit (clean-up) and impute (fill-in) missing or contradictory data. The result of SDE is data that can be used for analytic purposes. Editing literature goes back to 60’s with the contributions of Nordbotten (1965), Pritzker et al. (1965) and Freund and Hartley (1967). A first mathematical formalization of the editing process is in Naus et al. (1972), who introduce a probabilistic criterion for the identification of records (or the part of them) that failed the editing process. A solid methodology for generalized editing and imputation systems is developed in Fellegi and Holt (1976). The great break in rationalizing the process came as a direct consequence of the PC evolution in the 80’s: Editing started to be performed on-line on PCs even during the interview and by the respondent in computer assisted self-interviewing (CASI) models of data collection (Bethlehem et al., 1989). Nowadays, SDE is a research topic in academia and statistical agencies. The European Economic Commission periodically organizes a workshop on the subject concerning both scientific and managerial aspects of SDE (www.unece.org/stats).
Chapter Preview
Top

Background

Before the computers advent, editing was performed by large groups of persons undertaking very simple checks and detecting only a small fraction of errors. The computers evolution allowed survey designers and managers to review all records by consistently applying even sophisticated checks to detect most of the errors in the data that could not be found manually. The focus of both methodologies and applications was on the possibilities of enhancing the checks and of applying automated imputation rules to rationalize the process.

SDE Process

Statistical organizations periodically perform a SDE process. It begins with data collection. An interviewer can quickly examine the respondent answers and highlight gross errors. Whenever data collection is performed using a computer, more complex edits can be stored in it in advance and can be applied to data just before their transmission to a central database. In such cases, the core of editing activity is performed after completing data collection. Nowadays, any modern editing process is based on the a-priori specification of a set of edits, i.e., logical conditions or restrictions on data values. A given set of edits is not necessarily correct: important edits may be omitted and conceptually wrong, too restrictive or logically inconsistent edits may be included. The extent of these problems is reduced by a subject-matter expert edits specification. Problems are not eliminated, however, because many surveys involve large questionnaires and require the complex specification of hundreds of edits. As a check, a proposed set of edits is applied on test data with known errors before application on real data. Missing edits or logically inconsistent ones, however, may not be detected at this stage. Problems in the edits, if discovered during the actual editing or even after it, cause editing to start anew after their correction, leading to delays and incurring larger costs than expected. Any method or procedure which would assist in the most efficient specification of edits would therefore be welcome.

The final result of a SDE process is the production of clean data and the indication of the underlying causes of errors in the data. Usually, an editing software is able to produce reports indicating frequent errors in the data. The analysis of such reports allows to investigate the data error generation causes and to improve the results of future surveys in terms of data quality. Elimination of sources of errors in a survey allow a data collector agency to save money.

SDE Activities

SDE concerns two aspects of data quality; (1) Data Validation: the correction of logical errors in the data; (2) Data Imputation: the imputation of correct values once errors in data have been localized. Whenever missing values appear in data, missing data treatment is part of the data imputation process to be performed in the SDE framework.

Complete Chapter List

Search this Book:
Reset