Practical experience with data mining has shown that preparing data is the most time-consuming phase of any data mining project. Estimates of the share of time and resources spent on data preparation range from at least 60% to upward of 80% (SPSS, 2002a). Despite this, the task receives too little attention, which perpetuates the idea that the core of the data mining effort is the modeling process rather than all phases of the data mining life cycle. This article presents an overview of the most important issues and considerations in preparing data for data mining.
In this article, we address the issue of data preparation: how to make the data more suitable for data mining. Data preparation is a broad area that comprises a number of different approaches and techniques, interrelated in complex ways. For the purpose of this article, we consider data preparation to include the tasks of data selection, data reorganization, data exploration, data cleaning, and data transformation. These tasks are discussed in detail in subsequent sections.
It is important to note that a well-designed and well-constructed data warehouse (a special database containing data from multiple sources that have been cleaned, merged, and reorganized for reporting and data analysis) may make data preparation faster and less problematic. However, a data warehouse is not necessary for successful data mining. If the data required for data mining already exist, or can be easily created, then whether a data warehouse exists is immaterial.