Modeling the KDD Process

Vasudha Bhatnagar (University of Delhi, India) and S. K. Gupta (IIT, Delhi, India)
Knowledge Discovery in Databases (KDD) is classically defined as the “nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large databases” (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). The recently developed KDD technology is based on a well-defined, multi-step “KDD process” for discovering knowledge from large data repositories. The basic problem addressed by the KDD process is one of mapping low-level data (operational in nature and too voluminous to interpret directly) to a more abstract form (a descriptive approximation or model of the process that generated the data) or a more useful form (for example, a predictive model) (Fayyad, Piatetsky-Shapiro & Smyth, 1996b). The KDD process evolves through the proactive intervention of domain experts, data mining analysts and end-users. It is a ‘continuous’ process in the sense that its results may fuel new motivations for further discoveries (Chapman et al., 2000). Modeling and planning of the KDD process has been recognized as a new research field (John, 2000). In this chapter we provide an introduction to the process of knowledge discovery in databases (the KDD process) and present some models, conceptual as well as practical, for carrying out the KDD endeavor.
Generic Steps of the KDD Process

Figure 1 shows a simple model of the KDD process that exhibits the logical sequencing of the various process steps. The model lets the data miner map the logical process steps (P1 to P7) directly to the corresponding physical computing processes.

Figure 1. Steps of the KDD process

The data flows in a straightforward manner from each process step to the subsequent step, as shown by the solid lines. The dashed lines show the control flow and indicate optional iteration of process steps after the discovered knowledge has been evaluated. We describe below the generic steps of a KDD process.
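Before turning to the individual steps, the data flow and control flow described for Figure 1 can be illustrated with a minimal sketch. It is not taken from the chapter: the names run_kdd, make_evaluator and max_rounds are hypothetical, the seven steps are placeholders that pass the data through unchanged, and the rule that evaluation may name an earlier step to resume from is an assumption about how the optional iteration could be driven.

```python
from typing import Any, Callable, List, Optional, Tuple

# A step is a (label, function) pair; the labels P1..P7 mirror Figure 1.
Step = Tuple[str, Callable[[Any], Any]]


def run_kdd(
    data: Any,
    steps: List[Step],
    evaluate: Callable[[Any], Optional[int]],
    max_rounds: int = 3,
) -> Tuple[Any, List[str]]:
    """Run the steps in order, then optionally iterate.

    Data flow (solid lines): the output of step i is the input of step i + 1.
    Control flow (dashed lines): after the last step, `evaluate` returns
    either None (stop) or the index of an earlier step to resume from.
    """
    # inputs[i] holds the input of step i; inputs[-1] is the final output.
    inputs: List[Any] = [data] + [None] * len(steps)
    trace: List[str] = []
    start = 0
    for _ in range(max_rounds):
        for i in range(start, len(steps)):
            label, fn = steps[i]
            inputs[i + 1] = fn(inputs[i])
            trace.append(label)
        restart = evaluate(inputs[-1])
        if restart is None:
            break
        start = restart
    return inputs[-1], trace


def make_evaluator(restarts: List[int]) -> Callable[[Any], Optional[int]]:
    """Hypothetical evaluator: requests the given restarts once each, then stops."""
    pending = list(restarts)

    def evaluate(_knowledge: Any) -> Optional[int]:
        return pending.pop(0) if pending else None

    return evaluate


# Seven placeholder steps that leave the data unchanged (assumption).
steps: List[Step] = [(f"P{k}", lambda d: d) for k in range(1, 8)]

# One full pass, then a single iteration resuming at P4 (index 3).
result, trace = run_kdd([1, 2, 3], steps, make_evaluator([3]))
```

Caching the input of every step is a design choice made here so that an iteration can resume from an intermediate step without recomputing the earlier ones; a real KDD workflow might instead persist intermediate data sets between steps.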

