The vast amounts of digital information stored in databases and other repositories make finding useful knowledge a challenge. Traditional methods for turning data into knowledge rely on manual analysis and reach their limits at this scale; computer-based methods are therefore needed. Knowledge Discovery in Databases (KDD) is the semi-automatic, nontrivial process of identifying valid, novel, potentially useful, and understandable knowledge (in the form of patterns) in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). KDD is an iterative and interactive process with several steps: understanding the problem domain, data preprocessing, pattern discovery, and pattern evaluation and usage. In the pattern discovery step, Data Mining (DM) techniques are applied.
The Traditional Framework for DB Querying
Research on query languages and associated evaluation techniques has a long tradition in the database area. Several query languages, such as SQL, OQL, and XQuery, have been proposed. They enable the user to retrieve data from a database and filter these data according to specific selection criteria. A query Q is traditionally evaluated in four steps. First, Q is analyzed syntactically and semantically to check its syntax and to verify that the schema elements referenced in Q exist in the database schema. Second, Q is translated into an expression in a query algebra, represented as a query tree QT. Third, QT is optimized using heuristics and cost-based functions to devise an execution plan P with minimal cost. Finally, P is executed to obtain the results.
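The four evaluation steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy schema and data, the dictionary form of the query (syntactic parsing is taken as already done), the operator classes `Scan`, `Select`, and `Project`, and the single optimization heuristic, which applies the most selective equality predicate first. Real optimizers work from catalog statistics and much richer cost models.

```python
from dataclasses import dataclass

# Hypothetical schema and data, invented for this sketch.
SCHEMA = {"emp": {"name", "dept", "salary"}}
DATA = {"emp": [
    {"name": "Ann", "dept": "R&D", "salary": 50},
    {"name": "Bob", "dept": "Ops", "salary": 40},
    {"name": "Eve", "dept": "R&D", "salary": 30},
]}

# Nodes of a tiny query algebra (the query tree QT).
@dataclass
class Scan:
    table: str

@dataclass
class Select:
    child: object
    attr: str
    value: object

@dataclass
class Project:
    child: object
    attrs: tuple

def analyze(query):
    """Step 1: verify that referenced schema elements exist."""
    table = query["from"]
    if table not in SCHEMA:
        raise ValueError(f"unknown table: {table}")
    used = set(query["select"]) | {attr for attr, _ in query["where"]}
    unknown = used - SCHEMA[table]
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")

def translate(query):
    """Step 2: build a naive query tree, predicates in written order."""
    node = Scan(query["from"])
    for attr, value in query["where"]:
        node = Select(node, attr, value)
    return Project(node, tuple(query["select"]))

def optimize(tree):
    """Step 3: reorder selections so the most selective one runs first.

    Assumes the tree shape produced by translate(): Project over a
    chain of Select nodes over a Scan.
    """
    preds, node = [], tree.child
    while isinstance(node, Select):
        preds.append((node.attr, node.value))
        node = node.child
    rows = DATA[node.table]

    def estimated_matches(pred):
        attr, value = pred
        return sum(1 for r in rows if r[attr] == value)

    plan = node
    for attr, value in sorted(preds, key=estimated_matches):
        plan = Select(plan, attr, value)
    return Project(plan, tree.attrs)

def execute(node):
    """Step 4: evaluate the plan bottom-up."""
    if isinstance(node, Scan):
        return list(DATA[node.table])
    if isinstance(node, Select):
        return [r for r in execute(node.child) if r[node.attr] == node.value]
    if isinstance(node, Project):
        return [{a: r[a] for a in node.attrs} for r in execute(node.child)]
    raise TypeError(f"unknown plan node: {node!r}")

def evaluate(query):
    analyze(query)
    return execute(optimize(translate(query)))

query = {"select": ["name"], "from": "emp",
         "where": [("dept", "R&D"), ("salary", 50)]}
print(evaluate(query))  # [{'name': 'Ann'}]
```

In this sketch the optimizer moves the `salary` test below the `dept` test because it matches fewer rows, so the second filter sees a smaller intermediate result.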
Key Terms in this Chapter
Decision Rule: A rule of the form ‘if condition then class’, where the condition is a Boolean combination of attribute tests and the class is the one assigned to an instance satisfying the condition.
Inductive Databases: Databases that besides raw data contain inductive generalizations about that data.
Knowledge Discovery in Databases: The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Prediction Join: An operator for applying a set of decision rules to classify uncategorized data.
Classification Model: A set of rules used to predict the class of an instance based on its attribute values.
Object-Relational Databases: The extension of relational databases to include object-oriented concepts such as collections, ADTs, tuple references, and inheritance.
Data Mining: The application of algorithms for discovering patterns in data.
Pattern: An expression in some language representing a high-level description of a dataset.
Inductive Query Language: A query language to perform various operations on data such as data preprocessing, pattern discovery, and pattern postprocessing.
Abstract Data Type (ADT): Specification of a set of data and the set of operations that can be performed on the data.
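To make the Decision Rule, Classification Model, and Prediction Join entries concrete, the following Python sketch applies a small set of rules to uncategorized rows. The names `DecisionRule` and `prediction_join`, the first-matching-rule semantics, the `default` fallback, and the sample attributes are all assumptions made for illustration, not definitions taken from the chapter.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DecisionRule:
    # Boolean combination of attribute tests over one instance (a dict).
    condition: Callable[[dict], bool]
    # Class assigned to an instance that satisfies the condition.
    label: str

def prediction_join(rules, rows, default=None):
    """Extend every row with a 'class' attribute.

    Assumption of this sketch: the first rule whose condition holds
    decides the class; rows matching no rule receive `default`.
    """
    classified = []
    for row in rows:
        label = next((r.label for r in rules if r.condition(row)), default)
        classified.append({**row, "class": label})
    return classified

# Hypothetical classification model: two rules over invented attributes.
model = [
    DecisionRule(lambda r: r["income"] >= 50 and not r["defaulted"], "low_risk"),
    DecisionRule(lambda r: r["income"] < 50 or r["defaulted"], "high_risk"),
]

# Hypothetical uncategorized data.
applicants = [
    {"id": 1, "income": 60, "defaulted": False},
    {"id": 2, "income": 30, "defaulted": False},
]

for row in prediction_join(model, applicants):
    print(row["id"], row["class"])
```

The operator behaves like a join between the data and the model: each input row is paired with the rule it satisfies, and the rule's class becomes a new attribute of the output row.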