Data Mining and the KDD Process

Ana Funes (Universidad Nacional de San Luis, Argentina) and Aristides Dasso (Universidad Nacional de San Luis, Argentina)
Copyright: © 2018 | Pages: 15
DOI: 10.4018/978-1-5225-2255-3.ch167


A growing number of applications depend on the analysis and discovery of new patterns, which has fueled the research and development of new methods in Machine Learning, Knowledge Extraction, Knowledge Discovery in Databases (KDD), and Data Mining. The development of Data Mining and related disciplines has benefited from the availability of large volumes of data from the most diverse sources and domains. The KDD process and Data Mining methods allow the discovery of knowledge in data that would otherwise remain hidden to humans, and they present this knowledge in different ways. This chapter gives an overview of the KDD process, with a special focus on the Data Mining phase. It also discusses Data Mining tasks and methods, a possible classification of them, the relation of Data Mining to other disciplines, and future challenges in the field.
Chapter Preview


There is some confusion in the use of the terms Knowledge Discovery in Databases (KDD) and Data Mining. The two are frequently interchanged, with Data Mining used as a synonym of KDD. Although they are strongly related, it is important to clarify the differences between them.

Several definitions of Data Mining can be found in the literature. Witten and Frank (2000) refer to Data Mining as the process of extracting previously unknown, useful, and understandable knowledge from large volumes of data, which can be in different formats and come from different sources. More concisely, Hernández-Orallo, Ferri and Ramírez-Quintana (2004) define Data Mining as the process of converting data into knowledge. Data Mining is also referred to by many other names, including knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing (Fayyad et al., 1996a).

The notion of Data Mining is not new. Since the 1960s, statisticians have used other terms, such as Data Fishing or Data Dredging, to refer to the idea of finding correlations in data without a prior hypothesis about the underlying causality. However, it was not until the late 1980s that Data Mining became a discipline of Computer Science and the scientific community adopted the term. In fact, as Witten and Frank (2005) point out, the first book on data mining appeared in 1991 (Piatetsky-Shapiro and Frawley, 1991): a collection of papers presented at a workshop on knowledge discovery in databases in the late 1980s.

Key Terms in this Chapter

Inductive Learning: Induction is the inference of information from data, and inductive learning is a model-building process in which the data are analyzed to find hidden patterns.

Supervised Learning: Learning process of a predictive model from a set of objects, where a supervisor defines classes and supplies objects of each class. Once the model has been formulated, it can be used to predict the class(es) of new objects.

Data Mining: The process of extraction of implicit, previously unknown, and potentially useful knowledge from data. It uses Machine Learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans. It is a phase in a bigger process: the Knowledge Discovery in Databases (KDD) process.

Unsupervised Learning: Learning process of a descriptive model (patterns) by observation and discovery from a set of unlabeled objects.

Classification: Inductive task where a predictive model is learned from objects labeled with a class; the model can then be used to predict the class of new objects.

KDD Process: The KDD process is an iterative process that consists of the selection, cleaning, and transformation of data coming not only from databases but also from other heterogeneous sources, such as plain text, data warehouses, images, sound, etc., so that data mining algorithms can be applied to them in order to discover valid, novel, potentially useful, and understandable hidden patterns.

Clustering: Inductive task where a set of unlabeled objects is partitioned into groups (clusters) such that objects in the same cluster have similar characteristics, maximizing intra-cluster similarity and minimizing inter-cluster similarity.
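
As an illustration of the Clustering definition above, the following is a minimal sketch, not a method taken from the chapter. It assumes that similarity is measured by Euclidean distance and uses k-means (Lloyd's algorithm) as one possible clustering method. The function names, the sample points, and the choice of initial centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Illustrative k-means (Lloyd's algorithm) from given starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

def mean_distance_to_centroid(cluster, centroid):
    """Average distance of a cluster's points to its centroid (lower = more similar)."""
    return sum(math.dist(p, centroid) for p in cluster) / len(cluster)

# Hypothetical unlabeled objects, each described by two numeric attributes.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: one starting centroid is picked by hand from each apparent group.
centroids, clusters = kmeans(points, [points[0], points[3]])

# Intra-cluster spread versus the distance between the two cluster centers.
intra = [mean_distance_to_centroid(c, m) for c, m in zip(clusters, centroids)]
inter = math.dist(centroids[0], centroids[1])
print("clusters:", clusters)
print("mean intra-cluster distances:", intra)
print("inter-centroid distance:", inter)
```

In this sketch, small intra-cluster distances correspond to high similarity within a cluster, and a large inter-centroid distance corresponds to low similarity between clusters. Other distance measures or clustering methods would fit the definition equally well.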
