Uncertainty in Concept Hierarchies for Generalization in Data Mining

Theresa Beaubouef, Frederick Petry
Copyright: © 2013 | Pages: 20
DOI: 10.4018/978-1-4666-3942-3.ch003

Abstract

Attribute-oriented induction is a data mining approach that summarizes the data in a database through generalization; the summaries can then be used for knowledge discovery in the form of rules or patterns. Generalization is accomplished through the use of a concept hierarchy. When uncertainty is involved in the development and use of the concept hierarchy, the theory behind the uncertainty models in use must first be established. This chapter focuses on providing the foundations for defining imprecise hierarchies and for the generalization process with crisp and rough data and hierarchies. Scaling and efficiency issues here involve the creation of appropriate concept hierarchies and the scaling of the generalization process to large databases.

Introduction

The world abounds in data, and as technology advances, opportunities for collecting, storing, and using this data increase. The magnitude of such data, as well as its typical lack of organization, can prove daunting without some means of automatically generating useful information from it. Data mining has developed into a useful tool for discovering interesting patterns and relationships in data, and its techniques have benefited information systems and users in a wide variety of fields.

One of the more widely known uses of data mining is for marketing purposes. Often the goal is to predict customer behavior (Chopra, Bhanbri, & Krishan, 2011) or to target selected groups for advertising purposes. Managers can use information from data mining to determine strategies for maximizing results without investing in strategies determined to have lower impact on the bottom line.

In the healthcare industry, data mining can help patients obtain better and less expensive healthcare while providing better information for both healthcare providers and patients (Koh & Tan, 2005; Rafalsky, 2002). It can be used to evaluate treatment practices, help with customer relationship management, and detect fraud and insurance abuse. Data mining in healthcare can also alert providers and authorities to possible epidemics and bioterrorism threats (Piazza, 2002).

Data mining is also well established in a variety of scientific and engineering applications (Grossman, Kamath, Kegelmeyer, & Kumar, 2001). In spatial databases and geographic information systems, data contains positional information that often allows for the discovery of patterns involving spatial relationships (Miller & Han, 2001; Kopersky & Han, 1995). It has been used in numerous ways, including the study of demographics (Malerba, 2002).

With new technologies, Web usage, social media, and smart devices come additional opportunities for data mining applications. Radio frequency identification (RFID), for example, can generate huge volumes of data, and there is a great need for data mining techniques to assist with tracking, business processes, and organization (Kim, Kim, Jung, Kang, & Noh, 2009). Data mining techniques for Web usage have also been developed (Pohle & Spiliopoulou, 2002).

In data mining applications, data or concepts can often be generalized in an effort to discover useful patterns or rules. This generalization must be done in a systematic and meaningful way. One approach is attribute-oriented induction, which provides summaries of data in a database by generalization. Generalization is achieved by using a concept hierarchy: specific attribute values in a database tuple are replaced by more general values higher in the hierarchy. The resulting tuples may then be merged, ultimately producing a reduced number of tuples that represent a summarization of the data. This is related to, but more general than, the roll-up operation on a data cube. Such data summarization groups the data, transforming similar item sets, originally stored in a database at the low (primitive) level, into more abstract conceptual representations. When either the data or the generalization process incorporates uncertainty, however, there can be many ways to determine how data is generalized.
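
The replace-and-merge step described above can be illustrated with a minimal Python sketch. It assumes a crisp (certain) hierarchy only and is not the chapter's own implementation. The hierarchy contents, the sample tuples, and the names HIERARCHY, generalize, and induce are all hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical crisp concept hierarchy: each value maps to its parent concept.
HIERARCHY = {
    "Baton Rouge": "Louisiana",
    "New Orleans": "Louisiana",
    "Houston": "Texas",
    "Louisiana": "Gulf Coast",
    "Texas": "Gulf Coast",
}


def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy; values with no parent stay as-is."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)
    return value


def induce(tuples, attr_index, levels=1):
    """Generalize one attribute, then merge identical tuples, keeping a count."""
    counts = Counter()
    for row in tuples:
        row = list(row)
        row[attr_index] = generalize(row[attr_index], levels)
        counts[tuple(row)] += 1
    return dict(counts)


# Illustrative (made-up) tuples: (city, business sector).
rows = [
    ("Baton Rouge", "retail"),
    ("New Orleans", "retail"),
    ("Houston", "retail"),
    ("New Orleans", "shipping"),
]

one_level = induce(rows, attr_index=0, levels=1)   # cities become states
two_levels = induce(rows, attr_index=0, levels=2)  # cities become a region
```

Here four primitive-level tuples reduce to three after one level of generalization and to two after two levels, with the counts recording how many original tuples each summary tuple covers. An imprecise or rough hierarchy would need a value to map to more than one parent, or to a parent with some degree of membership, which this sketch deliberately leaves out.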
