Distance-Based Methods for Association Rule Mining

Distance-Based Methods for Association Rule Mining

Vladimír Bartík (Brno University of Technology, Czech Republic)
Copyright: © 2009 |Pages: 6
DOI: 10.4018/978-1-60566-010-3.ch107
OnDemand PDF Download:
$37.50

Abstract

Association rules are one of the most frequently used types of knowledge discovered from databases. The problem of discovering association rules was first introduced in (Agrawal, Imielinski & Swami, 1993). Here, association rules are discovered from transactional databases –a set of transactions where a transaction is a set of items. An association rule is an expression of a form A?B where A and B are sets of items. A typical application is market basket analysis. Here, the transaction is the content of a basket and items are products. For example, if a rule milk ? juice ? coffee is discovered, it is interpreted as: “If the customer buys milk and juice, s/he is likely to buy coffee too.” These rules are called single-dimensional Boolean association rules (Han & Kamber, 2001). The potential usefulness of the rule is expressed by means of two metrics – support and confidence. A lot of algorithms have been developed for mining association rules in transactional databases. The best known is the Apriori algorithm (Agrawal & Srikant, 1994), which has many modifications, e.g. (Kotásek & Zendulka, 2000). These algorithms usually consist of two phases: discovery of frequent itemsets and generation of association rules from them. A frequent itemset is a set of items having support greater than a threshold called minimum support. Association rule generation is controlled by another threshold referred to as minimum confidence. Association rules discovered can have a more general form and their mining is more complex than mining rules from transactional databases. In relational databases, association rules are ordinarily discovered from data of one table (it can be the result of joining several other tables). The table can have many columns (attributes) defined on domains of different types. It is useful to distinguish two types of attributes. A categorical attribute (also called nominal) has a finite number of possible values with no ordering among the values (e.g. a country of a customer). A quantitative attribute is a numeric attribute, domain of which is infinite or very large. In addition, it has an implicit ordering among values (e.g. age and salary of a customer). An association rule (Age = [20…30]) ? (Country = “Czech Rep.”) ? (Salary = [1000$...2000$]) says that if the customer is between 20 and 30 and is from the Czech Republic, s/he is likely to earn between 1000$ and 2000$ per month. Such rules with two or more predicates (items) containing different attributes are also called multidimensional association rules. If some attributes of rules are quantitative, the rules are called quantitative association rules (Han & Kamber, 2001). If a table contains only categorical attributes, it is possible to use modified algorithms for mining association rules in transactional databases. The crucial problem is to process quantitative attributes because their domains are very large and these algorithms cannot be used. Quantitative attributes must be discretized. This article deals with mining multidimensional association rules from relational databases, with main focus on distance-based methods. One of them is a novel method developed by the authors.
Chapter Preview
Top

Background

There are three basic approaches regarding the treatment of quantitative attributes (Han & Kamber, 2001). First one uses a predefined set of ranges (or, in general, a concept hierarchy) to replace the original numeric values of a quantitative attribute by ranges that represent intervals of values. This discretization occurs prior to applying a mining algorithm. It is static and predetermined. In the second approach, quantitative attributes are initially discretized statically. The resulting ranges are then combined during the mining algorithm. Therefore, the discretization process is dynamic. The third approach tries to define ranges based on semantic meaning of the data. This discretization is dynamic too. It considers distance between data points. Discretization and mining methods based on this approach are referred to as distance-based methods.

Complete Chapter List

Search this Book:
Reset