Realistic Data for Testing Rule Mining Algorithms

Colin Cooper (King’s College, UK) and Michele Zito (University of Liverpool, UK)
Copyright: © 2009 |Pages: 6
DOI: 10.4018/978-1-60566-010-3.ch252


The association rule mining (ARM) problem is a well-established topic in the field of knowledge discovery in databases. The problem addressed by ARM is to identify a set of relations (associations) in a binary-valued attribute set that describe the likely coexistence of groups of attributes. To this end it is first necessary to identify sets of items that occur frequently, i.e. those subsets F of the available set of attributes I for which the support (the number of times F occurs in the dataset under consideration) exceeds some threshold value. Other criteria are then applied to these item-sets to generate a set of association rules, i.e. relations of the form A → B, where A and B represent disjoint subsets of a frequent item-set F such that A ∪ B = F. A vast array of algorithms and techniques has been developed to solve the ARM problem. The algorithms of Agrawal & Srikant (1994), Bayardo (1998), Brin et al. (1997), Han et al. (2000), and Toivonen (1996) are only some of the best-known heuristics.

Interest has recently grown in the class of so-called heavy-tail statistical distributions. Distributions of this kind have been used in the past to describe word frequencies in text (Zipf, 1949), the distribution of animal species (Yule, 1925), of income (Mandelbrot, 1960), of scientific citation counts (Redner, 1998) and many other phenomena. They have been used recently to model various statistics of the web and other complex networks (Barabasi & Albert, 1999; Faloutsos et al., 1999; Steyvers & Tenenbaum, 2005).
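
To make the definitions of support, frequent item-set and association rule concrete, the following is a minimal brute-force sketch in Python. It is not one of the cited algorithms and is not taken from the chapter. The function names (support, frequent_itemsets, rules_from) are hypothetical, and the use of confidence as the rule-selection criterion is an assumption, since the text only speaks of "other criteria".

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def frequent_itemsets(transactions, threshold):
    """Brute force: every non-empty item-set F whose support exceeds threshold."""
    universe = sorted(set().union(*transactions))
    found = {}
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            s = support(candidate, transactions)
            if s > threshold:  # "exceeds some threshold value"
                found[frozenset(candidate)] = s
    return found

def rules_from(itemset, transactions, min_confidence):
    """Rules A -> B with A, B disjoint, non-empty and A | B == itemset.

    Assumption: rules are kept when confidence = support(F) / support(A)
    reaches min_confidence; the source does not name the criterion.
    """
    f = frozenset(itemset)
    s_f = support(f, transactions)
    rules = []
    for k in range(1, len(f)):
        for a in combinations(sorted(f), k):
            a = frozenset(a)
            s_a = support(a, transactions)
            if s_a == 0:
                continue  # A never occurs, so no rule can be formed
            confidence = s_f / s_a
            if confidence >= min_confidence:
                rules.append((a, f - a, confidence))
    return rules

# Toy dataset: each transaction is the set of attributes with value 1.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
frequent = frequent_itemsets(transactions, threshold=1)
rules = rules_from("ac", transactions, min_confidence=0.9)
```

On this toy dataset the item-set {a, c} is frequent and yields the single rule c → a at the chosen confidence level. The enumeration is exponential in the number of attributes, which is why practical work relies on heuristics such as those cited above.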

Main Focus

The purpose of this short contribution is twofold. First, it provides additional arguments supporting the view that real-life databases show structural properties that are very different from those of the data generated by QUEST. Second, it describes a proposal for an alternative data generator that is simpler and more realistic than QUEST. The arguments are based on results described in Cooper & Zito (2007).
