Association Rules and Statistics

Martine Cadot (University of Henri Poincaré/LORIA, Nancy, France), Jean-Baptiste Maj (LORIA/INRIA, France) and Tarek Ziadé (NUXEO, France)
DOI: 10.4018/978-1-60566-010-3.ch016

Abstract

A manager would like a dashboard of the company without having to manipulate data. Traditionally, statistics met this need, but data have changed (Jensen, 1992): their size has increased, and they are badly structured (Han & Kamber, 2001). A more recent approach, data mining, has been developed to analyze this type of data (Piatetski-Shapiro, 2000). One data mining method that fits the manager's goal is the extraction of association rules (Hand, Mannila & Smyth, 2001). This extraction is a part of attribute-oriented induction (Guyon & Elisseeff, 2003). The aim of this paper is to compare the two types of extracted knowledge: association rules and statistical results.

Main Thrust

The problem differs with the number of variables. In what follows, problems with two, three, or more variables are discussed.

Two Variables

The link between two variables (A and B) depends on the coding. Statistical methods perform better when the data are quantitative. A common model is linear regression. For instance, the salary (S) of a worker can be expressed by the following equation:

S = 100 Y + 20000 + ε    (1)

where Y is the number of years in the company, and ε is a random error term. This model means that the salary of a newcomer to the company is \$20,000 and increases by \$100 per year.
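As an illustration of equation (1), the following minimal sketch generates synthetic workers from the model and recovers the two coefficients by ordinary least squares. The sample size, the range of Y, the noise level, and the helper name fit_line are assumptions chosen for the example; they do not come from the chapter.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (assumption): 500 workers with 0 to 30 years of seniority,
# salaries drawn from S = 100 Y + 20000 + eps, eps ~ Normal(0, 50).
rng = random.Random(0)
years = [rng.randint(0, 30) for _ in range(500)]
salaries = [100 * y + 20000 + rng.gauss(0, 50) for y in years]

slope, intercept = fit_line(years, salaries)
print(f"estimated raise per year: {slope:.1f}")
print(f"estimated starting salary: {intercept:.1f}")
```

With enough data and moderate noise, the estimated slope and intercept should land close to 100 and 20000, the raise per year and the starting salary in the model.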

The association rule for this model is: Y→S. This means that there are few senior workers with a small paycheck. To state the rule, the variables are translated into binary variables. Y is no longer the number of years but the property “has seniority,” which is not quantitative but of type Yes/No. The same transformation is applied to the salary S, which becomes the property “has a big salary.”

Both methods therefore describe the link between the two variables, and each has its own instruments for measuring the quality of that link. For statistics, there are tests of the regression model (Baillargeon, 1996); for association rules, there are measures such as support and confidence (Kodratoff, 2001). Depending on the type of data, however, one model is more appropriate than the other (Figure 1).
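The binary coding and the two rule measures can be sketched as follows. The worker records, the two thresholds (10 years for “has seniority,” 21000 for “has a big salary”), and the function name rule_measures are hypothetical and serve only to illustrate the idea. Support is taken in its usual sense, the fraction of records where both properties hold, and confidence as the fraction of records with the antecedent that also have the consequent.

```python
def rule_measures(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(records)
    n_a = sum(1 for r in records if r[antecedent])
    n_ab = sum(1 for r in records if r[antecedent] and r[consequent])
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical workers as (years in the company, salary) pairs.
workers = [
    (1, 20100), (2, 20200), (3, 20300), (12, 21200),
    (15, 21500), (18, 20300), (20, 22000), (25, 22500),
]

# Assumed thresholds for turning quantitative values into Yes/No properties.
SENIORITY_YEARS = 10
BIG_SALARY = 21000

records = [
    {"Y": years >= SENIORITY_YEARS, "S": salary >= BIG_SALARY}
    for years, salary in workers
]

support, confidence = rule_measures(records, "Y", "S")
print(f"support(Y -> S) = {support:.2f}")
print(f"confidence(Y -> S) = {confidence:.2f}")
```

In this toy sample, five workers have seniority and four of them also have a big salary, so the rule holds for most but not all senior workers. The result depends on where the thresholds are placed, which is one reason the coding of the variables matters.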
