In recent years, we have witnessed explosive growth in the amount of data generated and stored in practically every field (e.g., science, business, medicine, and the military, to name a few). However, the ability to store ever more data has not been matched by a comparable growth in processing power, and much of the accumulated data therefore remains unanalyzed. Data mining, which can be defined as the process of applying computational techniques (i.e., algorithms implemented as computer programs) to find patterns in data, tries to bridge this gap. Data mining technologies include, among others, association rule discovery, classification, clustering, summarization, regression, and sequential pattern discovery (Adrians & Zantige, 1996; Chen, Han, & Yu, 1996; Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Association rule discovery, in particular, was motivated by applications known as market basket analysis, which examine the items purchased by customers; that is, what kinds of products tend to be purchased together (Agrawal, Imielinski, & Swami, 1993).
The goal of the data mining task is to find all frequent itemsets above a user-specified threshold (called support) and, using these frequent itemsets as input, to generate all association rules above another threshold (called confidence). This type of information can be used for catalogue design, store layout, product placement, target marketing, and so forth. The prototypical application of this task has been market basket analysis, but the model is not limited to it and can be applied to many other domains (e.g., text documents [Holt & Chung, 2001], census data [Brin et al., 1997], telecommunication data, and even medical images). In fact, any data set consisting of “baskets” containing multiple “items” fits this model. Many solutions have been proposed in recent years, using either a sequential or a parallel paradigm and addressing factors such as memory requirements, I/O scans, dimensionality reduction, and so forth.
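The task described above can be sketched as a brute-force baseline (a hypothetical toy example; the transactions, thresholds, and helper names are assumptions, not taken from this chapter): enumerate candidate itemsets, keep those whose support meets the first threshold, then derive the rules whose confidence meets the second.

```python
from itertools import combinations

# Hypothetical toy "baskets"; min_support and min_confidence are
# the user-specified thresholds described in the text.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 0.6       # fraction of transactions
min_confidence = 0.8

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate all itemsets (feasible only for tiny item universes;
# real algorithms such as Apriori avoid this exhaustive search).
items = set().union(*transactions)
frequent = {
    frozenset(c): support(frozenset(c))
    for k in range(1, len(items) + 1)
    for c in combinations(sorted(items), k)
    if support(frozenset(c)) >= min_support
}

# Generate rules A -> B from each frequent itemset, keeping those
# whose confidence = support(A u B) / support(A) meets the threshold.
# Every antecedent A is frequent too, since support is anti-monotone.
rules = []
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), k)):
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
```

On this toy data the search yields eight frequent itemsets and a single rule, {beer} → {diapers}, with confidence 1.0.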
The problem was first introduced by Agrawal et al. (1993), who proposed an algorithm named AIS for addressing it. Agrawal and Srikant (1994) introduced a much more efficient solution with two new algorithms, Apriori and AprioriTid. The Apriori algorithm has been, and still is, the major reference point for all subsequent work. Most algorithms and approaches proposed thereafter (Toivonen, 1996; Brin et al., 1997; Park, Chen, & Yu, 1995; Han, Pei, & Yin, 2000) focus either on decreasing the number of passes made over the data or on improving the efficiency of those passes (e.g., by using additional methods to prune the number of candidates that have to be counted). Other topics studied in association rule mining include: (1) incremental updating (Cheung, Han, Ng, & Wong, 1996; Lee, Lin, & Chen, 2001); (2) mining of generalized and multi-level rules (Han & Fu, 1995; Srikant & Agrawal, 1995); (3) using multiple minimum supports (Liu, Hsu, & Ma, 1999); (4) mining of quantitative rules (Srikant & Agrawal, 1996); and (5) parallel algorithms (Agrawal & Shafer, 1996; Park, Chen, & Yu, 1995).
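The candidate-pruning idea at the heart of Apriori can be illustrated with a sketch of its candidate-generation step (a simplified rendering under assumed names, not the authors' code): two frequent (k-1)-itemsets sharing a common prefix are joined, and the resulting candidate is discarded if any of its (k-1)-subsets is not frequent, since support is anti-monotone.

```python
from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """Candidate generation with subset-based pruning, in the spirit of
    Apriori. frequent_k_minus_1: a set of frozensets of equal size k-1."""
    candidates = set()
    fk = sorted(map(sorted, frequent_k_minus_1))
    for i in range(len(fk)):
        for j in range(i + 1, len(fk)):
            # Join step: merge two (k-1)-itemsets whose first k-2 items agree.
            if fk[i][:-1] == fk[j][:-1]:
                candidate = frozenset(fk[i]) | frozenset(fk[j])
                # Prune step: every (k-1)-subset of the candidate must
                # itself be frequent, or the candidate cannot be frequent.
                if all(frozenset(s) in frequent_k_minus_1
                       for s in combinations(candidate, len(candidate) - 1)):
                    candidates.add(candidate)
    return candidates
```

For example, from the frequent 2-itemsets {bread, milk}, {bread, diapers}, {milk, diapers}, and {beer, diapers}, only the candidate {bread, milk, diapers} survives the join and prune steps; if {milk, diapers} were not frequent, that candidate would be pruned without ever being counted.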
Key Terms in this Chapter
Association Rules Mining: The discovery of frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases and other information repositories.
Data Mining: The analysis of data in a database using tools that look for trends or anomalies without prior knowledge of the meaning of the data; the nontrivial extraction of implicit, previously unknown, and potentially useful information from data; the science of extracting useful information from large data sets or databases.
Data Clustering: The unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters).
Support: A statistical measure of importance that indicates how frequently a pattern (itemset) occurs in the database, that is, the fraction of transactions containing it.
Itemsets: A collection or a combination of items in a database.
Confidence: A statistical measure of importance that denotes the strength of a rule’s implication, that is, the fraction of transactions containing the rule’s antecedent that also contain its consequent.
“Hot” Itemsets: Itemsets whose number of appearances is abnormally high compared to the appearances of all other itemsets, and especially compared to their own previous number of appearances.
Locally Frequent Itemsets: Itemsets whose appearances are highly localized and condensed compared to those of all other itemsets.
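The support and confidence measures defined above can be illustrated numerically (a hypothetical toy example; the transactions and names are assumptions): for a rule {bread} → {butter}, support is the fraction of transactions containing both items, and confidence is that support divided by the support of {bread} alone.

```python
# Hypothetical toy transactions (an assumed example, not from this chapter).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {butter}
rule_support = support({"bread", "butter"})          # 2 of 4 transactions
rule_confidence = rule_support / support({"bread"})  # 0.5 / 0.75
```

Here the rule has support 0.5 (it holds in two of the four baskets) and confidence 2/3 (of the three baskets containing bread, two also contain butter).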