Discovery of Frequent Patterns
Let I = {i1, i2, ..., im} be a set of distinct literals called ‘items’. A ‘pattern’, or an ‘itemset’, is a set of items. A ‘transaction’ is a non-empty set of items. A ‘dataset’ is a non-empty set of transactions. A pattern P is said to be contained or included in a transaction T if P ⊆ T. A pattern P is said to be contained in a dataset D, denoted as P ∈ D, if there is T ∈ D such that P ⊆ T. The ‘support count’ of a pattern P in a dataset D, denoted count(P,D), is the number of transactions in D that contain P. The ‘support’ of a pattern P in a dataset D, denoted sup(P,D), is calculated as sup(P,D) = count(P,D)/|D|. Figure 1(a) shows a sample dataset, and all the patterns contained in the sample dataset are enumerated in Figure 1(b) with their support counts.
Figure 1.
(a) An example of transaction dataset. (b) The space of frequent patterns for the sample dataset in (a) when ms%=25% and the concise representations of the space. (c) Decomposition of frequent pattern space into equivalence classes.
A pattern P is said to be frequent in a dataset D if sup(P,D) is greater than or equal to a pre-specified threshold ms%. Given a dataset D and a support threshold ms%, the collection of all frequent itemsets in D is called the ‘space of frequent patterns’, and is denoted by F(ms%, D). The task of frequent pattern mining is to discover all the patterns in the space of frequent patterns. In real-life applications, the size of the frequent pattern space is often tremendous. According to the definition, suppose the dataset has l distinct items, the size of the frequent pattern space can go up to 2l. To increase computational efficiency and reduce memory usage, concise representations are developed to summarize the frequent pattern space.