The discovery of association rules from large amounts of structured or semi-structured data is an important data mining problem [Agrawal et al. 1993, Agrawal and Srikant 1994, Miyahara et al. 2001, Termier et al. 2002, Braga et al. 2002, Cong et al. 2002, Braga et al. 2003, Xiao et al. 2003, Maruyama and Uehara 2000, Wang and Liu 2000]. It has crucial applications in decision support and marketing strategy. The most prototypical application of association rules is market basket analysis using transaction databases from supermarkets. These databases contain sales transaction records, each of which details items bought by a customer in the transaction. Mining association rules is the process of discovering knowledge such as “80% of customers who bought diapers also bought beer, and 35% of customers bought both diapers and beer”, which can be expressed as “diaper ? beer” (35%, 80%), where 80% is the confidence level of the rule, and 35% is the support level of the rule indicating how frequently the customers bought both diapers and beer. In general, an association rule takes the form X ? Y (s, c), where X and Y are sets of items, and s and c are support and confidence, respectively. In the XML Era, mining association rules is confronted with more challenges than in the traditional well-structured world due to the inherent flexibilities of XML in both structure and semantics [Feng and Dillon 2005]. First, XML data has a more complex hierarchical structure than a database record. Second, elements in XML data have contextual positions, which thus carry the order notion. Third, XML data appears to be much bigger than traditional data. To address these challenges, the classic association rule mining framework originating with transactional databases needs to be re-examined.
In the literature, there exist techniques proposed to mine frequent patterns from complex tress and graphs databases. One of the most popular approaches is to use graph matching, which employs data structures like adjacency matrix [Inokuchi et al. 2000] or adjacency list [Kuramochi and Karypis 2001]. Another approach represents semi-structured tree-like structures using a string representation, which is more space efficient and relatively easy for manipulation [Zaki 2002]. This work concentrated on mining frequent tree structures within a forest, which can be extended to for mining frequent tree structures in XML documents. [Zhang et al. 2004, Zhang et al. 2005] proposed a framework, called XAR-Miner, which is directly applicable to mining association rules from XML data. Raw data in the XML document are preprocessed to transform to either an Indexed Content Tree (IX-tree) or Multi-relational databases (Multi-DB), depending on the size of XML document and memory constraint of the system, for efficient data selection and association rule mining. Task-relevant concepts are generalized to produce generalized meta-patterns, based on which the large association rules that meet the support and confidence levels are generated. Recently, Confronted with huge volume of XML data, [Tan, et al. 2005] proposed to generate candidates by model-validating, so that there is no time wasted in deriving invalid candidates which will be discarded at later stages. The algorithm processes an XML document directly taking into account the values of the nodes present in the XML tree, so the frequent item-sets generated contain both node names and values in comparison to the TreeMiner approach, which only generates frequent tree structures. The experiments with both synthetic and real life data sets demonstrate the efficiency of this approach.