XML Tree Classification on Evolving Data Streams

XML Tree Classification on Evolving Data Streams

Albert Bifet (University of Waikato, New Zealand) and Ricard Gavaldà (UPC Barcelona Tech, Spain)
Copyright: © 2012 |Pages: 20
DOI: 10.4018/978-1-61350-356-0.ch009


Nowadays, advanced analysis of data streams is quickly becoming a key area of data mining research, as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is, when concepts drift or change completely, is becoming one of the core issues. At the same time, closure-based mining on relational data has recently provided some interesting algorithmic developments as well as practical uses. In this chapter we show how to use closure-based mining to reduce drastically the number of attributes in XML tree classification tasks. Moreover, using maximal frequent trees, we reduce even more the number of attributes needed in tree classification, in many cases without losing accuracy. We show a general framework to classify XML trees using subtree occurrence, composing a Tree XML Closed Frequent Miner with a classifier algorithm. We present specific methods that can adaptively mining closed patterns from data streams that change over time.
Chapter Preview


Pattern classification and frequent pattern discovery have possibly become the most important data mining tasks over the last decade. Nowadays, they are becoming harder, as the size of the pattern datasets is increasing, data often comes from sequential, streaming sources, and we cannot assume that data has been generated from a static distribution. If we want accuracy in the results of our algorithms, we have to consider that the distribution that generates data may vary over time, often in an unpredictable and drastic way.

Tree Mining is becoming an important field of research due, among others, to the fact that XML patterns are tree patterns and that XML has become a standard for information representation and exchange over the Internet. XML data is growing and it will soon constitute one of the largest collection of human knowledge. Other applications of tree mining appear in chemical informatics, computer vision, text retrieval, bioinformatics, and Web analysis (Nayak et al., 2009, Denoyer, Gallinari, & Vercoustre, 2006). XML tree classification (Denoyer & Gallinari, 2004) has been done traditionally using information retrieval techniques considering the labels of nodes as bags of words (Campos, Fernandez-Luna, Huete, & Romero, 2008, Yang & Zhang, 2008) without taking into account the structure of the trees (Ceci & Appice, 2006). With the development of frequent tree miners, classification methods using frequent trees appeared (Zaki & Aggarwal, 2003, Kudo & Matsumoto, 2004, Collins & Duffy, 2001, Kashima & Koyanagi, 2002). Recently, closed frequent miners were proposed (Chi, Xia, Yang, & Muntz, 2005, Termier et al., 2008, Arimura & Uno, 2005), and using them for classification tasks is the next natural step (Kutty, Tran, Nayak, & Li, 2008, Candillier et al., 2007).

The main advantage of using closed patterns is that they still contain the essential information about frequent patterns while eliminating redundant one. In this chapter we show how closure-based mining can be used to reduce drastically the number of attributes in tree classification tasks. Also, we show how to use maximal frequent trees to reduce even more the number of attributes needed in tree classification, in many cases without loosing accuracy.

We study and show a general framework to classify XML trees based on subtree occurrence. It is formed by the composition of a tree XML closed frequent miner with a classification algorithm. We discuss specific methods for adaptively dealing with the problem on data streams that vary over time.

The rest of the chapter is organized as follows. We discuss the data stream setting and mention briefly some previous XML classification methods in Section “Background”. Sections “Frequent Pattern Compression” and “Classification using Compressed Frequent Patterns” introduce a tree closure operator and its properties needed for XML classification. Section “XML Tree Classification framework on data streams” shows the tree classification framework and introduces an adaptive closed frequent mining method. Experimental results are discussed in Section “Experimental Evaluation”. Finally, Section “Conclusions and Future Works” concludes this chapter.

Complete Chapter List

Search this Book: