Decision trees are, besides decision rules, one of the most popular forms of knowledge representation in the Knowledge Discovery in Databases process (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996), and implementations of the classical decision tree induction algorithms are included in the majority of data mining systems. The hierarchical structure of a tree-based classifier, where appropriate tests from consecutive nodes are applied in sequence, closely resembles the human way of decision making. This makes decision trees natural and easy to understand even for an inexperienced analyst. The popularity of the decision tree approach can also be explained by its ease of application, fast classification and, what may be most important, its effectiveness.

Two main types of decision trees can be distinguished by the type of tests in their non-terminal nodes: univariate and multivariate decision trees. In the first group, a single attribute is used in each test. For a continuous-valued feature, an inequality test with binary outcomes is usually applied, and for a nominal attribute, mutually exclusive groups of attribute values are associated with the outcomes. The well-known C4.5 system developed by Quinlan (1993) is a good representative of univariate inducers. In univariate trees, a split is equivalent to partitioning the feature space with an axis-parallel hyper-plane. If the decision boundaries of a particular dataset are not axis-parallel, using such tests may lead to an overcomplicated classifier. This situation is known as the “staircase effect”. The problem can be mitigated by applying more sophisticated multivariate tests, in which more than one feature can be taken into account. The most common form of such a test is an oblique split, which is based on a linear combination of features (a hyper-plane).
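The difference between the two kinds of tests can be sketched in a few lines of code. The following is a minimal illustration, not taken from any particular system: a univariate test inspects a single attribute, while an oblique test thresholds a linear combination of all attributes, so a single oblique test can match a diagonal class boundary that axis-parallel tests could only approximate with a staircase of splits.

```python
import numpy as np

def univariate_test(x, feature, threshold):
    """Axis-parallel test: a binary inequality on one continuous-valued attribute."""
    return x[feature] <= threshold

def oblique_test(x, w, b):
    """Oblique test: a binary inequality on a linear combination of attributes,
    i.e., on which side of the hyper-plane w.x + b = 0 the example lies."""
    return np.dot(w, x) + b <= 0

# For a class boundary along the diagonal x0 = x1, the single oblique test
# with w = (1, -1), b = 0 separates the classes exactly.
x = np.array([0.3, 0.7])
print(univariate_test(x, feature=0, threshold=0.5))          # test on x0 alone
print(oblique_test(x, w=np.array([1.0, -1.0]), b=0.0))       # test on x0 - x1
```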
A decision tree which applies only oblique tests is often called oblique or linear, whereas heterogeneous trees with univariate, linear and other multivariate (e.g., instance-based) tests are called mixed decision trees (Llora & Wilson, 2004). It should be emphasized that the computational complexity of multivariate induction is generally significantly higher than that of univariate induction. CART (Breiman, Friedman, Olshen & Stone, 1984) and OC1 (Murthy, Kasif & Salzberg, 1994) are well-known examples of multivariate systems.
The issue of finding an optimal decision tree for a given classification problem is known to be a difficult optimization task. Naumov (1991) proved that optimal decision tree construction from data is NP-complete under a variety of measures. In this situation it is obvious that a computationally tractable induction algorithm has to rely on heuristics. The most popular strategy is based on the top-down approach (Rokach & Maimon, 2005), where a locally optimal search for tests (based, e.g., on the Gini, twoing or entropy rule) and data splitting are recursively applied to consecutive subsets of the training data until a stopping condition is met. Usually, the growing phase is followed by post-pruning (Esposito, Malerba & Semeraro, 1997), aimed at increasing the generalization power of the obtained classifier and mitigating the risk of overfitting the training data.
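The top-down strategy described above can be sketched as follows. This is a minimal, generic illustration of greedy induction with the Gini criterion and univariate splits (the function names and stopping parameters are our own, not taken from any cited system); real inducers such as C4.5 or CART add many refinements, including post-pruning.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_univariate_split(X, y):
    """Locally optimal (greedy) search: the feature/threshold pair that
    minimizes the weighted Gini impurity of the two resulting subsets."""
    best_f, best_t, best_score = None, None, gini(y)
    n = len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:        # candidate thresholds
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def grow(X, y, depth=0, max_depth=5, min_size=2):
    """Recursive top-down induction: choose a split, partition the data,
    and recurse on each subset until a stopping condition is met
    (pure node, maximum depth, or minimum node size)."""
    if depth >= max_depth or len(y) < min_size or gini(y) == 0.0:
        return {"leaf": np.bincount(y).argmax()}
    f, t = best_univariate_split(X, y)
    if f is None:                                 # no split improves impurity
        return {"leaf": np.bincount(y).argmax()}
    left = X[:, f] <= t
    return {"feature": f, "threshold": t,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}

def predict(tree, x):
    """Apply the tests from consecutive nodes until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]
```

Note that each split is chosen only with respect to the subset reaching the current node; this locality is precisely what the global approach discussed below is meant to overcome.
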
There are problems where the greedy search fails (e.g., the classical chess board problem) and more sophisticated methods are necessary. In this chapter, we present a global approach, in which the whole tree (i.e., its structure and all splits) is constructed at the same time. The motivation is that top-down induction with, e.g., entropy minimization makes only locally optimal decisions, whereas a tree that is constructed and assessed globally can be, at the least, more compact.
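The chess board failure is easy to demonstrate numerically. In the following sketch (our own illustration, using the Gini criterion as the greedy measure), every axis-parallel split of a 2x2 XOR-style dataset leaves both subsets with an even class mix, so the greedy criterion reports zero gain for every candidate test and has no basis for choosing a root split, even though a perfect two-level tree exists.

```python
import numpy as np

# A minimal "chess board" (XOR) dataset: the class is the parity of the quadrant.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(X, y, feature, threshold):
    """Impurity reduction achieved by a univariate split."""
    left = X[:, feature] <= threshold
    n = len(y)
    return gini(y) - (left.sum() * gini(y[left]) +
                      (~left).sum() * gini(y[~left])) / n

# Both candidate axis-parallel splits yield zero gain: each side of every
# split still contains one example of each class.
for f in (0, 1):
    print(f, gini_gain(X, y, f, 0.5))
```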