Component-Based Decision Trees: Empirical Testing on Data Sets of Account Holders in the Montenegrin Capital Market

Ljiljana Kašćelan (Faculty of Economics, University of Montenegro, Podgorica, Montenegro) and Vladimir Kašćelan (Faculty of Economics, University of Montenegro, Podgorica, Montenegro)
DOI: 10.4018/IJORIS.2015100101

Abstract

Popular decision tree (DT) algorithms such as ID3, C4.5, CART, CHAID and QUEST may produce different results on the same data set. They consist of components with similar functionality, but these components are implemented in different ways and perform differently. The best way to obtain an optimal DT for a data set is to use a component-based design, which enables the user to intelligently select, from components implemented in advance, those well suited to the specific data set. In this article the authors propose a component-based design of the optimal DT for the classification of securities account holders. The research results show that the optimal algorithm is not one of the original DT algorithms. This confirms that the component-based design provides algorithms with better performance than the original ones. The authors also identify how the specific features of the data influence the performance of the DT components. The classification results can be useful to future investors in the Montenegrin capital market.
Article Preview

Existing DT algorithms are usually implemented with a “black-box” approach: the user specifies the input data and the parameters that define the model, while the induction procedure itself is hidden. The user cannot change the algorithm in order to improve its results. Such algorithms are very hard to analyze and evaluate, because it is difficult to determine which part of the algorithm influenced its performance. A certain part of one algorithm may be the best for one data set, while for another data set the corresponding part of some other algorithm may be better. With the black-box approach it is not possible to test the performance of individual parts of algorithms on a data set, or to combine the most efficient parts of different algorithms. Better performance can be achieved only through incremental improvement of existing algorithms.

One of the first “black-box” DT algorithms is ID3 (Quinlan, 1986). This algorithm works only with categorical variables, is based on “multi-way” splits, and uses the “Information Gain” measure of split quality. This evaluation measure is biased towards choosing attributes with more categories. Breiman, Friedman, Stone and Olshen (1984) proposed the CART algorithm, which works with both categorical and numerical variables and uses the “Gini” measure for split evaluation. The algorithm supports only “binary” splits. The C4.5 algorithm (Quinlan, 1993) is an improvement of ID3 that can work with both categorical and numerical data. It uses “multi-way” splits for categorical data and “binary” splits for numerical data. For split evaluation it uses the “Gain Ratio” measure, which is not biased towards attributes with many categories. It also includes three pruning algorithms: reduced error pruning, pessimistic error pruning and error-based pruning. The CHAID algorithm was proposed by Kass (1980). In this algorithm the chi-square test is used to evaluate split quality. The QUEST algorithm (Loh & Shih, 1997) removes insignificant attributes using the chi-square test for categorical data and the ANOVA F-test for numerical data.
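
The split evaluation measures named above can be treated as interchangeable components. The following is a minimal Python sketch of three of them, using the textbook definitions of Information Gain, Gain Ratio and Gini impurity; it is not the authors' implementation. All function names and the toy data (`account_id`, `segment`, `labels`) are hypothetical, and the Gini function here scores a multi-way partition for simplicity, whereas CART itself evaluates binary splits.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gini(items):
    """Gini impurity of a list of discrete values."""
    n = len(items)
    return 1.0 - sum((c / n) ** 2 for c in Counter(items).values())


def partition(values, labels):
    """Group class labels by attribute value (a multi-way split)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """ID3-style measure: entropy reduction achieved by the split."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """C4.5-style measure: information gain divided by split information."""
    split_info = entropy(values)  # entropy of the attribute's own values
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split
    return information_gain(values, labels) / split_info


def gini_gain(values, labels):
    """Gini impurity reduction; simplified here to a multi-way partition."""
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in partition(values, labels))
    return gini(labels) - weighted


# Hypothetical toy data: six account holders, two candidate attributes.
labels = ["yes", "yes", "yes", "no", "no", "no"]
account_id = ["a", "b", "c", "d", "e", "f"]  # one category per record
segment = ["x", "x", "x", "y", "y", "y"]     # two categories

for name, attr in [("account_id", account_id), ("segment", segment)]:
    print(name,
          round(information_gain(attr, labels), 3),
          round(gain_ratio(attr, labels), 3),
          round(gini_gain(attr, labels), 3))
```

On this toy data both attributes separate the classes perfectly, so Information Gain cannot tell them apart, while Gain Ratio penalizes the attribute that has one category per record. Because every measure takes the same arguments, one can be swapped for another without changing the rest of the induction procedure.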
