Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

Baoxun Xu (Harbin Institute of Technology Shenzhen Graduate School, China), Joshua Zhexue Huang (Shenzhen Institutes of Advanced Technology and Chinese Academy of Sciences, China), Graham Williams (Shenzhen Institutes of Advanced Technology and Chinese Academy of Sciences, China), Qiang Wang (Harbin Institute of Technology Shenzhen Graduate School, China) and Yunming Ye (Harbin Institute of Technology Shenzhen Graduate School, China)
Copyright: © 2012 |Pages: 20
DOI: 10.4018/jdwm.2012040103
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach using randomly sampling a few features in the subspace is not suitable for high dimensional data consisting of thousands of features, because such data often contains many features which are uninformative to classification, and the random sampling often doesn’t include informative features in the selected subspaces. Consequently, classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance over high-dimensional data. A series of experiments on 9 real life high dimensional datasets demonstrated that using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
Article Preview

1. Introduction

One of the major current challenges to many classical classification algorithms is dealing with large high dimensional data. Examples include text data, microarray data and digital images which often have thousands of features and hundreds of thousands or millions of objects. Such very high dimensional data has two special characteristics that affect the performance of classification algorithms. One is that different classes of objects are present in subspaces of the data dimensions. For example, in text data, documents relating to sport are categorized by the key words describing sport, while documents relating to music are represented by the key words describing music. The other characteristic is that a large number of dimensional features are uninformative to the class feature. That is, many features are only weakly correlated to the class feature, if at all and have a low power in predicting object classes (Saxena & Wang, 2010).

The random forest (Breiman, 2001) algorithm is a popular classification method used to build ensemble models of decision trees from subspaces of data. Experimental results have shown that random forest models can achieve high accuracy in classifying high dimensional data (Banfield et al., 2007). Interest in random forests has grown in many domains where high dimensional data is prominent, including domains such as bioinformatics (Pang et al., 2006; Diaz-Uriarte & De Andres, 2006; Chen & Liu, 2005; Bureau et al., 2005), medical data mining (Ward et al., 2006) and image classification (Bosch, Zisserman, & Muoz, 2007).

Several methods have been proposed to build random forest models from subspaces of data (Breiman, 2001; Ho, 1995, 1998; Dietterich, 2000). Among them, Breiman's method (Breiman, 2001) has been popular due to its good performance compared to other methods (Banfield, Hall, Bowyer, & Kegelmeyer, 2007). Breiman uses a simple random sampling from all the available features to select subspaces when growing unpruned trees within the random forest model. Breiman suggested selecting features in a subspace, where M is the total of independent features in data. This works well for data with certain dimensions (e.g., less than 100 features) but is not suitable for very high dimensional data consisting of thousands of features. In contrast, for very high dimensional data, Breiman’s subspace size of is too small. Such data are dominated by uninformative features which have very low predictive power with respect to the target classification. Using a simple random sampling results in informative features not being included in subspaces (Amaratunga, Cabrera, & Lee, 2008). As a result, weak trees are created and classification performance of the random forest is significantly affected. To increase the chance of selecting informative features in subspaces, the subspace size has to be enlarged extensively. However, this increases the computational requirements of the algorithm and increases the likelihood of the resulting trees being correlated. Correlated trees reduce the classification performance of a random forest model (Breiman, 2001).

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing