Comparative Analysis of Random Forests with Statistical and Machine Learning Methods in Predicting Fault-Prone Classes

Ruchika Malhotra (Delhi Technological University, India), Arvinder Kaur (GGS Indraprastha University, India) and Yogesh Singh (GGS Indraprastha University, India)
DOI: 10.4018/978-1-61350-429-1.ch023

Abstract

Metrics are available for predicting fault-prone classes, and they may help software organizations plan and perform testing activities by allocating resources to the fault-prone parts of the design and code of the software. The importance and usefulness of such metrics is therefore clear, but their empirical validation remains a great challenge. The Random Forest (RF) algorithm has been applied successfully to regression and classification problems in many applications. In this work, the authors predict faulty classes/modules using object-oriented metrics and static code metrics. This chapter evaluates the capability of the RF algorithm and compares its performance with nine statistical and machine learning methods in predicting fault-prone software classes. The authors applied RF to six case studies based on open source software, commercial software, and NASA data sets. The results indicate that the prediction performance of RF is generally better than that of the statistical and machine learning models. Further, the classification of faulty classes/modules using the RF method is better than that of the other methods on most of the data sets.

2 An Overview of Random Forest (RF) Algorithms

RF combines the advantages of two machine learning techniques: bagging and random feature selection. Bagging trains each tree on a bootstrap sample of the training data and makes predictions by a majority vote of the trees. Random feature selection (Amit, 1997; Breiman, 2001) searches, at each node, for the best split over a random subset of the features. Combining random features with random inputs produces good results (Breiman, 2001).
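The interplay of these two ideas can be illustrated with a small, self-contained sketch. This is not the chapter's implementation: it uses single-split decision stumps in place of full CART trees, draws one random feature subset per tree rather than per node, and all function names (`best_stump`, `fit_forest`, `predict`) are hypothetical. It shows only the structure of bagging plus random feature selection plus majority voting.

```python
import random
from collections import Counter

def best_stump(X, y, feats):
    """Pick the (feature, threshold, polarity) stump with the fewest
    training errors, considering only the given feature subset."""
    best = None  # (error, feature, threshold, polarity)
    for f in feats:
        for t in sorted({row[f] for row in X}):
            for pol in (True, False):
                # predict class 1 when (value > threshold) matches the polarity
                pred = [int((row[f] > t) == pol) for row in X]
                err = sum(p != yi for p, yi in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, t, pol)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_feats=1, seed=0):
    rng = random.Random(seed)
    forest, n = [], len(X)
    for _ in range(n_trees):
        # bagging: draw a bootstrap sample (with replacement) of the training data
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # random feature selection: this tree may split only on a random feature subset
        subset = rng.sample(range(len(X[0])), n_feats)
        forest.append(best_stump(Xb, yb, subset))
    return forest

def predict(forest, row):
    # majority vote across all trees in the forest
    votes = [int((row[f] > t) == pol) for f, t, pol in forest]
    return Counter(votes).most_common(1)[0][0]
```

On a toy data set where the faulty class (label 1) has larger metric values, `predict(fit_forest(X, y), new_row)` aggregates 25 stump votes. Breiman's full algorithm instead grows unpruned trees and repeats the random feature draw at every node; the per-tree draw here is a simplification for brevity.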

RF uses a randomly selected subset of features to split at each node while growing a tree. The main characteristics of RF are (Breiman, 2001):
