Introduction
Software Defect Prediction (SDP) models enable quality support teams to predict defect-prone software artifacts in advance, which in turn helps in effective resource allocation and utilization. The development of an SDP model starts with data collection from various software repositories. The collected data is expected to be complete; in practice, however, it is sometimes incomplete and noisy. Incompleteness may result from the cross-company nature of the dataset or from computational errors (García et al., 2015; Lakshminarayan et al., 1999). The performance of an SDP model developed from an incomplete dataset is questionable, and many machine-learning algorithms cannot process such datasets at all.
There are several ways to deal with an incomplete dataset: deletion techniques, toleration techniques, and imputation techniques. Deletion techniques remove every instance that contains a missing value, which results in the loss of important data. In toleration techniques, missing values are replaced by mean or mode values, which is also not the best alternative. Imputation is the most appropriate technique: it estimates missing values by analyzing the observed (available) data. Researchers have proposed various imputation algorithms that are based on the accuracy of classifiers trained using imputed values in the training dataset (Batista and Monard, 2003; Farhangfar et al., 2007; Saar-Tsechansky and Provost, 2007). Most imputation algorithms (Ma et al., 2006; Song et al., 2011) are trained under supervised learning, using the complete dataset as training data to compute missing values in the test dataset. Thus, the quality of the complete dataset affects the performance of the imputation technique, which in turn affects the performance of the SDP model.
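To make the contrast between the three families concrete, the following minimal Python sketch applies each of them to a toy metric table. The data values, the function names, and the choice of a one-nearest-neighbor donor as the imputation rule are illustrative assumptions; they are not the algorithms of the works cited above.

```python
from statistics import mean

# Hypothetical toy data: each row is [lines_of_code, complexity];
# None marks a missing value.
rows = [
    [100.0, 10.0],
    [110.0, None],
    [300.0, 30.0],
    [310.0, 32.0],
]

def listwise_deletion(data):
    """Deletion: drop every instance that contains a missing value."""
    return [list(r) for r in data if None not in r]

def mean_replacement(data):
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(data[0])
    col_means = [mean(r[j] for r in data if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in data]

def nearest_neighbor_imputation(data):
    """Imputation (assumed 1-NN variant): copy each missing value from the
    most similar complete instance, measured on the observed columns only."""
    complete = listwise_deletion(data)
    result = []
    for r in data:
        observed = [j for j, v in enumerate(r) if v is not None]
        donor = min(complete,
                    key=lambda c: sum((c[j] - r[j]) ** 2 for j in observed))
        result.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return result

print(listwise_deletion(rows))            # the incomplete row is lost
print(mean_replacement(rows)[1])          # filled with the column mean
print(nearest_neighbor_imputation(rows)[1])  # filled from a similar instance
```

On this toy table, mean replacement assigns the incomplete module the average complexity of all modules, whereas the neighbor-based rule borrows the value from the module of similar size. This is the sense in which imputation analyzes the observed data rather than applying one global constant.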
Data preprocessing techniques address the other issue with collected data, namely noise. Instance selection and feature selection are two significant data preprocessing steps, which aim to eliminate noisy data and reduce the size of the dataset by filtering out non-relevant software metrics (Gupta, 2013; Gao and Khoshgoftaar, 2014; García et al., 2015; Kale). Feature selection methods select the most relevant software metrics, those that contribute most to the prediction process. Instance selection methods select the most relevant instances, those that contribute to the prediction process.
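The two preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: feature selection is approximated by ranking metrics on their absolute Pearson correlation with the defect label, and instance selection by a Wilson-style edited nearest-neighbor filter that drops instances whose label disagrees with the majority of their k nearest neighbors. The text above does not prescribe these particular methods, and the data and function names are hypothetical.

```python
from math import sqrt

# Hypothetical toy data: rows are modules, columns are software metrics.
# The third metric is constant, and the last module carries a suspicious label.
X = [
    [10.0, 5.0, 1.0],
    [12.0, 3.0, 1.0],
    [11.0, 4.0, 1.0],
    [30.0, 4.0, 1.0],
    [33.0, 6.0, 1.0],
    [31.0, 5.0, 1.0],
    [11.0, 5.0, 1.0],
]
y = [0, 0, 0, 1, 1, 1, 1]  # 1 = defective

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return 0.0 if va == 0 or vb == 0 else cov / sqrt(va * vb)

def select_features(X, y, k):
    """Return the indices of the k metrics most correlated with the label."""
    scores = [abs(pearson([r[j] for r in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def select_instances(X, y, k=3):
    """Keep the indices of instances whose label matches the majority label
    of their k nearest neighbors (squared Euclidean distance)."""
    keep = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])))
        votes = [y[j] for j in others[:k]]
        majority = max(set(votes), key=votes.count)
        if majority == y[i]:
            keep.append(i)
    return keep

print(select_features(X, y, 2))   # indices of the retained metrics
print(select_instances(X, y, 3))  # indices of the retained modules
```

In this toy case the constant metric is discarded by feature selection, and the last module, which sits among non-defective modules but is labelled defective, is discarded by instance selection. The first step shrinks the dataset by columns and the second by rows.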
In this study, we investigate the prediction capability of an SDP model when either instance selection or feature selection is performed as an additional step in combination with the imputation technique. In other words, we examine which of the two, instance selection or feature selection, is more advantageous when building an SDP model from a dataset with missing values.