1. Introduction
A software defect is an issue whose presence makes a software product behave abnormally. Software defect prediction is now regarded as an application area of big data analysis, since large amounts of unlabeled software defect metric data can be collected at low cost. However, the challenge of how to exploit this unlabeled data to predict software defects has attracted many researchers over the past few years.
Because software defects can cause serious financial and human losses in today’s software-intensive systems, early detection of defect-prone modules before the release of any new software has attracted considerable attention (Ryu et al., 2015; Kamei and Shihab, 2016). Although much research on achieving quality software is ongoing, poor software reliability remains a major concern for researchers, owing to inherent problems in any one or all of the following: (1) no clear understanding of the requirements, for example, a poor understanding of which subset of attributes is generally responsible for software defects makes it difficult to choose the right measures for the analysis (Emam et al., 2001; Sandhu et al., 2010); (2) coding errors (Menzies et al., 2007); (3) the economics of software defect prediction, where classifying a non-defective module as defective can be costly (Jiang et al., 2008); (4) insufficient software testing before release (Keiller and Miller, 1991); (5) the class imbalance problem, where one class dominates the other in the dataset (Batista et al., 2004); and (6) issues with cross-project defect prediction, where the model is built on one project and examined on a different company’s project (Turhan et al., 2009; Watanabe et al., 2008). It is also held that software defect prediction should be carried out as a whole rather than by investigating each component in isolation, and that design choices should be made judiciously to avoid loss of generality and meaningless results.
There are two ways to model software defect prediction: a static model and a dynamic model. The static model predicts the number of software defect instances in a subsequent software project based on various characteristics and metrics of the software product. In contrast, the dynamic model predicts which software modules will be defect-prone in a subsequent time interval by looking at present and past defects.
Since quality data are scarce (the available data are usually noisy and also suffer from the class imbalance problem), many researchers have proposed data pre-processing followed by machine learning to obtain better software defect prediction models (Rathore and Kumar, 2017; Khoshgoftaar et al., 2010). Researchers have also argued for a good machine learning model for classifying defective and non-defective software modules, on the consideration that misclassifying a defective module as non-defective costs more than the opposite error, and that the time taken to test the classifier is also a concern (Khemchandani and Chandra, 2007; Tian et al., 2014; Tomar and Agarwal, 2015).
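To illustrate why asymmetric misclassification cost matters on class-imbalanced defect data, the following is a minimal sketch rather than a method from the cited works. The function names, the 5:1 cost ratio, and the toy labels are all assumptions chosen for illustration. It compares plain accuracy with a cost-weighted error count for two hypothetical classifiers.

```python
def accuracy(y_true, y_pred):
    """Fraction of modules whose label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def misclassification_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Total cost of errors, with 1 = defective and 0 = non-defective.

    cost_fn: assumed cost of predicting a defective module as non-defective.
    cost_fp: assumed cost of predicting a non-defective module as defective.
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp


# Hypothetical imbalanced dataset: 2 defective modules out of 20.
y_true = [1] * 2 + [0] * 18

# Classifier A labels every module non-defective (misses both defects).
predict_all_clean = [0] * 20

# Classifier B catches both defects but raises 3 false alarms.
predict_cautious = [1] * 5 + [0] * 15

for name, y_pred in [("all-clean", predict_all_clean),
                     ("cautious", predict_cautious)]:
    print(name,
          "accuracy:", accuracy(y_true, y_pred),
          "cost:", misclassification_cost(y_true, y_pred))
```

Under these assumed costs, the classifier that labels everything non-defective has the higher accuracy but also the higher total cost, which is why accuracy alone can be misleading when defective modules are rare.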