1. Introduction
A software defect is an error in the software that degrades the overall quality of a software product (Tomar & Agarwal, 2016). Defects typically arise from a lack of coding experience, misunderstanding of the requirements, and poor software testing skills. Software defect prediction (SDP) is a process that predicts the occurrence of defects before they are actually discovered, thereby helping to prioritize the software quality assurance effort and reduce the overall development cost of the software. SDP is important for optimizing and streamlining the software testing process because it helps identify the software components that are likely to contain defects more effectively (S. Wang & Yao, 2013). These advantages have attracted many researchers to SDP models. Most SDP models are developed using machine learning techniques, with the aim of increasing the cost-effectiveness of the quality assurance process. However, the performance of traditional SDP models is adversely affected by the imbalanced nature of software defect datasets (Bowes et al., 2014; Menzies et al., 2007).
Various software defect datasets are publicly available for training SDP models. However, in most cases there is a large disproportion between the number of defective and non-defective instances in these datasets, which leads to the class imbalance problem (Bowes et al., 2014; Menzies et al., 2007; S. Wang & Yao, 2013); that is, the software defect datasets contain many more non-defective instances than defective ones. Hence, the non-defective instances form the majority class and the defective instances form the minority class. On class-imbalanced datasets, most machine learning algorithms tend to become biased towards the majority non-defective class, so the minority defective class instances, which are of more interest, are often misclassified (Seiffert et al., 2009). Such misclassifications can prove costly, especially in software development, where the minority defective class is the one of highest interest from the learning point of view and also implies a great cost if not classified well (Bhat & Farooq, 2021). As a result, the trained SDP models do not work effectively and realistically in the prediction process. Song et al. (2018) point out that there is an inverse relationship between the class imbalance ratio and the performance of traditional SDP models, and further found that imbalanced learning techniques, in the right combination with the classifier, can mitigate the adverse effect of the class imbalance problem.
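The bias towards the majority class can be illustrated with a minimal sketch. The dataset below is hypothetical (90 non-defective and 10 defective modules, chosen only for illustration), and the imbalance ratio is assumed here to be the majority count divided by the minority count, which may differ from the definition used in the cited studies. The helper names `imbalance_ratio`, `accuracy`, and `recall` are likewise illustrative and are not taken from any cited work.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive (defective) instances that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("no positive instances in y_true")
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical dataset: 0 = non-defective (majority), 1 = defective (minority).
y_true = [0] * 90 + [1] * 10

# A trivial baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

ratio = imbalance_ratio(y_true)
acc = accuracy(y_true, y_pred)
defect_recall = recall(y_true, y_pred, positive=1)

print(f"imbalance ratio: {ratio}")
print(f"accuracy:        {acc}")
print(f"defect recall:   {defect_recall}")
```

Under these assumptions the baseline reaches an accuracy of 0.9 while detecting none of the defective modules (recall of 0.0), which is why accuracy alone can hide the misclassification of the minority class.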