1. Introduction
Software testing is one of the most vital, yet most costly, activities in the software development process. Limited resources such as workforce, time, and budget must be managed carefully. Software defect prediction models help here by identifying the parts of the software that are most likely to contain errors and therefore need closer attention. Software defect prediction is currently one of the most active topics in the software engineering domain. Studies of prediction models indicate that past data on bugs in a particular software project can be used to predict defects in its subsequent, improved versions. This approach is termed Within-Project Defect Prediction (WPDP). The predictive power of such a model is driven by the characteristics of the training data and by the machine learning technique used. WPDP models rely on a project's historical data, but only a few companies maintain clear historical records. WPDP is therefore of limited use when a project has little historical bug data. Cross-Project Defect Prediction is more widely applicable because it assembles its training set from existing projects, and it has consequently attracted researchers' attention.
To address this problem, researchers have applied defect prediction across projects by building a model on one project and using it to predict defects in another. This approach is known as Cross-Project Defect Prediction (CPDP). The main aim of CPDP is to predict bug-prone instances (such as classes) in a project based on data collected from other projects. CPDP is broadly classified into homogeneous and heterogeneous CPDP. When the source (training) project has the same set of features as the target project, the setting is homogeneous CPDP; when the source and target projects have different metrics or features, it is heterogeneous CPDP. The feasibility and potential usefulness of CPDP models built with a number of software metrics have been validated, but how to improve their performance is still an open issue. Various studies have also concluded that selecting a suitable training data set can improve a model's defect prediction performance. Hence, training data selection from widely available public repositories is an important research area in CPDP. Because CPDP data is collected from different projects, the numbers of defective and non-defective instances are imbalanced. This imbalance leads to improper training of the classification model; such models are usually biased, which degrades prediction performance. Figure 1 shows the difference between WPDP, homogeneous CPDP, and heterogeneous CPDP.
Figure 1. Within-project, homogeneous, and heterogeneous cross-project defect prediction
The objective of this work is to propose a novel ensemble model for defect prediction that serves two purposes. First, it handles the imbalanced nature of the dataset: the training data is partitioned into seven data frames, each containing an approximately equal number of defect-prone and non-defect-prone classes. A Random Forest is trained on each of the seven data frames, and the ensemble prediction is obtained by majority (maximum) voting over the seven Random Forests. Second, the proposed model performs cross-project defect prediction while handling the class imbalance problem. Seven-fold cross-validation is performed to evaluate the training accuracy of the proposed model. Finally, the Wilcoxon signed-rank test is performed to establish the validity of the model. The research question addressed in this paper is:
RQ. Does the proposed ensemble model outperform the existing models?
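The partition-and-vote procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one plausible reading of the partitioning step: the majority (non-defective) instances are split into seven disjoint chunks, and each chunk is paired with all minority (defective) instances, which gives roughly balanced partitions only when the imbalance ratio is close to seven to one. The function names, the toy data, and the `OneFeatureStump` learner are hypothetical; the stump merely stands in for the Random Forest so that the sketch runs with the standard library alone.

```python
import random
from collections import Counter


def balanced_partitions(X, y, n_parts=7, seed=0):
    """Pair every minority instance with one disjoint chunk of the majority class.

    Assumption: this is one possible way to obtain partitions with roughly
    equal class counts; the source does not specify the exact procedure.
    """
    rng = random.Random(seed)
    minority = min(Counter(y).items(), key=lambda kv: kv[1])[0]
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    rng.shuffle(maj_idx)
    parts = []
    for k in range(n_parts):
        idx = min_idx + maj_idx[k::n_parts]  # k-th disjoint majority chunk
        parts.append(([X[i] for i in idx], [y[i] for i in idx]))
    return parts


def fit_ensemble(parts, make_model):
    """Train one base learner per partition; make_model() returns a fresh learner."""
    return [make_model().fit(Xp, yp) for Xp, yp in parts]


def predict_majority(models, X):
    """Return, for each row of X, the label predicted by most base learners."""
    votes = [list(m.predict(X)) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]


class OneFeatureStump:
    """Hypothetical stand-in for a Random Forest (labels must be 0 and 1).

    Thresholds feature 0 at the midpoint of the two class means and assumes
    defective instances (label 1) have the larger values.
    """

    def fit(self, X, y):
        pos = [row[0] for row, label in zip(X, y) if label == 1]
        neg = [row[0] for row, label in zip(X, y) if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [1 if row[0] > self.threshold else 0 for row in X]


# Toy data: 70 non-defective (0) and 10 defective (1) instances, one metric each.
X = [[float(i)] for i in range(80)]
y = [1 if i >= 70 else 0 for i in range(80)]

parts = balanced_partitions(X, y)
models = fit_ensemble(parts, OneFeatureStump)
preds = predict_majority(models, [[5.0], [78.0]])
print(preds)
```

With an odd number of voters and two classes, the vote can never tie. Any learner that follows the same `fit`/`predict` convention could be passed as `make_model` in place of the stump, for example scikit-learn's `RandomForestClassifier` if that library is available.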
The significant contributions of this paper are: