A Framework for Homogeneous Cross-Project Defect Prediction

Lipika Goel, Mayank Sharma, Sunil Kumar Khatri, D. Damodaran
Copyright: © 2021 |Pages: 17
DOI: 10.4018/IJSI.2021010105
Abstract

Prior defect data for a project is often unavailable, which has led researchers to ask whether defect data from other projects can be used for prediction. This has made cross-project defect prediction an open research issue. In this approach, the training data often suffers from the class imbalance problem. This work focuses on homogeneous cross-project defect prediction. A novel ensemble model that works in two stages is proposed. First, it handles the class imbalance problem of the dataset. Second, it predicts the target class. To handle the imbalance, the training dataset is divided into data frames, and each data frame is balanced. An ensemble model using maximum voting over all the random forest classifiers is implemented. The proposed model performs better than the baseline models. A Wilcoxon signed-rank test is performed to validate the proposed model.
1. Introduction

Software testing is one of the most vital, yet costly, activities in the software development process. It is essential to manage the limited resources available, such as workforce, time, and budget. Software defect prediction models are useful here because they identify the parts of the software that are more likely to produce errors and therefore need more attention. Software defect prediction is currently one of the most active topics in the software engineering domain. Studies of prediction models state that past data on bugs in a particular software project can predict defects in its upcoming improved versions. This approach is termed Within-Project Defect Prediction (WPDP). The predictive power of such a model depends on the training data and on the machine learning techniques used. WPDP models rely on a project's historical data, but only a few companies maintain clear past records. WPDP therefore has a drawback when a project has only limited historical bug-related data. Because of its wide applicability in such cases, Cross-Project Defect Prediction has attracted researchers, as it assembles its training set from data that already exists for other projects.

To address this problem, researchers have applied defect prediction across projects, building a model on one project and using it to predict defects in another. This approach is known as Cross-Project Defect Prediction (CPDP). The main aim of CPDP is to predict bug-prone instances (such as classes) in a project based on data collected from other projects. CPDP is broadly classified into Homogeneous and Heterogeneous CPDP. When the source (training) project has the same set of features as the target project, it is known as Homogeneous CPDP; when the source and target projects have different metrics or features, it is termed Heterogeneous CPDP. The feasibility and potential usefulness of CPDP built with a number of software metrics have been validated, but how to improve the performance of CPDP models is still an open issue. Various studies have also concluded that selecting a suitable training data set can improve the performance of a defect prediction model. Hence, training data selection from widely available public repositories is an important research area in CPDP. Because the data in CPDP is collected from different projects, the numbers of defective and non-defective instances are imbalanced. This leads to improper training of the classification model. Such models are usually biased, which degrades prediction performance. Figure 1 shows the difference between WPDP, Homogeneous CPDP, and Heterogeneous CPDP.

Figure 1. Within-project, homogeneous, and heterogeneous cross-project defect prediction
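The distinction between the homogeneous and heterogeneous settings can be illustrated with a minimal sketch. It assumes each project is described by a list of metric names; the function name `cpdp_setting` and the metric names in the usage note are hypothetical and are not taken from the paper.

```python
def cpdp_setting(source_metrics, target_metrics):
    """Classify a source/target project pair by comparing their metric sets.

    Hypothetical helper: the pair is 'homogeneous' when both projects expose
    the same set of features, and 'heterogeneous' otherwise.
    """
    if set(source_metrics) == set(target_metrics):
        return "homogeneous"
    return "heterogeneous"
```

For example, `cpdp_setting(["loc", "wmc", "cbo"], ["cbo", "loc", "wmc"])` reports a homogeneous pair because column order does not matter, while dropping or renaming any one metric makes the pair heterogeneous.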

The objective of this work is to propose a novel defect prediction ensemble model that works in a two-fold manner. First, it handles the imbalanced nature of the dataset: it partitions the training data into seven data frames, each with an approximately equal number of defect-prone and non-defect-prone classes. A Random Forest classifier is trained on each of the seven data frames, and an ensemble model based on maximum voting over the seven Random Forests is built. Second, the proposed model performs cross-project defect prediction in addition to handling the class imbalance problem. Seven-fold cross-validation is performed to evaluate the training accuracy of the proposed model. Finally, a Wilcoxon signed-rank test is performed to validate the model. The research question addressed in this paper is:

RQ. Does the proposed ensemble model outperform the existing models?
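A question of this kind is typically answered by comparing paired scores of two models on the same target projects, which is what the Wilcoxon signed-rank test does. The sketch below computes only the signed-rank sums, not a p-value. The function name is hypothetical, and any scores passed to it are illustrative inputs rather than results from the paper.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus) for the paired samples a and b.

    Zero differences are dropped, and tied absolute differences share the
    average of the ranks they span. This is the test statistic only.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of equal absolute differences starting at position i.
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for t in range(i, j):
            ranks[t] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus
```

A large gap between the two sums suggests that one model scores consistently higher than the other. In practice a library routine such as `scipy.stats.wilcoxon` would be used to obtain the p-value; the sketch avoids that dependency so that the ranking step stays visible.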

The significant contributions of this paper are:

  1. To develop a model to handle the class imbalance problem in CPDP.

  2. To develop a standalone ensemble framework for cross project defect prediction.
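The two stages described above, balanced partitioning followed by maximum voting, can be sketched as follows. This preview does not say how each data frame is balanced, so the sketch makes an assumption: the majority class is dealt into k chunks, and each chunk is paired with an equally sized random draw from the minority class. The function names and the default `k=7` wiring are illustrative, and this is not the authors' implementation.

```python
import random
from collections import Counter


def balanced_frames(labels, k=7, seed=0):
    """Split sample indices into k frames with equal counts of both classes.

    Assumption (not from the paper): the majority class is shuffled and dealt
    into k chunks; each chunk is paired with a same-sized random draw from the
    minority class (with replacement only if the minority class is smaller).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")
    minority, majority = sorted(by_class.values(), key=len)
    rng.shuffle(majority)
    frames = []
    for i in range(k):
        chunk = majority[i::k]  # every k-th majority index, starting at i
        if len(minority) >= len(chunk):
            draw = rng.sample(minority, len(chunk))
        else:
            draw = rng.choices(minority, k=len(chunk))
        frames.append(chunk + draw)
    return frames


def majority_vote(predictions):
    """Combine per-model label lists into one label per instance by voting."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

Each frame returned by `balanced_frames` would be used to train one classifier (for instance scikit-learn's `RandomForestClassifier`), and the seven lists of target-project predictions would then be passed to `majority_vote`. With an odd number of binary classifiers, the vote cannot tie.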
