Class Imbalance Learning to Heterogeneous Cross-Software Projects Defect Prediction

Rohit Vashisht, Syed Afzal Murtaza Rizvi
Copyright: © 2022 |Pages: 18
DOI: 10.4018/IJSI.292021

Abstract

Heterogeneous Cross-Project Defect Prediction (HCPDP) attempts to forecast defects in a software application that has insufficient historical defect data. From a Class Imbalance Problem (CIP) perspective, however, one should have a clear view of the data distribution in the training dataset; otherwise the trained model will produce biased classification results. Class Imbalance Learning (CIL) is the process of achieving an equilibrium ratio between the two classes in an imbalanced dataset. A range of effective solutions exists for managing CIP, including resampling techniques such as Over-Sampling (OS) and Under-Sampling (US). The proposed research work employs the Synthetic Minority Oversampling TEchnique (SMOTE) and Random Under-Sampling (RUS) to handle CIP. In addition, the paper proposes a novel four-phase HCPDP model and compares the performance of the basic HCPDP model with CIP present against its performance after CIP is handled using SMOTE and RUS, across three prediction pairs. Results show that training performance improves substantially with SMOTE, whereas RUS yields varying results for HCPDP across the three prediction pairs.

Introduction

The prime objective of any software development model is to ensure that the final product or service meets the level of quality required by the end user; the activity of ensuring this is called Software Quality Assurance (SQA). Any deviation of actual results from expected results under a preset environmental configuration can be described as a defect with respect to end-user specifications. Testing is the most important stage of the Software Development Life Cycle (SDLC), as it consumes a large proportion of a project's total cost. This stage should therefore receive priority in every software development cycle. The only way to address this problem is Software Defect Prediction (SDP) performed at the right time.

The traditional approach to Defect Prediction (DP) identifies "Within-Project" defects by splitting the available defect dataset into two subsets: the DP model is trained on one subset (the labeled instances), and the other subset is used to test the model, that is, to label previously unlabeled instances in the target application's dataset as defective or non-defective (Ambros et al. (2012)). Cross Project Defect Prediction (CPDP) is a research field in which a software project lacking sufficient local defect data uses data from other projects to build an effective and efficient defect predictor. Clearly, cross-project information needs to be listed beforehand so that CPDP can be applied locally (Han et al. (2011)). Homogeneous CPDP gathers common software metrics/features from both the parent application (whose defect data is used to train the DP model) and the target application (for which the DP model is designed) (He et al. (2014)). In HCPDP, by contrast, the datasets of a prediction pair are not required to share common metrics. Matched metrics can be found by measuring the correlation coefficient for all possible combinations of software features between the two applications. To predict defects across heterogeneous projects, the feature pairs whose values vary in a similar way are treated as the common features of the two datasets. In this article, the authors attempt to forecast defects in a software application whose features are entirely heterogeneous from the feature set of the source application and which also lacks the defect data needed to construct an effective DP model. Figure 1 shows the distinction between Homogeneous CPDP and Heterogeneous CPDP.
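The metric-matching idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: because source and target projects have different numbers of instances, each metric column is summarised by `k` order statistics before Pearson correlation is computed, and pairs are then selected greedily above a cutoff. The function names, `k`, the `cutoff` value, and the greedy one-to-one selection are all hypothetical choices made for the example.

```python
import math
from itertools import product


def quantiles(values, k):
    """Summarise a metric column by k evenly spaced order statistics (k >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[round(i * (n - 1) / (k - 1))] for i in range(k)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def match_metrics(source, target, k=10, cutoff=0.9):
    """Pair source and target metrics whose value distributions correlate strongly.

    source, target: dict mapping metric name -> list of numeric values.
    Returns a list of (source_metric, target_metric, score) tuples.
    Assumption: each metric is used in at most one pair (greedy selection).
    """
    scored = []
    for (s_name, s_vals), (t_name, t_vals) in product(source.items(), target.items()):
        score = pearson(quantiles(s_vals, k), quantiles(t_vals, k))
        if score >= cutoff:
            scored.append((score, s_name, t_name))
    scored.sort(reverse=True)  # strongest candidate pairs first

    used_source, used_target, matches = set(), set(), []
    for score, s_name, t_name in scored:
        if s_name not in used_source and t_name not in used_target:
            matches.append((s_name, t_name, score))
            used_source.add(s_name)
            used_target.add(t_name)
    return matches
```

Calling `match_metrics(source_metrics, target_metrics)` with two dictionaries of metric columns returns the candidate matched pairs; a rank-based coefficient could be substituted for `pearson` if the metrics are heavily skewed.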

Figure 1. Classification of Cross Software Projects Defect Prediction

The proposed research work offers a novel four-phase HCPDP model to address this problem. It also considers the uneven ratio of instances between the two classes in the training dataset of a binary classification problem, known as the Class Imbalance Problem (CIP). To resolve this issue, the proposed work employs resampling techniques to achieve CIL for an imbalanced training dataset: SMOTE is used as the OS technique and RUS as the US technique to tackle skewness in the distribution of instances in the training dataset. The article's motivation is to investigate and fix CIP in imbalanced datasets in order to boost the accuracy of an SDP model. The paper addresses the following research questions.

  • RQ1. How does the performance of the proposed HCPDP model with CIP present compare with its performance after CIP is handled using SMOTE & RUS?

  • RQ2. Which of the two resampling techniques (SMOTE & RUS) gives the better outcome?

  • RQ3. How do the defect prediction results compare between the two categories, Within-Project Defect Prediction (WPDP) & HCPDP?
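The two resampling techniques named above can be illustrated with a minimal pure-Python sketch. It assumes binary labels with 1 for defective and 0 for non-defective, balances the classes to a 1:1 ratio, and uses squared Euclidean distance to find neighbours; the function names and the defaults `k=5` and `seed=0` are hypothetical, and the preview does not state the settings the authors used.

```python
import random


def random_under_sample(X, y, seed=0):
    """RUS: randomly discard majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {c: [i for i, label in enumerate(y) if label == c] for c in (0, 1)}
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    kept = by_class[minority] + rng.sample(by_class[majority], len(by_class[minority]))
    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


def smote(X, y, k=5, seed=0):
    """SMOTE: add synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in (0, 1)}
    minority = min(counts, key=counts.get)
    rows = [x for x, label in zip(X, y) if label == minority]
    if len(rows) < 2:
        raise ValueError("SMOTE needs at least two minority-class rows")
    needed = counts[1 - minority] - counts[minority]

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    new_X, new_y = list(X), list(y)
    for _ in range(needed):
        i = rng.randrange(len(rows))
        base = rows[i]
        others = [rows[j] for j in range(len(rows)) if j != i]
        neighbours = sorted(others, key=lambda r: dist2(base, r))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        new_X.append([b + gap * (o - b) for b, o in zip(base, other)])
        new_y.append(minority)
    return new_X, new_y
```

RUS shrinks the training set to twice the minority-class size, so it discards information from the majority class, while SMOTE grows the set with interpolated rows; comparing a model trained on each output relates to RQ1 and RQ2.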
