Hybrid Representation to Locate Vulnerable Lines of Code

Mohammed Zagane, Mamdouh Alenezi, Mustapha Kamel Abdi
Copyright: © 2022 | Pages: 19
DOI: 10.4018/IJSI.292020

Abstract

Locating vulnerable lines of code in large software systems demands considerable effort from human experts, which explains the high budget and time costs of correcting vulnerabilities. To reduce these costs, automatic vulnerability prediction solutions have been proposed. Existing machine learning (ML)-based solutions predict vulnerabilities only at a coarse granularity and struggle to define suitable code features, which limits their effectiveness. To address these limitations, the authors propose an improved ML-based approach that uses a slice-based code representation and the TF-IDF technique to extract effective features automatically. The results show that combining these two techniques with ML allows building effective vulnerability prediction models (VPMs) that locate vulnerabilities at a finer granularity and with excellent performance (precision > 98%, FNR < 2%, FPR < 3%), which outperforms software metrics and is equivalent to the best-performing recent deep learning-based approaches.

1 Introduction

Cyber attacks are a serious problem that can cause disastrous social, economic, and reputational damage to individuals as well as to large companies and governments. Most of these attacks exploit vulnerabilities that exist in software systems. Hence, detecting and correcting vulnerabilities early, before software delivery, is necessary to prevent attackers from exploiting them. Because manual detection is very hard and very costly in terms of budget and time, especially in large software projects, it is crucial to use automatic solutions that help detect vulnerabilities and minimize human intervention, thereby reducing costs.

Online resources such as public open-source software repositories and public vulnerability databases such as Common Vulnerabilities and Exposures (CVE) and the National Vulnerability Database (NVD) have enabled researchers to prepare labeled data and propose data-driven solutions for Automatic Vulnerability Prediction (AVP). The underlying motivation is twofold: data-driven approaches have been successfully applied to challenging problems in other fields such as image processing and pattern recognition, and static code analyzers have proven to miss many vulnerabilities and incur high false positive rates (Li, Zou, Xu, Ou, et al., 2018). State-of-the-art ML-based approaches in the field of AVP treat vulnerability detection as a supervised classification problem: using ML techniques and manually defined features such as software metrics and Bag-Of-Words (BOW), vulnerability prediction models (VPMs) are built. VPMs can indicate which software components (files, classes, and functions/methods) may contain vulnerabilities; this helps developers focus their efforts on the components most likely to be vulnerable and hence reduces the budget and time needed to detect and correct vulnerabilities.
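
To illustrate this kind of component-level pipeline, the following is a minimal sketch, not the authors' implementation: it assumes a small, hypothetical corpus of labeled functions and uses Bag-Of-Words token counts with a standard scikit-learn classifier.

# A minimal sketch (not the paper's pipeline) of a traditional ML-based VPM:
# functions are represented with manually defined Bag-Of-Words (BOW) features
# and classified as vulnerable or clean. The tiny corpus below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labeled corpus: one entry per function (component-level granularity).
functions = [
    "int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }",      # vulnerable
    "int g(char *s) { char buf[8]; strncpy(buf, s, 7); return 0; }",  # clean
]
labels = [1, 0]  # 1 = vulnerable, 0 = clean

# BOW: raw token counts over the code corpus (a manually defined text feature).
bow = CountVectorizer(token_pattern=r"[A-Za-z_]\w*|\S")
X = bow.fit_transform(functions)

# Any standard supervised classifier can serve as the VPM.
vpm = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# The VPM flags whole components (here, functions) as potentially vulnerable.
new_function = "void h(char *s) { char buf[4]; strcpy(buf, s); }"
print(vpm.predict(bow.transform([new_function])))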

Traditional Machine Learning (ML)-based approaches have limitations induced by the fact that features are manually defined: the quality of the resulting features, and therefore the effectiveness of the resulting detection system, varies with the individuals who define them. Another major drawback is that important semantic characteristics of the code, which may give insight into vulnerabilities, cannot be captured by software metrics or traditional text features such as BOW (Li, Zou, Xu, Ou, et al., 2018). In addition, ML-based approaches suffer from a limitation inherited from the coarse granularity at which software metrics are calculated, which means that vulnerabilities cannot be located at a finer granularity. In a previous work (Zagane et al., 2020b), the authors tried to improve the software metrics-based approach and combat the coarse-granularity limitation by calculating metrics at the slice granularity, which improved the performance of the proposed VPMs and allowed vulnerable lines to be located much more precisely. The present study improves on that previous work: instead of using software metrics, a type of manually defined feature, the authors propose to build VPMs using automatically extracted features based on a slice-based code representation and the TF-IDF technique.
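
As a concrete illustration of this feature-extraction idea, the following sketch assumes that program slices have already been extracted as short sequences of code lines (the slicing step itself is not shown); each slice is turned into a TF-IDF vector and classified, and a slice flagged as vulnerable points back to its few constituent lines, giving the finer granularity discussed above. The data and classifier choice are illustrative only, not the authors' exact setup.

# A minimal sketch of slice-level vulnerability prediction with TF-IDF features,
# assuming pre-extracted program slices. TF-IDF replaces manually defined
# software metrics; the dataset below is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical pre-extracted slices (lists of code lines) with labels.
slices = [
    ["char buf[8];", "gets(buf);", "printf(buf);"],        # vulnerable
    ["char buf[8];", "fgets(buf, sizeof(buf), stdin);"],   # clean
]
labels = [1, 0]

# Treat each slice as one "document"; TF-IDF weights code tokens automatically.
docs = [" ".join(lines) for lines in slices]
tfidf = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*|\S")
X = tfidf.fit_transform(docs)

vpm = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# A positive prediction narrows the search to the handful of lines in the slice.
new_slice = ["char name[16];", "strcpy(name, user_input);"]
print(vpm.predict(tfidf.transform([" ".join(new_slice)])))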

In recent Deep Learning (DL)-based approaches, researchers have adopted techniques inspired by the field of Natural Language Processing (NLP), such as word embedding. In these approaches, word embedding transforms source code into vectors suitable as inputs for Deep Neural Networks (DNNs). The DNNs then learn deep features from the vectorized code; that is, the final features used to classify the code as vulnerable or clean are learned via the hidden layers of the DNN (generally Long Short-Term Memory (LSTM) or convolutional layers). In the present study, by contrast, the final features are extracted directly using a method based on the TF-IDF technique, without the need for DNNs, which require challenging hyperparameter tuning and huge computational power to train.
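
For contrast with the TF-IDF approach, the following is a rough sketch of the DL pipeline described above: tokenized code passes through a word-embedding layer and an LSTM that learn the classification features end-to-end. Vocabulary size, sequence length, and the random toy data are placeholders, not values from any cited work.

# A rough sketch of the DL-based alternative (for contrast only): an embedding
# layer plus an LSTM learn the features end-to-end, unlike the direct TF-IDF
# extraction used in the present study. All settings below are placeholders.
import numpy as np
import tensorflow as tf

vocab_size, max_len = 5000, 100                            # assumed tokenizer settings
X = np.random.randint(1, vocab_size, size=(64, max_len))   # toy token-id sequences
y = np.random.randint(0, 2, size=(64,))                    # toy vulnerable/clean labels

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),              # word embedding of code tokens
    tf.keras.layers.LSTM(64),                                # deep-learned features
    tf.keras.layers.Dense(1, activation="sigmoid"),          # vulnerable / clean
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, verbose=0)  # real training requires tuning and far more compute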

The present study makes the following two contributions:
