Introduction
Open source software projects are usually released while they still contain bugs, so projects use bug tracking systems such as Bugzilla to manage bug fixes during the maintenance phase (Hamdy & El-Laithy, 2020). When a user or a developer finds a bug, it is reported through the bug tracking system by means of a bug report: a natural-language description of the bug, sometimes accompanied by copied stack traces. If the bug is confirmed, it is assigned to a developer (the bug fixer) to fix it (Hamdy & El-Laithy, 2019). The bug fixer searches the project's source code repository to locate the faulty source files in order to fix the bug; this process is called bug localization. Bug localization is a time-consuming task, especially for large software projects, for two reasons. First, it is hard to locate the faulty source file(s). Second, there is usually a large number of bugs; for example, in the early releases of Mozilla and Eclipse, about 170 and 120 bugs, respectively, were reported daily (Hamdy & El-Laithy, 2020).
Information retrieval (IR) techniques have been widely used to automate the bug localization task: the submitted bug report is treated as a query, and the top N most similar source files are retrieved and ranked (Akbar & Kak, Jun. 2020). IR-based bug localization approaches can be used with software bug repositories of different sizes; however, their performance depends on the features extracted from the source files and bug reports, such as: 1) textual and semantic similarities between a submitted bug report and the source files, 2) similarity between a submitted bug report and previously fixed ones, and 3) the change history of the source code files. Authors have leveraged one or more of these features (Zhou, Zhang, & Lo, 2012), (Zhou, Tong, Chen, & Han, Aug. 2017). The approach proposed by (Gharibi, Rasekh, Sadreddini, & Fakhrahmad, Nov. 2018) is one of the most comprehensive, as it uses most of the important features except the source code change history. The change history is a very important feature, because source files that have been modified are likely to contain bugs and may grow complex. Furthermore, modifications are usually implemented under strict deadlines to reduce cost, so developers often do not follow clean-code guidelines, which leads to code smells and, consequently, to bugs (Hamdy & Tazy, 2020).
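The retrieval step described above, in which a bug report acts as a query against the source files, can be illustrated with a minimal sketch. It assumes a plain TF-IDF weighting with cosine similarity, which is only one of many possible textual similarity measures and is not necessarily the one used by any of the cited approaches. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_files`) and the toy file contents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    # Smoothed IDF, so terms that occur in every document keep a small weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(bug_report, source_files, top_n=3):
    """Return the top_n (file name, score) pairs most similar to the bug report."""
    names = list(source_files)
    docs = [tokenize(source_files[name]) for name in names]
    docs.append(tokenize(bug_report))  # the query is treated as the last document
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy corpus (hypothetical): each "file" is reduced to a bag of identifiers.
source_files = {
    "Parser.java": "class Parser parse token stream syntax error",
    "Renderer.java": "class Renderer draw window paint image",
    "Cache.java": "class Cache store lookup evict entry",
}
report = "Crash with syntax error when parsing the token stream"
for name, score in rank_files(report, source_files):
    print(name, round(score, 3))
```

In this toy corpus only `Parser.java` shares terms with the report, so it is ranked first and the other two files receive a score of zero. The sketch covers only the first feature listed above (textual similarity); similarity to previously fixed reports and change history would have to be added as further scoring terms.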
Deep learning (DL) techniques have advanced performance in several fields, including software engineering tasks such as bug severity prediction (Hamdy & Ezzat, 2020) and code smell detection (Hamdy & Tazy, 2020). Following this breakthrough, several DL-based bug localization approaches have been proposed in the literature (Y. Xiao, 2019), (Liang, Sun, Wang, & Yang, 2019), (Sanglea, Muvvaa, Chimalakondaa, Ponnalagub, & Venkoparao, 2020). In these approaches, a deep learning model is trained to classify source files as faulty or not with regard to a submitted bug report. However, these models require a large amount of historical data (previously fixed bug reports) for training, so that the trained model does not overfit. Consequently, DL-based approaches can be used only with very large software bug repositories that include a vast number of previously fixed bug reports. Some authors (Sanglea, Muvvaa, Chimalakondaa, Ponnalagub, & Venkoparao, 2020) used oversampling techniques to generate synthetic data for training the DL model.
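To make the idea of oversampling concrete, the following is a minimal sketch of how synthetic minority-class samples could be generated when faulty (report, file) pairs are scarce. It assumes numeric feature vectors and a simplified SMOTE-style interpolation between two randomly chosen minority samples; the text above does not specify which oversampling technique the cited authors used, so the function `oversample_minority`, its parameters, and the toy data are purely illustrative.

```python
import random


def oversample_minority(samples, labels, minority_label=1, seed=0):
    """Add synthetic minority samples until both classes have the same size.

    Each synthetic sample is a random point on the line segment between two
    randomly chosen minority samples (a simplification of SMOTE, which would
    interpolate towards nearest neighbours instead).
    """
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    majority_count = len(samples) - len(minority)
    needed = max(0, majority_count - len(minority))

    new_samples, new_labels = list(samples), list(labels)
    for _ in range(needed):
        a = rng.choice(minority)
        b = rng.choice(minority)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic = [x + t * (y - x) for x, y in zip(a, b)]
        new_samples.append(synthetic)
        new_labels.append(minority_label)
    return new_samples, new_labels


# Toy data (hypothetical): label 1 marks the rare "faulty" class.
features = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 0, 0, 1, 1]
balanced_features, balanced_labels = oversample_minority(features, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Synthetic samples of this kind balance the class distribution but add no genuinely new information, which is one reason such models still benefit from large repositories of previously fixed bug reports.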