Automatically Labelled Software Topic Model

Youcef Bouziane, Mustapha Kamel Abdi, Salah Sadou
Copyright: © 2020 | Pages: 22
DOI: 10.4018/IJOSSP.2020010104

Abstract

Public software repositories (SR) maintain a massive amount of valuable data, offering opportunities to support software engineering (SE) tasks. Researchers have applied information retrieval techniques to mining software repositories, and topic models are one of these techniques. However, topic models provide neither an interpretation nor labels for the extracted topics, so manual analysis is required to identify them. Some approaches have been proposed to label the topics automatically using tags in SR, but they do not consider the existence of spam-tags and they have difficulty scaling to a large tag space. This article introduces a novel approach called automatically labelled software topic model (AL-STM) that labels the topics based on observed tags in SR. It mitigates the shortcomings of manual and automatic labelling of topics in SE. AL-STM is implemented using 22K GitHub projects and evaluated on an SE task (tag recommendation) against the currently used techniques. The empirical results suggest that AL-STM is more robust in terms of MAP and nDCG, and more scalable to a large tag space.

Introduction

Software repositories (SR) offer a real opportunity to understand software aspects, enhance software quality, and promote code reuse. The textual data in public SR are mostly unstructured (Agrawal, Fu, & Menzies, 2018) (Chen, Thomas, & Hassan, 2016) and can be found in many software artefacts, such as source code, email archives, and bug reports. To exploit the latent information in these data, the software engineering (SE) community has conducted several studies on mining software repositories (MSR) using information retrieval (IR) techniques. Topic models are among the most widely used IR techniques. They are statistical models that discover latent semantic structures in unstructured textual data and cluster them into topics, where each topic is a set of co-occurring words and each document is a mixture of topics. Several approaches based on topic models such as Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) and Labelled LDA (LLDA) (Ramage, Hall, Nallapati, & Manning, 2009) have been proposed to support SE tasks such as feature location and extraction (Binkley, Lawrie, Uehlinger, & Heinz, 2015) (Sun, Li, Leung, Li, & Li, 2015), traceability link recovery (Hindle, Bird, Zimmermann, & Nagappan, 2015) (Panichella, et al., 2013), software quality and metrics (Chen, Shang, Nagappan, Hassan, & Thomas, 2017) (Hu & Wong, 2013), and software organisation and clustering (Markovtsev & Kant, 2017) (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017).
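
The idea that a document is a mixture of topics can be illustrated with a minimal sketch. It assumes two toy topics and one document; the topic names, words, and probabilities are invented for illustration and are not taken from the article. The document's word distribution is obtained by weighting each topic's word distribution by the topic's share in the document.

```python
# Toy illustration only: every name and number here is hypothetical.
# Each topic is a probability distribution over words.
topic_word = {
    "topic_0": {"http": 0.5, "server": 0.3, "request": 0.2},
    "topic_1": {"matrix": 0.6, "tensor": 0.3, "server": 0.1},
}

# A document is a mixture of topics (the weights sum to 1).
doc_topic = {"topic_0": 0.75, "topic_1": 0.25}


def document_word_distribution(doc_topic, topic_word):
    """Return p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)."""
    dist = {}
    for topic, weight in doc_topic.items():
        for word, prob in topic_word[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * prob
    return dist


word_dist = document_word_distribution(doc_topic, topic_word)
```

In this toy setting, a word such as "server" that appears in both topics receives probability mass from each of them, which is why co-occurring words alone do not tell a reader what a topic means.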

Despite their advantages, topic models have some shortcomings, such as the dependency of their performance on the selected parameters, uninterpreted topics, and poor performance on short texts. Topic models do not give an interpretation of the generated topics and require an extra step to label them. In SE, this step is costly and depends on experts' knowledge if done manually, especially when dealing with a large number of topics. Some approaches use the words with the highest marginal probability for labelling. However, previous studies (Kawaguchi, Garg, Matsushita, & Inoue, 2006), (Markovtsev & Kant, 2017), (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017), (Tian, Revelle, & Poshyvanyk, 2009) showed that even a handful of topics could take several days to analyse completely. In a recent survey (Chen, Thomas, & Hassan, 2016), the authors reported: “Labelling and interpreting topics can be difficult and subjective and may require much human effort. Future studies should explore ways to apply different approaches to automatically label the topics” (p. 34).
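
Labelling a topic by its most probable words can be sketched as follows. This is a minimal illustration, assuming each topic is already available as a word-probability table; the topics, words, probabilities, and the helper name `top_words` are hypothetical and are not part of AL-STM.

```python
# Toy illustration only: topics, words, and probabilities are hypothetical.
topic_word = {
    "topic_0": {"http": 0.5, "server": 0.3, "request": 0.2},
    "topic_1": {"matrix": 0.6, "tensor": 0.3, "server": 0.1},
}


def top_words(topic_word, k=2):
    """Label each topic with its k most probable words.

    Ties are broken alphabetically so the result is deterministic.
    """
    labels = {}
    for topic, dist in topic_word.items():
        ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
        labels[topic] = [word for word, _ in ranked[:k]]
    return labels


labels = top_words(topic_word, k=2)
```

The resulting word lists still have to be read and interpreted by a person, which is the manual effort that the studies cited above report as costly.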

Some shortcomings have been addressed in SE, particularly parameter tuning (Agrawal, Fu, & Menzies, 2018) (Panichella et al., 2013). Others have been treated in NLP and applied by the SE community, particularly topic interpretation and labelling (Lau, Grieser, Newman, & Baldwin, 2011) (Ramage, Hall, Nallapati, & Manning, 2009) and adaptation to short texts (Yan, Guo, Lan, & Cheng, 2013) (Jin, Liu, Zhao, Yu, & Yang, 2011). In SE, the automatic labelling approaches use tags extracted from tagged SR to label the topics. These tags are the outcome of the tagging mechanism adopted by many open-source SR to organise software and facilitate project search. Some repositories, such as SourceForge, allow only a restricted list of tags. Others, such as GitHub, Freecode and Openhub, allow the project owner and users to add any tag. This absence of a controlled vocabulary in some repositories caused the number of tags to increase significantly, which led to the appearance of noisy tags among them. The authors studied 120K GitHub projects and found that there is a high risk of spam-tags being present in SR.
