Introduction
Software repositories (SR) offer a valuable opportunity to understand software systems, enhance software quality, and promote code reuse. The textual data in public SR are mostly unstructured (Agrawal, Fu, & Menzies, 2018) (Chen, Thomas, & Hassan, 2016) and can be found in many software artefacts, such as source code, email archives, and bug reports. To exploit the latent information in these data, the software engineering (SE) community has conducted several studies on mining software repositories (MSR) using information retrieval (IR) techniques. Topic models are among the most widely used IR techniques. They are statistical models that discover latent semantic structures in unstructured textual data and cluster them into topics, where each topic is a set of co-occurring words and each document is a mixture of topics. Several approaches based on topic models, such as Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) and Labelled LDA (LLDA) (Ramage, Hall, Nallapati, & Manning, 2009), were proposed to support SE tasks such as feature location and extraction (Binkley, Lawrie, Uehlinger, & Heinz, 2015) (Sun, Li, Leung, Li, & Li, 2015), traceability link recovery (Hindle, Bird, Zimmermann, & Nagappan, 2015) (Panichella et al., 2013), software quality and metrics (Chen, Shang, Nagappan, Hassan, & Thomas, 2017) (Hu & Wong, 2013), and software organisation and clustering (Markovtsev & Kant, 2017) (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017).
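To make this concrete, the following minimal sketch (in Python, assuming scikit-learn is available; the toy corpus and parameter values are illustrative only and are not taken from the cited studies) fits LDA on a handful of short artefact-like documents, showing that each topic is a distribution over co-occurring words while each document is a mixture of topics:

# Minimal LDA sketch; assumes scikit-learn (>= 1.0) is installed.
# The toy documents below stand in for software artefacts such as bug
# reports or commit messages; they are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "null pointer exception crash stack trace in bug report",
    "fix crash caused by exception thrown on null reference",
    "add unit tests for parser module and refactor code",
    "refactor parser tests to improve code coverage",
]

# Bag-of-words representation of the unstructured textual data
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic model; doc_topic holds the per-document topic mixtures
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)
terms = vectorizer.get_feature_names_out()

# components_ holds the (unnormalised) word weights of each topic
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")   # co-occurring words that define the topic
print(doc_topic.round(2))        # topic proportions of each document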
Despite their advantages, topic models have several shortcomings, such as the sensitivity of their performance to the selected parameters, uninterpreted topics, and poor performance on short texts. Topic models do not provide an interpretation of the generated topics and require an extra step to label them. In SE, this step is costly and depends on expert knowledge when done manually, especially when dealing with a large number of topics. Some approaches use the words with the highest marginal probability for labelling. However, previous experiences (Kawaguchi, Garg, Matsushita, & Inoue, 2006) (Markovtsev & Kant, 2017) (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017) (Tian, Revelle, & Poshyvanyk, 2009) showed that even a handful of topics could take several days to be completely analysed. In a recent survey (Chen, Thomas, & Hassan, 2016), the authors reported: “Labelling and interpreting topics can be difficult and subjective and may require much human effort. Future studies should explore ways to apply different approaches to automatically label the topics” (p. 34).
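As an illustration of the naive strategy mentioned above, the sketch below (the topic-word distribution is hypothetical) labels a topic with its k highest-probability words; the resulting label still has to be interpreted by a human, which is precisely the effort the survey points to:

# Hypothetical topic-word distribution (word -> marginal probability),
# used only to illustrate labelling by the highest-probability words.
topic = {"exception": 0.21, "null": 0.18, "crash": 0.15,
         "parser": 0.04, "coverage": 0.02}

k = 3
label = " ".join(sorted(topic, key=topic.get, reverse=True)[:k])
print(label)   # "exception null crash" -- still needs human interpretation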
Some of these shortcomings were addressed in SE, particularly parameter tuning (Agrawal, Fu, & Menzies, 2018) (Panichella et al., 2013). Others were addressed in natural language processing (NLP) and adopted by the SE community, particularly topic interpretation and labelling (Lau, Grieser, Newman, & Baldwin, 2011) (Ramage, Hall, Nallapati, & Manning, 2009) and adaptation to short texts (Yan, Guo, Lan, & Cheng, 2013) (Jin, Liu, Zhao, Yu, & Yang, 2011). In SE, automatic labelling approaches use tags extracted from tagged SR to label the topics. These tags are the outcome of the tagging mechanism adopted by many open-source SR to organise software and facilitate project search. Some repositories, such as SourceForge, allow only a restricted list of tags; others, such as GitHub, Freecode, and Openhub, allow the project owner and users to add any tag. The absence of a controlled vocabulary in these repositories caused the number of tags to grow significantly, which led to the appearance of noisy tags among them. The authors studied 120K GitHub projects and found that there is a high risk of spam-tag presence in SR.