Introduction
Software repositories (SR) offer a valuable opportunity to understand software systems, enhance software quality, and promote code reuse. The textual data in public SR are mostly unstructured (Agrawal, Fu, & Menzies, 2018) (Chen, Thomas, & Hassan, 2016) and can be found in many software artefacts, such as source code, email archives, and bug reports. To exploit the latent information in these data, the software engineering (SE) community has conducted several studies on mining software repositories (MSR) using information retrieval (IR) techniques. Topic models are among the most widely used IR techniques. They are statistical models that discover latent semantic structures in unstructured textual data and cluster them into topics, where each topic is a set of co-occurring words and each document is a mixture of topics. Several approaches based on topic models, such as Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) and Labelled LDA (LLDA) (Ramage, Hall, Nallapati, & Manning, 2009), were proposed to support SE tasks such as feature location and extraction (Binkley, Lawrie, Uehlinger, & Heinz, 2015) (Sun, Li, Leung, Li, & Li, 2015), traceability link recovery (Hindle, Bird, Zimmermann, & Nagappan, 2015) (Panichella et al., 2013), software quality and metrics (Chen, Shang, Nagappan, Hassan, & Thomas, 2017) (Hu & Wong, 2013), and software organisation and clustering (Markovtsev & Kant, 2017) (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017).
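To make this concrete, the following minimal sketch (in Python, assuming scikit-learn is available; the toy corpus and parameter values are illustrative only and are not taken from the cited studies) fits LDA on a handful of short artefact-like documents, showing that each topic is a distribution over co-occurring words while each document is a mixture of topics:

# Minimal LDA sketch; assumes scikit-learn (>= 1.0) is installed.
# The toy documents below stand in for software artefacts such as bug
# reports or commit messages; they are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "null pointer exception crash stack trace in bug report",
    "fix crash caused by exception thrown on null reference",
    "add unit tests for parser module and refactor code",
    "refactor parser tests to improve code coverage",
]

# Bag-of-words representation of the unstructured textual data
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic model; doc_topic holds the per-document topic mixtures
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)
terms = vectorizer.get_feature_names_out()

# components_ holds the (unnormalised) word weights of each topic
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")   # co-occurring words that define the topic
print(doc_topic.round(2))        # topic proportions of each document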
Despite their advantages, topic models have several shortcomings, such as the sensitivity of their performance to the selected parameters, uninterpreted topics, and poor performance on short texts. Topic models do not provide an interpretation of the generated topics and require an extra step to label them. In SE, this step is costly and depends on expert knowledge when done manually, especially when dealing with a large number of topics. Some approaches use the words with the highest marginal probability for labelling. However, previous experiences (Kawaguchi, Garg, Matsushita, & Inoue, 2006) (Markovtsev & Kant, 2017) (Sharma, Thung, Kochhar, Sulistya, & Lo, 2017) (Tian, Revelle, & Poshyvanyk, 2009) showed that even a handful of topics could take several days to be completely analysed. In a recent survey (Chen, Thomas, & Hassan, 2016), the authors reported: “Labelling and interpreting topics can be difficult and subjective and may require much human effort. Future studies should explore ways to apply different approaches to automatically label the topics” (p. 34).
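As an illustration of the naive strategy mentioned above, the sketch below (the topic-word distribution is hypothetical) labels a topic with its k highest-probability words; the resulting label still has to be interpreted by a human, which is precisely the effort the survey points to:

# Hypothetical topic-word distribution (word -> marginal probability),
# used only to illustrate labelling by the highest-probability words.
topic = {"exception": 0.21, "null": 0.18, "crash": 0.15,
         "parser": 0.04, "coverage": 0.02}

k = 3
label = " ".join(sorted(topic, key=topic.get, reverse=True)[:k])
print(label)   # "exception null crash" -- still needs human interpretation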
Some of these shortcomings were addressed in SE, particularly parameter tuning (Agrawal, Fu, & Menzies, 2018) (Panichella et al., 2013). Others were addressed in natural language processing (NLP) and adopted by the SE community, particularly topic interpretation and labelling (Lau, Grieser, Newman, & Baldwin, 2011) (Ramage, Hall, Nallapati, & Manning, 2009) and adaptation to short texts (Yan, Guo, Lan, & Cheng, 2013) (Jin, Liu, Zhao, Yu, & Yang, 2011). In SE, automatic labelling approaches use tags extracted from tagged SR to label the topics. These tags are the outcome of the tagging mechanism adopted by many open-source SR to organise software and facilitate project search. Some repositories, such as SourceForge, allow only a restricted list of tags; others, such as GitHub, Freecode, and Openhub, allow the project owner and users to add any tag. The absence of a controlled vocabulary in these repositories caused the number of tags to grow significantly, which led to the appearance of noisy tags among them. The authors studied 120K GitHub projects and found that there is a high risk of spam-tag presence in SR.