Text Mining in Program Code

Alexander Dreweke; Ingrid Fischer; Tobias Werth; Marc Wörlein

doi:10.4018/978-1-59904-990-8.ch035

Source Title: Handbook of Research on Text and Web Mining Technologies

Abstract

Searching for frequent pieces in a database containing some sort of text is a well-known problem. Program code, e.g. C++ or machine code for embedded systems, is a special sort of text. Filtering out duplicates in large software projects leads to more understandable programs and helps avoid mistakes when reengineering the program. On embedded systems, the size of the machine code is an important issue. To keep programs small, duplicates must be avoided. Several different approaches for finding code duplicates are presented, based either on the text representation of the code or on graphs that represent the data and control flow of the program, combined with graph mining algorithms.

Chapter Preview

Introduction

Computer programs are a special form of text. Words of a programming language are combined to form correct sentences in this programming language. There exists a wide variety of programming languages, ranging from high-level object-oriented languages like Java or C++ to machine code, the language a processor can actually “understand”. Programs are usually translated from high-level to low-level languages with the help of compilers. Producing this kind of “text”, the computer programs, is the daily work of many programmers; billions of lines of code have been written. Much of this code is not well documented and is not really understood by anybody once the original programmer has stopped working on it. Typically, many programmers work on one project, and old code from former versions or other projects is often reused.

Duplicated code fragments are a particular problem in large amounts of program code. These duplicated fragments can occur because of excessive use of “copy & paste”, because something was simply re-programmed, or because of the compiler. When translating from high-level to intermediate or low-level languages, new duplicates can be introduced, e.g. by using code templates for instructions and instruction sequences.

Finding these duplicates has been a focus of interest for many years. Code duplicates are called clones, and clone detection has produced many different algorithms. If program code is simply viewed as text, clone detection is nothing other than mining this text with the goal of finding duplicate or similar code. Merging application areas and algorithms from the data mining community on the one hand and from clone detection on the other leads to fruitful new insights and results.

Finding duplicated code in programs can have different goals. First, these duplicates can be visualized as a hint for programmers that something has to be done about this specific piece of code. Second, the redundant code can be replaced automatically by subroutine calls, in-lined procedure calls, macros, etc. that produce the same result. This leads to smaller code that is easier to understand and to maintain. Third, methods to detect and replace duplicated code can be integrated into compilers. Finally, in the area of embedded systems, finding duplicated code can lead to special hardware for the duplicates.

In the case of program code, duplicates are not always “totally equivalent”. It is not only the one-to-one duplicate of a piece of code that is interesting. Near duplicates, or even pieces of code that are syntactically different but semantically equivalent, must also be found. For example, two fragments may differ only in that two independent pieces of code, which have no side effects on each other, are exchanged. Variable names can also differ, and registers in machine code can vary.
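
One common way to make a text-based comparison tolerant of renamed variables is to normalize identifiers before comparing. The following is a minimal sketch of that idea, not the chapter's own method: the tiny keyword set, the tokenizer, and the placeholder names `v0`, `v1`, ... are assumptions made for illustration on a C-like toy language.

```python
import re

# Hypothetical, deliberately tiny keyword set for a C-like toy language.
KEYWORDS = {"int", "return", "if", "else", "while", "for"}
IDENT = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(fragment: str) -> list[str]:
    """Replace each distinct identifier by a positional placeholder (v0, v1, ...)."""
    names: dict[str, str] = {}
    out = []
    for tok in TOKEN.findall(fragment):
        if IDENT.fullmatch(tok) and tok not in KEYWORDS:
            # The first identifier seen becomes v0, the second v1, and so on.
            tok = names.setdefault(tok, f"v{len(names)}")
        out.append(tok)
    return out


a = "int sum = x + y; return sum * x;"
b = "int total = a + b; return total * a;"   # same structure, renamed variables
c = "int total = a + b; return total * b;"   # uses a different variable at the end

print(normalize(a) == normalize(b))  # the renamed fragment matches
print(normalize(a) == normalize(c))  # the structurally different one does not
```

Normalization of this kind only handles renaming. Reordered independent statements, the other case mentioned above, need a representation that captures dependencies rather than textual order.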

The application of clone detection ranges from high-level languages to machine code for embedded systems. The latter is the main topic of this chapter, and clone detection algorithms designed especially for embedded systems are described in detail.

Key Terms in this Chapter

Graph Mining: Methods to identify frequent subgraphs in a given graph database.

Code Compaction: The reduction of code size, either of binaries in order to save manufacturing costs or of source code in order to increase maintainability.

Clone Detection: Methods to identify code clones, i.e. sequences or sets of instructions that are similar or even identical to another one, in different kinds of programming code.

Embedded System: Unlike a general-purpose system, an embedded system is built and used for a special purpose with special requirements (e.g. real-time operation) and is produced in large numbers of units. Therefore, even tiny cost savings per piece often pay off.

Suffix Tree: A data structure used for the efficient detection of duplicated strings in a text.
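
To illustrate the kind of question a suffix tree answers, the sketch below finds a longest repeated substring by sorting all suffixes and comparing neighbours. This is a simpler stand-in chosen for brevity, not a suffix tree: it gives the same answer on small inputs but is far less efficient on large texts. The function name is an assumption for illustration.

```python
def longest_repeated_substring(text: str) -> str:
    """Return a longest substring that occurs at least twice (occurrences may overlap)."""
    # Naive suffix sorting: fine for short inputs, too slow and memory-hungry for large ones.
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for left, right in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        n = 0
        while n < min(len(left), len(right)) and left[n] == right[n]:
            n += 1
        if n > len(best):
            best = left[:n]
    return best


print(longest_repeated_substring("banana"))  # prints "ana"
```

Any repeated substring is a common prefix of two suffixes, and after sorting, the longest such prefix always appears between two adjacent suffixes, which is why comparing neighbours suffices.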

Slicing: A method to identify code clones based on a CDFG of the program. Starting from instructions or their operation codes, isomorphic subgraphs are grown, similar to graph mining.

Control Data Flow Graph (CDFG): Represents the control flow and the data dependencies in a program.
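
As a rough illustration of the data-dependency part of such a graph, the sketch below computes dependency edges inside a single straight-line block of instructions. It is a simplified assumption, not the chapter's construction: control flow and memory accesses are ignored, and the instruction encoding (text, registers written, registers read) is hypothetical.

```python
from itertools import combinations

# Each instruction is (text, registers written, registers read); a hypothetical toy encoding.
Instr = tuple[str, set[str], set[str]]


def data_dependencies(block: list[Instr]) -> set[tuple[int, int]]:
    """Return edges (i, j), i < j, where instruction j must stay after instruction i."""
    edges = set()
    for (i, (_, w1, r1)), (j, (_, w2, r2)) in combinations(enumerate(block), 2):
        # Read-after-write, write-after-read, or write-after-write conflict on a register.
        if w1 & r2 or r1 & w2 or w1 & w2:
            edges.add((i, j))
    return edges


block = [
    ("ld r1, a", {"r1"}, set()),
    ("ld r2, b", {"r2"}, set()),
    ("add r3, r1, r2", {"r3"}, {"r1", "r2"}),
    ("st r3, c", set(), {"r3"}),
]
print(sorted(data_dependencies(block)))  # [(0, 2), (1, 2), (2, 3)]
```

There is no edge between the two loads, so under these assumptions they could be exchanged without changing the result. A purely textual comparison would treat the two orderings as different code, whereas a graph-based one would not.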

Procedural Abstraction: The extraction of code clones into functions and the replacement of their occurrences by calls to these new functions.
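
A minimal sketch of this transformation on a list of instructions is shown below. It assumes the clone is already known and occurs verbatim. The `call` and `ret` pseudo-instructions and the function name `f0` are hypothetical, and real concerns such as saving registers or the return address are ignored.

```python
def abstract_clone(program: list[str], clone: list[str], name: str = "f0"):
    """Replace non-overlapping occurrences of `clone` by a call; return (new_program, function)."""
    if not clone:
        raise ValueError("clone must contain at least one instruction")
    out, i, k = [], 0, len(clone)
    while i < len(program):
        if program[i:i + k] == clone:
            out.append(f"call {name}")  # one call replaces k instructions
            i += k
        else:
            out.append(program[i])
            i += 1
    function = [f"{name}:"] + clone + ["ret"]
    return out, function


clone = ["ld r1, a", "ld r2, b", "add r1, r2"]
program = clone + ["st r1, c"] + clone + ["st r1, d"] + clone + ["st r1, e"]

new_program, function = abstract_clone(program, clone)
print(len(program), len(new_program) + len(function))  # 12 11
```

In this toy example the saving is small: three occurrences of a three-instruction clone shrink 12 lines to 11, because the extracted function adds a label and a return. With only two occurrences the result would be longer than the original, so abstraction only pays off for clones that are long or frequent enough.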

Fingerprinting: Code fragments are associated with numerical (hash) codes to speed up the detection of code clones. If two code blocks are identical, they have the same fingerprints. If two code blocks have identical fingerprints, these blocks do not necessarily have to be identical.
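
The two properties stated above (identical blocks share a fingerprint, but a shared fingerprint is only a hint) suggest a two-step procedure: bucket blocks by hash, then confirm by direct comparison. The sketch below assumes fixed-size windows of lines and a truncated SHA-1 hash; the window size and the function names are illustrative choices, not taken from the chapter.

```python
import hashlib
from collections import defaultdict


def canonical(block: list[str]) -> tuple[str, ...]:
    """Collapse whitespace inside each line so purely cosmetic differences are ignored."""
    return tuple(" ".join(line.split()) for line in block)


def fingerprint(block: list[str]) -> str:
    """Short hash of a block; truncated on purpose, so collisions are possible."""
    text = "\n".join(canonical(block))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def find_clones(lines: list[str], window: int = 3) -> list[list[int]]:
    """Return groups of start indices whose `window`-line blocks are verified identical."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for start in range(len(lines) - window + 1):
        buckets[fingerprint(lines[start:start + window])].append(start)

    groups = []
    for starts in buckets.values():
        # Equal fingerprints are only a hint: confirm by comparing the blocks themselves.
        by_content = defaultdict(list)
        for s in starts:
            by_content[canonical(lines[s:s + window])].append(s)
        groups += [g for g in by_content.values() if len(g) > 1]
    return groups


code = [
    "ld r1, a", "ld r2, b", "add r1, r2", "st r1, c",
    "ld r1, a", "ld  r2, b", "add r1, r2", "st r1, d",
]
print(find_clones(code, window=3))  # [[0, 4]]
```

Hashing keeps the number of expensive block comparisons small, since only blocks that land in the same bucket are ever compared directly.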

Regularity Extraction: The realization of code clones in hard logic on an embedded system.
