Text Mining in Program Code

Text Mining in Program Code

Alexander Dreweke (Friedrich-Alexander University Erlangen-Nuremberg, Germany), Ingrid Fischer (University of Konstanz, Germany), Tobias Werth (Friedrich-Alexander University Erlangen-Nuremberg, Germany) and Marc Wörlein (Friedrich-Alexander University Erlangen-Nuremberg, Germany)
Copyright: © 2009 |Pages: 20
DOI: 10.4018/978-1-59904-990-8.ch035
OnDemand PDF Download:
No Current Special Offers


Searching for frequent pieces in a database with some sort of text is a well-known problem. A special sort of text is program code as e.g. C++ or machine code for embedded systems. Filtering out duplicates in large software projects leads to more understandable programs and helps avoiding mistakes when reengineering the program. On embedded systems the size of the machine code is an important issue. To ensure small programs, duplicates must be avoided. Several different approaches for finding code duplicates based on the text representation of the code or on graphs representing the data and control flow of the program and graph mining algorithms.
Chapter Preview


Computer programs are a special form of text. Words of a programming languages are combined to form correct sentences in this programming language. There exists a wide variety of programming languages, ranging from high-level object-oriented languages like Java or C++ to machine code, the language a processor can actually “understand”. Programming languages are usually translated with the help of compilers from high- to low-level. To produce this kind of “text” - the computer programs - is the daily work of many programmers; billions of lines of code have been written. Mostly, this code is not well documented and not really understood by anybody after the original programmer stopped working. Typically, many programmers are working on one project and often old code from former versions or other projects is used.

Duplicated code fragments are a special problem in big amounts of program code. These duplicated fragments can occur because of excessive use of “copy & paste”, because something was simply re-programmed or also because of the compiler. When translating from the high-level to intermediate or low-level languages, new duplicates can be introduced, e.g. by using code templates for instructions and instruction sequences.

Finding these duplicates has been in the focus of interest for many years. Code duplicates are called clones and clone detection has produced many different algorithms. If program code is simply viewed as text, clone detection is nothing else than mining in this text with the goal of finding the duplicate or similar code. Merging application areas and algorithms from the data mining community on the one hand and clone detection leads to fruitful new insights and results.

Finding duplicated code in programs can have different goals. First these duplicates can be visualized as a hint for programmers that something has to be done about this specific piece of code. Second, the redundant code can be replaced automatically by subroutine calls, in-lined procedure calls, and macros etc. that produce the same result. This leads to smaller code that is easier to understand or to maintain. Third, methods to detect and replace duplicated code can be integrated into compilers. Finally, finding duplicated code can lead to special hardware for the duplicates in the area of embedded systems.

In the case of program code, duplicates are not always “totally equivalent”. It is not only the one-to-one duplicate from a piece of code that is interesting. Also near duplicates or even pieces of code, that are syntactically different, but semantically equivalent must be found. E.g. in two fragments only two independent pieces of code having no side effect onto each other can be exchanged. Variable names can be different or registers in machine code can vary.

The application of clone detection ranges from high-level languages to machine code for embedded systems. The latter is the main topic in this chapter. The clone detection algorithms especially for embedded systems are described in detail.

Key Terms in this Chapter

Graph Mining: Methods to identify frequent subgraphs in a given graph database.

Code Compaction: The code size reduction of binaries in order to save manufacturing costs or of source code in order to increase maintainability

Clone Detection: Methods to identify code clones i.e. a sequence or set of instructions that is similar or even identical to another one in different kinds of programming code.

Embedded System: Unlike general purpose systems, an Embedded System is used and built for special purpose with special requirements (e.g. real-time operation) and is produced with a large number of units. Therefore, tiny cost savings per piece pay off often.

Suffix Tree: A suffix tree is a data structure used for an efficient detection of duplicated strings in a text.

Slicing: Method to identify code clones based on a CDFG of the program. Based on instructions or their operational code isomorphic subgraphs are grown; similar to graph mining.

Control Data Flow Graph (CDFG): Represents the control flow and the data dependencies in a program.

Procedural Abstraction: The extraction code clones into functions and replacing their occurrences by calls to these new functions.

Fingerprinting: Code fragments are associated with numerical (hash) codes to speed up the detection of code clones. If two code blocks are identical, they have the same fingerprints. If two code blocks have identical fingerprints, these blocks do not necessarily have to be identical.

Regularity Extraction: The realization of code clones in hard logic on an embedded system.

Complete Chapter List

Search this Book: