Data Mining in Programs: Clustering Programs Based on Structure Metrics and Execution Values

Data Mining in Programs: Clustering Programs Based on Structure Metrics and Execution Values

TianTian Wang, KeChao Wang, XiaoHong Su, Lin Liu
Copyright: © 2020 |Pages: 16
DOI: 10.4018/IJDWM.2020040104
(Individual Articles)
No Current Special Offers


Software exists in various control systems, such as security-critical systems and so on. Existing program clustering methods are limited in identifying functional equivalent programs with different syntactic representations. To solve this problem, firstly, a clustering method based on structured metric vectors was proposed to quickly identify structurally similar programs from a large number of existing programs. Next, a clustering method based on similar execution value sequences was proposed, to accurately identify the functional equivalent programs with code variations. This approach has been applied in automatic program repair, to identify sample programs from a large pool of template programs. The average purity value is 0.95576 and the average entropy is 0.15497. This means that the clustering partition is consistent with the expected partition.
Article Preview


Clustering is a widely used data mining technology. A cluster generated by a clustering algorithm is a set of data objects, which are similar to the objects in the same cluster, but different from the objects in other clusters. Many clustering methods have been proposed, such as hierarchical clustering, fuzzy clustering, K-Means clustering etc.

Program clustering refers to the clustering technology applied to the software field. It identifies and groups programs with similar attributes. It is widely used in program analysis and software maintenance, such as comprehension complex software systems (Mahmoud & Niu, 2013; Timothy & Nicolas, 2002), optimizing software architecture (Lung et al. 2006; Demme & Sethumadhavan, 2012), predicting software defects (Seliya & Khoshgoftaar, 2007), software evolution (Scanniello & Marcus, 2011), to improve the quality of software. Because software exists in various control systems, such as fuzzy systems, security-critical systems and so on, program clustering algorithm has received extensive attention.

We tried to apply program clustering in automatic repair of student programs. With the development of MOOCs and online educations, how to provide students with automatic feedback and assistance has become an urgent problem to be solved. Especially for the practical programming courses, the problem is more urgent. In the process of programming practice, students often encounter various errors that make their programs unable to run correctly. However, the current on-line programming system cannot prompt students the specific location of errors or the fixing scheme. Therefore, it is of great value to analyze student programs automatically and provide suggestions for fixing bugs.

Correct template programs written by teachers can provide useful guidance for bug fixing. However, writing templates increases the workload of teachers, and the templates provided may not be comprehensive. In the process of programming exercises or examination, a large number of student programs can be obtained. Many of these programs are correct and can be used as potential template programs. But student programs that implement the same programming task may have similar implementation forms or may be implemented in different ways. This requires analysis of these potential templates, removing redundant similar programs, and retaining multiple different implementations as templates. In addition, in the process of fixing bugs, the template program similar to that of the buggy program, can provide better reference for the revision of the buggy program, and avoid unnecessary modifications. The template programs identification problem can be solved by program clustering.

Most of the existing program clustering methods only analyze the static properties of programs but lack the consideration of execution information. This limits these methods in identifying the functional equivalent programs with code variations. Therefore, the purpose of this paper is to solve this problem and apply program clustering in automatic repairing of programs.

The contributions are as follows. A program clustering method considering both the structural metrics and execution value sequences is proposed. Not only the programs with similar structure can be identified, but also the programs with similar execution behavior can be recognized. By doing this, whether two functional equivalent programs adopt the same algorithm can be judged by the similarity of execution sequences.

Complete Article List

Search this Journal:
Volume 20: 1 Issue (2024): Forthcoming, Available for Pre-Order
Volume 19: 6 Issues (2023)
Volume 18: 4 Issues (2022): 2 Released, 2 Forthcoming
Volume 17: 4 Issues (2021)
Volume 16: 4 Issues (2020)
Volume 15: 4 Issues (2019)
Volume 14: 4 Issues (2018)
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing