Plagiarism Detection Algorithm for Source Code in Computer Science Education

Xin Liu, Chan Xu, Boyu Ouyang
DOI: 10.4018/978-1-5225-8057-7.ch017

Abstract

Computer programming is increasingly necessary in program design courses in college education. However, some students plagiarize homework and make only slight modifications to the copied code. It is not easy for teachers to judge whether source code has been plagiarized, and traditional detection algorithms do not suit this situation. The authors designed an effective and complete method for detecting source code plagiarism, based on the ways students commonly plagiarize. The algorithm rests on two basic concepts. The first is to standardize the source code through filtering, which removes most of the noise that plagiarists intentionally blend in. The second is an improved Longest Common Subsequence algorithm for text matching that uses the statement as the unit of matching. The authors also designed an appropriate hash function to increase matching efficiency. A system built on the algorithm runs well and meets practical requirements in application.
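
The two concepts named in the abstract can be illustrated with a minimal sketch. It assumes C-like input and assumes that standardization means stripping comments, whitespace, and letter case. The chapter's actual filtering rules, its improvements to the Longest Common Subsequence algorithm, and its hash function are not given in this preview, so `normalize`, `statement_hashes`, `lcs_length`, and `similarity` are hypothetical names, and CRC32 is only a stand-in hash.

```python
import re
import zlib


def normalize(source):
    """Split C-like source into statements with comments, whitespace, and case removed.

    Assumption: this toy filter ignores string literals, so it is not a real lexer.
    """
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)  # drop block comments
    source = re.sub(r"//[^\n]*", "", source)               # drop line comments
    statements = []
    for stmt in re.split(r"[;{}]", source):                # statement boundaries
        stmt = re.sub(r"\s+", "", stmt).lower()
        if stmt:
            statements.append(stmt)
    return statements


def statement_hashes(source):
    """Hash each normalized statement so matching compares integers, not strings."""
    return [zlib.crc32(s.encode("utf-8")) for s in normalize(source)]


def lcs_length(a, b):
    """Classic dynamic-programming LCS length over two sequences, one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(src_a, src_b):
    """Share of the shorter program's statements found, in order, in the other."""
    a, b = statement_hashes(src_a), statement_hashes(src_b)
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


original = "int a = 1; int b = 2; int c = a + b; return c;"
copied = "int a = 1;  // init\nint  B = 2;\ndummy = 0;\nint c = a + b;\nreturn c;"
score = similarity(original, copied)
```

In this sketch the inserted `dummy = 0;` statement, the added comment, and the changed letter case do not lower the score, because the filter removes the cosmetic changes and a subsequence match tolerates inserted statements.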

2. Existing Methods And Shortcomings

Research on similarity detection for source code began in the 1970s. Halstead (1975) proposed the first algorithm, the attribute counting method, which counts the operators and operands that appear in a source program and uses these statistics as the main basis for detection. Ottenstein (1976) used the attribute counting method to implement the first near-duplicate detection system for source code, targeting Fortran. Because attribute counts do not retain information about program structure, the method has a high false alarm rate (defined in Section 4) on short programs and cannot meet practical requirements.
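
The loss of structure information can be seen in a small sketch of operator and operand counting. It is an illustration under assumptions, not Halstead's or Ottenstein's implementation: the tokenizer is a toy regular expression for a C-like language, and `attribute_counts`, `TOKEN_RE`, and `KEYWORDS` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a toy tokenizer for a C-like language; a real tool would use a lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/%=<>!]=?|[(){};,]")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def attribute_counts(source):
    """Return (distinct operators, distinct operands, total operators, total operands)."""
    operators, operands = Counter(), Counter()
    for tok in TOKEN_RE.findall(source):
        is_word = tok[0].isalnum() or tok[0] == "_"
        if tok in KEYWORDS or not is_word:
            operators[tok] += 1   # keywords and punctuation count as operators
        else:
            operands[tok] += 1    # identifiers and literals count as operands
    return (len(operators), len(operands),
            sum(operators.values()), sum(operands.values()))


first = attribute_counts("x = a + b; y = x * 2;")
second = attribute_counts("y = 2 * x; x = b + a;")
```

Both fragments yield the profile `(4, 5, 6, 6)` even though they compute different things in a different order. In a short program there are few tokens to count, so such coincidences are likely, which is consistent with the high false alarm rate described above.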

In the mid-1990s, Verco and Wise (1996) added vector dimension technology to the attribute counting method, but the results were still unsatisfactory. Damashek (1995) proposed the structure metric approach, which uses program control flow as the metric; such methods are usually applied together with attribute counting. They work well on large programs: when handling complex problems, different programmers often have different ideas, so the probability of identical control flow is extremely low and the false alarm rate is relatively low. Experiments have shown, however, that when such algorithms are applied to programming assignments, the false alarm rate is relatively high. Assignments are simple and students draw on the same fundamental knowledge, so their approaches to a problem are similar and the control flow of their programs is largely alike.
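
A rough sketch shows why control flow alone is a weak signal on simple assignments. It reduces a C-like program to the ordered list of its control keywords, which is a deliberately crude stand-in for a real control flow metric; `control_skeleton` and `CONTROL_RE` are hypothetical names, and the two student fragments are invented for illustration.

```python
import re

# Assumption: the ordered list of control keywords approximates the control flow.
CONTROL_RE = re.compile(r"\b(if|else|for|while|do|switch|return)\b")


def control_skeleton(source):
    """Return the control keywords of a C-like program in order of appearance."""
    return CONTROL_RE.findall(source)


# Two hypothetical, independently written answers to "sum the even numbers up to n".
student_a = "int s=0; for(i=1;i<=n;i++){ if(i%2==0) s+=i; } return s;"
student_b = ("int total=0; for(k=2;k<=limit;k++){ if(k%2==0) total=total+k; } "
             "return total;")

same_flow = control_skeleton(student_a) == control_skeleton(student_b)
```

Here both skeletons are `['for', 'if', 'return']`, so a detector based only on control flow would flag the pair even though the identifiers, loop bounds, and update expressions differ.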
