A Source Code Plagiarism Detecting Method Using Sequence Alignment With Abstract Syntax Tree Elements

A Source Code Plagiarism Detecting Method Using Sequence Alignment With Abstract Syntax Tree Elements

Hiroshi Kikuchi (The University of Electro-Communications, Japan), Takaaki Goto (The University of Electro-Communications, Japan), Mitsuo Wakatsuki (The University of Electro-Communications, Japan) and Tetsuro Nishino (The University of Electro-Communications, Japan)
DOI: 10.4018/978-1-5225-8057-7.ch016


Learning to program is an important subject in computer science courses. During programming exercises, plagiarism by copying and pasting can lead to problems for fair evaluation. Some methods of plagiarism detection are currently available, such as sim. However, because sim is easily influenced by changing the identifier or program statement order, it fails to do enough to support plagiarism detection. In this paper, the authors propose a plagiarism detection method which is not influenced by changing the identifier or program statement order. The authors also explain our method's capabilities by comparing it to the sim plagiarism detector. Furthermore, the authors reveal how our method successfully detects the presence of plagiarism.
Chapter Preview

2. Background

2.1. Sequence Alignment

Sequence alignment is a method to calculate a correspondence relationship among strings by adding a space or shifting the alphabetic positions. Strings obtained after alignment are also called sequence alignment. The similarity between two strings can be described by a score. In our method, we first obtain tokens by lexical analysis, and then calculate the sequence alignment and compare the obtained similarity scores.

Here is an example. Let s and t be strings. Sometimes the lengths of s and t are different. We insert a gap symbol “-” in order to line up the strings’ lengths. This process is called alignment. Figure 1 illustrates an example of calculating the alignment between “masters” and “stars”.

Figure 1.

Example of a sequence alignment


As described in Figure 1, it is possible to have several patterns of alignment. Usually, sequence alignment includes a minimum number of gap symbols and fewer different characters in the two strings. After obtaining the alignment, we then compare each character in the same position in the two strings. If two characters are the same, we score it as m point. If two characters are different, we score it as d. Moreover, if one or both characters include a gap symbol, then we score it as g. We can obtain an alignment score by adding the scores of every character. If m ≥ d ≥ g or m ≥ g ≥ d, the alignment with the maximum number of total scores is the optimal alignment. If we assume m = 1; d = −1; g = −2, then we can obtain alignment scores for the left- and right-hand sides of Figure 1 as −6 and −1, respectively. It can be seen that the right-hand alignment is appropriate for this example.

Usually, calculating an alignment score can be done using dynamic programming. The Needleman-Wunsch algorithm (Needleman, 1970) is an efficient algorithm based on dynamic programming.

Figure 2.

Example of an alignment graph


Calculating the alignment score by using dynamic programming is done as follows (Akutsu, 2007). Let Σ be a set of characters. Let s and t be strings, and 978-1-5225-8057-7.ch016.m01 and 978-1-5225-8057-7.ch016.m02 indicate the length of each string. s[i] means the ith character in a string. We also use “−” as a gap symbol. Note that the gap symbol is not a member of Σ, and it cannot appear in the same position of the alignments.

Here we show a definition of score function which calculates scores by checking consistent and inconsistent gaps for each character in the strings. 978-1-5225-8057-7.ch016.m03 is a function which means

,it holds
under 978-1-5225-8057-7.ch016.m06. In this paper, we use the following definition:
(1) where


Let D be a matrix of 978-1-5225-8057-7.ch016.m09. Matrix D stores the intermediate results of the alignment scores, which can be calculated by the following recursive formula:


This formula calculates the maximum weighted path on the alignment graph as shown in Figure 2.

Complete Chapter List

Search this Book: