1. Introduction And Literature Review
Plagiarism and collusion in assignments and coursework have aroused the concern of academics. With a vast amount of information online, the Internet and electronic databases make it very convenient for students to search for and download material relevant to their assignments, and students may be tempted to plagiarize. Academics therefore need to make an effort to identify plagiarized work and to educate students properly about intellectual property. However, scanning students’ work for copying is not only time-consuming but sometimes impractical, particularly in large classes where assignments are marked independently by multiple tutors.
More and more universities are introducing policies of compulsory plagiarism checking of students’ work. This move has a direct impact on the workload of academics, who have to be involved in judging students’ work and in drawing a reasonable line between fair use and plagiarism. Except for verbatim copying, detecting and proving plagiarism is not trivial. To alleviate the burden on academics, universities can either develop their own plagiarism detection system (PDS) or subscribe to a commercial PDS service.
Plagiarism typically involves copying the ideas of others without permission or without appropriately crediting the source (Paredes et al. 2007). Martin (1994) suggested six types of plagiarism, ranging from simple verbatim copying, which is easy to detect, to completely rewritten ideas of others, which are difficult to recognize, as illustrated in Table 1.
Table 1.
It would be extremely difficult, if not impossible, to detect all the types of plagiarism listed in Table 1. Among these six types, word-for-word plagiarism and paraphrasing plagiarism are relatively easy to detect reliably and automatically, without human involvement. The other types are more elusive, and it is therefore difficult to develop efficient algorithms for detecting them automatically (Mozgovoy et al. 2010; Barrón-Cedeño et al. 2013).
Existing plagiarism detection systems are designed specifically for either textual documents or program code. Suspected plagiarism is identified either within a local collection of documents (hermetic or collusion detection) or against external sources such as the Internet (Web detection). Both commercial products and open-source freeware are available. Most commercial plagiarism detection systems (e.g. Turnitin) are proprietary and installed on a server that provides services via the Internet. Open-source freeware (e.g. Ferret, Sherlock and WCopyfind) is typically installed on the client side. Most existing textual plagiarism detection systems are designed specifically for English. This paper focuses on server-side, hermetic, textual plagiarism detection for both English and Chinese.
Ferret is open-source plagiarism detection software built by Lyon et al. (2001, 2002, 2003) in the Computer Science department of the University of Hertfordshire. The idea of the algorithm is to compare the sets of three consecutive words (trigrams) in the documents concerned. Similarity is measured as the number of trigrams common to both documents divided by the total number of distinct trigrams in the two documents. When the similarity measure between two documents exceeds a certain threshold, the pair is flagged as suspected plagiarism. The detection algorithm is very efficient. Another strength of Ferret is that plagiarisers need substantial effort to evade detection, since copying disguised by simple word insertion, deletion or substitution can still be easily detected. However, Ferret does not handle the case where the plagiarized material is copied from multiple sources.
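The trigram similarity measure described above can be illustrated with a minimal Python sketch. It is not Ferret's actual implementation. The function names, the lowercasing, the regular-expression tokenizer and the threshold value of 0.5 are all assumptions made for illustration. The tokenizer only suits space-delimited text such as English; Chinese text would first need word segmentation.

```python
import re


def trigrams(text):
    """Return the set of distinct word trigrams in a text.

    Assumption: words are lowercased runs of word characters; the
    tokenization used by Ferret itself may differ.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(text_a, text_b):
    """Shared trigrams divided by all distinct trigrams in the two texts."""
    a, b = trigrams(text_a), trigrams(text_b)
    union = a | b
    if not union:
        # Neither text has three words; treat the pair as dissimilar.
        return 0.0
    return len(a & b) / len(union)


def is_suspicious(text_a, text_b, threshold=0.5):
    """Flag a pair whose resemblance exceeds the threshold.

    The default of 0.5 is a hypothetical value, not one taken from Ferret.
    """
    return resemblance(text_a, text_b) > threshold


original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown fox leaps over the lazy dog"
print(resemblance(original, edited))
```

In this example a single word substitution changes three of the seven trigrams in each sentence, leaving four shared trigrams out of ten distinct ones. Most of the overlap survives in longer documents, which is why small local edits do little to hide copying from this kind of measure.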