Latent Dirichlet Allocation and POS Tags Based Method for External Plagiarism Detection: LDA and POS Tags Based Plagiarism Detection

Ali Daud (King Abdulaziz University, Jeddah, Saudi Arabia & International Islamic University, Islamabad, Pakistan), Jamal Ahmad Khan (International Islamic University, Islamabad, Pakistan), Jamal Abdul Nasir (International Islamic University, Islamabad, Pakistan), Rabeeh Ayaz Abbasi (King Abdulaziz University, Jeddah, Saudi Arabia & Quaid-i-Azam University, Islamabad, Pakistan), Naif Radi Aljohani (Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia), and Jalal S. Alowibdi (Faculty of Computing and Information Technology, University of Jeddah, Jeddah, Saudi Arabia)
Copyright: © 2018 | Pages: 17
DOI: 10.4018/IJSWIS.2018070103


In this article, we present a new semantic and syntactic method for external plagiarism detection. In the proposed approach, Latent Dirichlet Allocation (LDA) and part-of-speech (POS) tags are used together to detect plagiarism between a suspicious document and a number of source documents. The basic hypothesis is that considering both semantic and syntactic information when comparing two text documents may improve the performance of the plagiarism detection task. Our method consists of two steps: a pre-processing step, in which we detect the topics of the sentences in the documents using LDA and convert each sentence into an array of POS tags, and a post-processing step, in which the suspicious cases are verified purely on the basis of semantic rules. For two types of external plagiarism (copy and random obfuscation), we empirically compare our approach to the state-of-the-art N-gram based and stop-word N-gram based methods and observe significant improvements.

1. Introduction

The rapid growth of the internet has made it the largest publicly accessible information source in the world. The easy availability of documents has created a problem of plagiarism: copying the work of others and presenting it as one's own without giving a reference to the original work. The problem of plagiarism is evident in academia. A large-scale study of 18,000 students shows that about 50% of the students plagiarized their work (McCabe et al., 2001). Plagiarism in text documents takes different forms, from exact copy-paste (verbatim copying) to paraphrasing and even translation from other languages (Stein et al., 2007). Developing an effective and automated tool for detecting plagiarism is a fascinating, practically useful, and challenging task.

External and intrinsic plagiarism detection are the two main strategies for plagiarism detection (Stamatatos, 2011). External plagiarism detection compares passages of the suspicious document against a set of possible source documents, whereas intrinsic plagiarism detection aims at discovering plagiarism by inspecting only the input document, without comparing it with possible source documents. We can define external plagiarism detection more formally as follows: given a suspicious document, $d$, and a set of source documents, $SD$, our goal is to find a set of passage pairs, $P$, such that

$$P = \{ \langle p_{d_i}, p_{SD_j} \rangle \mid \forall p_{d_i}, \forall p_{SD_j} : p_{d_i} \in d \wedge p_{SD_j} \in SD \wedge |p_{d_i} \cap p_{SD_j}| > \delta \} \qquad (1)$$

where $p_{d_i}$ is a passage from $d$, $p_{SD_j}$ is a passage from $SD$, and $|p_{d_i} \cap p_{SD_j}| > \delta$ indicates that the similarity between $p_{d_i}$ and $p_{SD_j}$ is greater than a threshold, $\delta$, so that the pair is considered a plagiarism case. The similarity measure can be defined in many ways.
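The definition in Equation (1) can be illustrated with a minimal sketch. It assumes that passages are plain strings and that similarity is the Jaccard overlap of their word sets; this measure is chosen here only for illustration and is not the similarity used by the proposed method. The names `jaccard` and `find_passage_pairs` are hypothetical.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two passages (illustrative measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_passage_pairs(suspicious, sources, delta, similarity=jaccard):
    """Return every (p_d, p_SD) pair whose similarity exceeds the threshold delta."""
    return [
        (p_d, p_sd)
        for p_d, p_sd in product(suspicious, sources)
        if similarity(p_d, p_sd) > delta
    ]


# Passages of a suspicious document d and of the source documents SD.
d = ["the cat sat on the mat", "completely unrelated sentence here"]
SD = ["the cat sat on a mat", "weather is nice today"]

pairs = find_passage_pairs(d, SD, delta=0.5)
# Only the first passage of d is paired with the first passage of SD.
```

Because the similarity function is passed as a parameter, any other measure can be substituted without changing the pairing logic, which reflects the remark that the similarity measure can be defined in many ways.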

The task of plagiarism detection usually comprises three stages: text representation, similarity estimation (between a suspicious document and the source documents), and extraction of sentences (the plagiarized ones and the originals). Documents are typically represented by sequences of words or characters. The most popular method is a sliding window of N-grams, where the window size is defined by a number of characters or a number of words (Schleimer, Wilkerson, & Aiken, 2003). Normally, windows of overlapping N-grams are generated. Overlapping N-grams require more comparisons but give better accuracy. Methods usually differ in the value of N. In principle, research on document representation for plagiarism detection can be classified into two categories, depending on the type of information or features used to index the document terms: (1) content-based information, which gives importance to important content terms, and (2) structural information, which gives importance to stop words.
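The sliding-window representation described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization for the word-level case; the function names are hypothetical and do not come from the cited works.

```python
def word_ngrams(text: str, n: int) -> list:
    """Overlapping word N-grams: the window slides one word at a time."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> list:
    """Overlapping character N-grams: the window slides one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word_windows = word_ngrams("plagiarism detection is a challenging task", 3)
char_windows = char_ngrams("copy", 2)
```

A text of k tokens yields k - N + 1 overlapping windows, whereas non-overlapping windows would yield only about k / N; this is where the additional comparisons come from.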

Works in the first category, e.g., (Gupta et al., 2010), give importance to important content terms, whereas works in the second category, e.g., (Stamatatos, 2011), take full advantage of stop-word occurrences. In the second category, instead of eliminating stop words, all the other tokens are eliminated. It is therefore a method based exclusively on structural information rather than content information. In most cases, plagiarized passages are highly modified by changing the word order and replacing words with synonyms. However, changing the basic syntactic structure is much more difficult than replacing words with synonyms. Hence, the need to use both syntactic and semantic information arises. The idea of using syntactic and semantic information to compute text similarity is well studied in many text mining tasks.
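The contrast between the two categories can be made concrete with a small sketch. It assumes a short, hand-picked stop-word list and a simple regular-expression tokenizer, both chosen for illustration only; the list used by Stamatatos (2011) is not reproduced here, and all names are hypothetical.

```python
import re

# Small illustrative stop-word list (an assumption, not the list from the cited work).
STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "was", "it", "for"})


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stopword_sequence(text: str) -> list:
    """Structural view: keep only the stop words, in their original order."""
    return [t for t in tokenize(text) if t in STOP_WORDS]


def content_sequence(text: str) -> list:
    """Content view: drop the stop words and keep the remaining terms."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


def ngrams(seq: list, n: int) -> list:
    """Overlapping N-grams over any token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


original = "The quick fox jumped over the fence and into the garden of the king"
obfuscated = "The fast fox leaped over the wall and into the yard of the monarch"

# Synonym replacement changes the content terms but not the stop-word skeleton.
same_structure = stopword_sequence(original) == stopword_sequence(obfuscated)
same_content = content_sequence(original) == content_sequence(obfuscated)
structural_ngrams = ngrams(stopword_sequence(original), 3)
```

In this example the two sentences share an identical stop-word sequence although most of their content terms differ, which shows why structural information survives synonym replacement while content-based matching alone does not.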
