Creating Paraphrase Identification Corpus for Indian Languages: Opensource Data Set for Paraphrase Creation

Anand Kumar M. (Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India), Shivkaran Singh (Arnekt Solutions, Pune, India), Praveena Ramanan (Quantiphi Inc, Bangalore, India), Vaithehi Sinthiya (Karunya Institute of Technology and Sciences, Coimbatore, India) and Soman K. P. (Amrita School of Engineering, Amrita Vishwa Vidyapeetham University, India)
DOI: 10.4018/978-1-5225-9643-1.ch008


In recent times, the paraphrase identification task has attracted the attention of the research community. A paraphrase is a phrase or sentence that conveys the same information as another but uses different words or a different syntactic structure. The Microsoft Research Paraphrase Corpus (MSRP) is a well-known, openly available paraphrase corpus for the English language. Prior to this work, no such publicly available paraphrase corpus existed for any Indian language. This chapter explains the creation of a paraphrase corpus for the Hindi, Tamil, Malayalam, and Punjabi languages, the first publicly available corpus of its kind for any Indian language. It was used in the shared task on Detecting Paraphrases in Indian Languages (DPIL) held in conjunction with the Forum for Information Retrieval & Evaluation (FIRE) 2016. The annotation was performed by a postgraduate student, followed by two-step proofreading by a linguist and a language expert.
Typologies Of Paraphrases

Paraphrasing is a distinct technique for shaping different language models (Barreiro, A., 2009). Linguistically, paraphrases are described in terms of meaning or semantics. According to Meaning-Text theory (Mel'čuk, I. A., & Polguere, A, 1987), if two or more syntactic constructions (sentence formations) in a language preserve semantic equality, they are considered paraphrases. The degree of semantic agreement between the source text and the paraphrased text expresses how similar the two are in meaning. Paraphrasing is typically associated with synonyms, but various other linguistic units, such as semi-synonyms, metaphors, linguistic entailment, and figurative meaning, are also considered components of paraphrasing. Paraphrasing is not confined to the lexical level; it is also found at other levels, such as the phrasal and sentential levels (Zhao, S., Liu, T., Yuan, X., Li, S., & Zhang, Y, 2007). These levels of paraphrasing reveal the diverse classes of paraphrases and their relationship to the source document. Some of the most common paraphrase types are described below (Barrón-Cedeño, A., Vila, M., Martí, M. A., & Rosso, P, 2013).

Lexical paraphrasing is the main method of paraphrasing and is commonly found in the literature. For instance (Anand Kumar, M., Singh, S., Kavirajan, B., Soman, K.P, 2018), if the source sentence is “The two ships were acquired by the navy after the war”, then proper paraphrased variants are “The two ships were conquered by the navy after the war” and “The two ships were won by the navy after the war”. More paraphrases are feasible for the given example sentence. Here, the source verb ‘acquire’ is paraphrased with its synonyms ‘conquer’ and ‘win’. In lexical paraphrases, the source sentence and the paraphrased sentence share a similar syntactic structure.
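
The substitution behind a lexical paraphrase can be illustrated with a minimal Python sketch. This is only an illustration, not the procedure used to build the corpus: the `SYNONYMS` table and the `lexical_paraphrases` function are hypothetical names introduced here, and the table is hand-made to cover just the example sentence. A practical system would need a lexical resource and morphological handling, which matter especially for the Indian languages discussed in this chapter.

```python
# Hypothetical, hand-made synonym table covering only the example sentence.
# Keys and values are already inflected forms, so no morphology is handled.
SYNONYMS = {"acquired": ["conquered", "won"]}


def lexical_paraphrases(sentence, synonyms=SYNONYMS):
    """Return variants of `sentence` with one word replaced by a listed synonym.

    The word order is left untouched, so every variant keeps the syntactic
    structure of the source sentence.
    """
    tokens = sentence.split()
    variants = []
    for i, token in enumerate(tokens):
        for substitute in synonyms.get(token.lower(), []):
            variants.append(" ".join(tokens[:i] + [substitute] + tokens[i + 1:]))
    return variants


source = "The two ships were acquired by the navy after the war"
for variant in lexical_paraphrases(source):
    print(variant)
```

Because only a single token changes and the word order is preserved, each generated variant has the same syntactic structure as the source, which is the defining property of a lexical paraphrase noted above.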

