Text similarity is a measure of how close two given texts are, in terms of both lexical closeness and semantic similarity; it yields a value representing that closeness. Text similarity is used in search engines, which match the relevance of a searched query against documents; in question-and-answer sites such as Quora, which need to determine whether a question has been asked before; and in customer service, for product searches and queries about deliveries, invoices, and the like. AI systems should be able to recognize semantically similar questions from users and provide a uniform response. In all of the above, the emphasis is on semantic similarity, which aims to build a system that recognizes language and word patterns so as to produce responses that resemble human conversation. Various methods can be used to compute text similarity; the one discussed here, and employed in this project, is the cosine similarity method.
1.1.1 Cosine Similarity
Cosine similarity is a metric that measures how similar two documents are irrespective of their size. It measures the cosine of the angle between two vectors projected in a multidimensional space. Even if two documents are far apart in Euclidean distance because of their differing sizes, their cosine similarity may still indicate that they are similar. The smaller the angle between the vectors, the higher the similarity. If the two vectors point in the same direction, the documents have the same semantics; if they are perpendicular to each other, the documents are semantically unrelated; and if they point in completely opposite directions, they have completely opposite semantics.
The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product formula:

\[
A \cdot B = \|A\|\,\|B\|\cos\theta
\]

Given two vectors of attributes, $A$ and $B$, the cosine similarity $\cos\theta$ is represented using the dot product and magnitudes as:

\[
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\]

where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, respectively.
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, while in-between values indicate intermediate similarity or dissimilarity.
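The formula above can be sketched directly in plain Python; this is a minimal illustration of the metric itself, not tied to any particular library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no defined direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical direction -> 1, opposite -> -1, perpendicular -> 0.
print(cosine_similarity([1, 0], [1, 0]))    # 1.0
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0
print(cosine_similarity([1, 0], [0, 1]))    # 0.0
```

Note that the result depends only on the vectors' directions, not their magnitudes, which is why documents of very different lengths can still score as highly similar.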
The goal here is to convert each annotation of the pun word and each annotation of the helping word into vectors, and then determine the cosine similarity between every pair of pun-word and helping-word annotations. In the proposed BIT_SYS1 system, the first annotation of the pun is the one whose annotation vector has the highest cosine similarity with the first helping word; the second annotation of the pun is the one among the remaining annotations with the highest cosine similarity with the second helping word.
In the other proposed system, BIT_SYS2, which dispenses with the concept of the helping word, the first annotation of the pun is the one with the highest cosine similarity among the pun's own annotations, and the second is the one with the lowest.
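The BIT_SYS1 selection step described above can be sketched as follows. The vector representations, function names, and example data here are illustrative assumptions, not the system's actual implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_annotation(pun_annotation_vectors, helper_vector):
    """Hypothetical BIT_SYS1-style step: return the index of the pun
    annotation whose vector is most similar to the helping word's vector."""
    return max(
        range(len(pun_annotation_vectors)),
        key=lambda i: cosine_similarity(pun_annotation_vectors[i], helper_vector),
    )

# Toy vectors standing in for annotation embeddings (assumed data).
pun_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
helper = [2.0, 0.1]
print(pick_annotation(pun_vectors, helper))  # 0
```

The second annotation would then be chosen the same way over the remaining annotations against the second helping word; BIT_SYS2 would instead compare the pun's annotations against each other, taking the highest- and lowest-scoring ones.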