Text similarity is a measure of how close two given texts are, in terms of both lexical closeness and semantic similarity; it yields a value representing that closeness. Text similarity is used in search engines, which match the relevance of a searched query against documents; in question-and-answer sites such as Quora, which need to determine whether a question has been asked before; and in customer service, for product searches and queries about deliveries, invoices, and the like. AI systems should be able to recognize semantically similar questions from users and provide a uniform response. In all of the above, the emphasis is on semantic similarity, which aims to build a system that recognizes language and word patterns so as to produce responses that resemble human conversation. Various methods can be used to compute text similarity; the one discussed here, and employed in this project, is the cosine similarity method.
1.1.1 Cosine Similarity
Cosine similarity is a metric that measures how similar two documents are irrespective of their size. It measures the cosine of the angle between two vectors projected in a multidimensional space. Even if two documents are far apart in Euclidean distance because of their differing sizes, their cosine similarity may still indicate that they are similar. The smaller the angle between the vectors, the higher the similarity. If the two vectors point in the same direction, the documents have the same semantics; if they are perpendicular to each other, the documents are semantically unrelated; and if they point in completely opposite directions, they have completely opposite semantics.
The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product formula:

\[
A \cdot B = \|A\|\,\|B\|\cos\theta
\]

Given two vectors of attributes, $A$ and $B$, the cosine similarity $\cos\theta$ is represented using the dot product and magnitudes as:

\[
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\]

where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, respectively.
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, while in-between values indicate intermediate similarity or dissimilarity.
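The formula above can be sketched directly in plain Python; this is a minimal illustration of the metric itself, not tied to any particular library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no defined direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical direction -> 1, opposite -> -1, perpendicular -> 0.
print(cosine_similarity([1, 0], [1, 0]))    # 1.0
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0
print(cosine_similarity([1, 0], [0, 1]))    # 0.0
```

Note that the result depends only on the vectors' directions, not their magnitudes, which is why documents of very different lengths can still score as highly similar.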
The goal here is to convert each annotation of the pun word and each annotation of the helping word into vectors, and then determine the cosine similarity between every pair of pun-word and helping-word annotations. In the proposed BIT_SYS1 system, the first annotation of the pun is the one whose annotation vector has the highest cosine similarity with the first helping word; the second annotation of the pun is the one among the remaining annotations with the highest cosine similarity with the second helping word.
In the other proposed system, BIT_SYS2, which dispenses with the concept of the helping word, the first annotation of the pun is the one with the highest cosine similarity among the pun's own annotations, and the second is the one with the lowest.
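The BIT_SYS1 selection step described above can be sketched as follows. The vector representations, function names, and example data here are illustrative assumptions, not the system's actual implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_annotation(pun_annotation_vectors, helper_vector):
    """Hypothetical BIT_SYS1-style step: return the index of the pun
    annotation whose vector is most similar to the helping word's vector."""
    return max(
        range(len(pun_annotation_vectors)),
        key=lambda i: cosine_similarity(pun_annotation_vectors[i], helper_vector),
    )

# Toy vectors standing in for annotation embeddings (assumed data).
pun_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
helper = [2.0, 0.1]
print(pick_annotation(pun_vectors, helper))  # 0
```

The second annotation would then be chosen the same way over the remaining annotations against the second helping word; BIT_SYS2 would instead compare the pun's annotations against each other, taking the highest- and lowest-scoring ones.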