Efficient Weighted Semantic Score Based on the Huffman Coding Algorithm and Knowledge Bases for Word Sequences Embedding

Nada Ben-Lhachemi (Sidi Mohamed Ben Abdellah University, Morocco) and El Habib Nfaoui (LIIAN Laboratory, Sidi Mohamed Ben Abdellah University, Morocco)
Copyright: © 2020 | Pages: 17
DOI: 10.4018/IJSWIS.2020040107

Abstract

Learning text representations is at the core of numerous natural language processing applications. Word embedding is a type of text representation that allows words with similar meanings to have similar representations. Word embedding techniques capture semantic similarities between linguistic items based on their distributional properties in large samples of text data. Although these techniques are very efficient, handling semantic and pragmatic ambiguity with high accuracy is still a challenging research task. In this article, we propose a new feature, a semantic score that handles ambiguities between words. We use external knowledge bases and the Huffman coding algorithm to compute this score, which depicts the semantic relatedness between all fragments composing a given text. We combine this feature with word embedding methods to improve text representation. We evaluate our method on a hashtag recommendation system for Twitter, where text is noisy and short. The experimental results demonstrate that, compared with state-of-the-art algorithms, our method achieves good results.

Introduction

Word embeddings are primary building blocks for diverse natural language processing (NLP) applications, such as sentiment analysis and news and hashtag recommendation. They capture syntactic and semantic similarities between words. Word embedding techniques assume that linguistic items or words with similar distributions have similar meanings, and they use neural networks to generate text embeddings. Although word embedding techniques are very efficient, they do not handle the ambiguity that arises when a word has multiple meanings, and many such words exist in real-world text. In that case, each word sense should have its own representation in the vector space.

Researchers have proposed many kinds of models for text representation. One of the most familiar models is the bag-of-words (BoW) (Harris, 1981). The BoW model regards a document as a “BAG” containing words. It generates a vocabulary of all the unique words occurring in all the documents in the training set, disregarding word order as well as semantic and syntactic features.

The recent enormous success of unsupervised word embeddings raises the obvious question of whether similar methods could be used to enhance embeddings (i.e., semantic representations) of word sequences as well. Word embeddings are a set of techniques that represent words by vectors (typically of hundreds of dimensions) in a predefined vector space. The major benefit of word embeddings is that they capture diverse similarities between words (Bojanowski et al., 2016; Mikolov et al., 2013; Pennington et al., 2014). Besides, they do not require costly annotation, as they can be extracted from massive unannotated data sets. Word embedding techniques have been extended to different levels of text, for representing word sequences (i.e., sentences, paragraphs, microblogs, short and long documents), with simple methods, such as addition or concatenation of word vectors, or more involved methods, such as convolutional neural networks or recurrent neural networks (RNNs) (Arora et al., 2017; Iyyer et al., 2015; Le & Mikolov, 2014; Wang et al., 2016; Wieting et al., 2016).

Word embedding techniques are powerful and effective for learning semantic similarities between linguistic items based on their distributional properties in large samples of text data, and word sequence embedding techniques build on them to represent word sequences. Although these techniques are very efficient, handling semantic and pragmatic ambiguity with high accuracy is still a challenging and open research task. Semantic ambiguity occurs when the meaning of the words themselves can be misinterpreted. Pragmatic ambiguity occurs when the context of a phrase gives it multiple different interpretations. Thus, understanding text often requires assistance from diverse external knowledge bases (KBs).
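To make the two simplest representations mentioned above concrete, the following is a minimal sketch of a bag-of-words vector and of a word sequence embedding obtained by averaging word vectors. It assumes whitespace tokenization and lowercase folding; the function names and the toy two-dimensional vectors are hypothetical and are not taken from the article.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the unique (lowercased) words of all documents, sorted."""
    vocabulary = set()
    for document in documents:
        vocabulary.update(document.lower().split())
    return sorted(vocabulary)


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def average_embedding(document, word_vectors):
    """Average the vectors of the known words in a document.

    word_vectors is a hypothetical dict mapping a word to a list of floats;
    in practice it would come from a pretrained embedding model.
    """
    tokens = [t for t in document.lower().split() if t in word_vectors]
    if not tokens:
        return []
    dimension = len(word_vectors[tokens[0]])
    summed = [0.0] * dimension
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            summed[i] += value
    return [value / len(tokens) for value in summed]


# Toy usage with made-up data.
corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(corpus)
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
bow = bag_of_words("the cat the", vocab)
avg = average_embedding("cat dog", toy_vectors)
```

Both representations discard word order, and neither distinguishes between the senses of an ambiguous word, which is the limitation the passage above points to.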

To address these issues, the authors propose a method based on the Huffman coding algorithm and external KBs to calculate a semantic score that represents the semantic relatedness between all fragments composing a given text. The authors call this score the “Huffman score” and treat it as a new feature that they combine with word embedding techniques to improve the generation of text embeddings. The authors evaluate the performance of their method through a hashtag recommendation system. Hashtag recommendation in microblogging platforms, such as Twitter, is one of the leading challenges in data mining because of the particular features of such platforms:

  1. The conciseness of tweets (at most 280 characters), which drives bloggers to use unconventional language, high contextualization, and morphs/aliases. Thus, the content of tweets is occasionally difficult to understand, even for humans.

  2. Classical problems of text comprehension, such as synonymy, polysemy, and ambiguity.

  3. Excessive use of abbreviations and acronyms, owing to the character limit.

The authors demonstrate that their resulting tweet embeddings outperform many of the baselines on the hashtag recommendation task. Moreover, the authors’ technique may be appropriate in many other contexts, since it is of practical use and does not require any additional training when it is applied to general text.
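For readers unfamiliar with the Huffman coding algorithm named in this introduction, the following is a minimal sketch of the standard algorithm, which assigns shorter codes to symbols with larger weights. It is not the authors' Huffman score: this preview does not describe how the score is derived from the knowledge bases, so the weights below are made-up values and the function name is hypothetical.

```python
import heapq


def huffman_code_lengths(weights):
    """Return the Huffman code length of each symbol.

    weights is a dict mapping a symbol to a positive weight (for example,
    a frequency). Heavier symbols receive shorter codes.
    """
    if not weights:
        return {}
    if len(weights) == 1:
        # A single symbol still needs one bit to be encoded.
        return {symbol: 1 for symbol in weights}

    # Each heap entry is (total weight, tie-breaker, {symbol: depth}).
    # The unique tie-breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {symbol: 0}) for i, (symbol, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper in the tree.
        w1, _, depths1 = heapq.heappop(heap)
        w2, _, depths2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2]


# Toy usage with made-up weights.
lengths = huffman_code_lengths({"a": 5, "b": 2, "c": 1, "d": 1})
```

In this toy example the heaviest symbol ends up closest to the root of the tree, so its code is the shortest.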
