SRU-based Multi-angle Enhanced Network for Semantic Text Similarity Calculation of Big Data Language Model

Jing Huang, Keyu Ma
DOI: 10.4018/IJITSA.319039

Abstract

As a fundamental problem of natural language processing (NLP), the calculation of semantic text similarity plays a crucial role in a variety of big data application scenarios. In text similarity modeling, however, the complexity and ambiguity of Chinese semantics make it difficult to effectively capture the semantic interaction characteristics of Chinese text from a single angle. This study proposes a deep learning-based computational model for semantic text similarity called the SRU-based multi-angle enhanced network (SMAEN). Specifically, the authors first combine character-granularity and word-granularity embeddings obtained from a pre-trained model to represent the text. The text is encoded using a bidirectional simple recurrent unit (Bi-SRU) network, and local text similarity is captured using a soft-aligned attention mechanism. In addition, the authors integrate the Bi-SRU with an improved convolutional neural network (CNN) for global similarity modeling, capturing the semantic, temporal, and spatial characteristics of short-text interaction. Finally, a pooling layer aggregates the results into a fixed-length vector, and a multi-layer perceptron (MLP) classifier makes the final decision. Experimental results on the Chinese public datasets LCQMC and PAWS-X show that the proposed method fully captures semantic interaction features from multiple angles and achieves advanced performance. The method produces better matching results and improves the accuracy of big data analysis, and it is applicable to numerous big data scenarios such as information retrieval and recommendation systems.
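To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of its stages: fused character/word embeddings, a recurrent encoder, soft-aligned attention for local similarity, a CNN over interaction features for global similarity, pooling, and an MLP. All layer sizes, the feature-fusion scheme, and the use of nn.GRU as a stand-in for the Bi-SRU are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the SMAEN pipeline described in the abstract.
# Sizes, fusion choices, and the GRU stand-in for Bi-SRU are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMAENSketch(nn.Module):
    def __init__(self, char_vocab, word_vocab, dim=128, hidden=128):
        super().__init__()
        # Character- and word-granularity embeddings, concatenated
        # (assumes both sequences are aligned to the same length).
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.word_emb = nn.Embedding(word_vocab, dim)
        # Stand-in for the bidirectional simple recurrent unit (Bi-SRU).
        self.encoder = nn.GRU(2 * dim, hidden, bidirectional=True, batch_first=True)
        # Improved-CNN stand-in over interaction features (global similarity).
        self.conv = nn.Conv1d(8 * hidden, hidden, kernel_size=3, padding=1)
        # MLP classifier over the pooled fixed-length vector.
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def encode(self, chars, words):
        x = torch.cat([self.char_emb(chars), self.word_emb(words)], dim=-1)
        out, _ = self.encoder(x)                      # (batch, len, 2*hidden)
        return out

    def soft_align(self, a, b):
        # Soft-aligned attention: each position in one text attends over
        # every position in the other, giving local similarity features.
        scores = torch.bmm(a, b.transpose(1, 2))      # (batch, len_a, len_b)
        a_hat = torch.bmm(F.softmax(scores, dim=2), b)
        b_hat = torch.bmm(F.softmax(scores, dim=1).transpose(1, 2), a)
        return a_hat, b_hat

    def forward(self, chars_a, words_a, chars_b, words_b):
        a = self.encode(chars_a, words_a)
        b = self.encode(chars_b, words_b)
        a_hat, b_hat = self.soft_align(a, b)
        # Local interaction features: original, aligned, difference, product.
        feats_a = torch.cat([a, a_hat, a - a_hat, a * a_hat], dim=-1)
        feats_b = torch.cat([b, b_hat, b - b_hat, b * b_hat], dim=-1)
        conv_a = F.relu(self.conv(feats_a.transpose(1, 2)))
        conv_b = F.relu(self.conv(feats_b.transpose(1, 2)))
        # Pooling aggregates each text into a fixed-length vector.
        v = torch.cat([conv_a.max(dim=2).values,
                       conv_b.max(dim=2).values], dim=-1)
        return self.mlp(v)                # logits: similar / not similar
```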

Introduction

In the big data era, accurately finding the required data in massive text collections is an important issue (Mohamed et al., 2020). Advances in deep learning and big data (Wu et al., 2021; Liu, 2022) provide strong support for this task. Big data application research relies on the calculation of semantic text similarity (STS), and the rapid advancement of deep learning has significantly improved the effectiveness of STS calculation.

STS calculation (Chen et al., 2021) is an essential topic in NLP: it determines the similarity of two pieces of text in a variety of big data applications, such as information retrieval (Song et al., 2018; Chen et al., 2021), automatic question answering (Chen & Xu, 2021; Hu et al., 2021; Zheng et al., 2021), machine translation (Mistree et al., 2022; Niu et al., 2021; Cheng et al., 2021), and recommendation systems (Gong et al., 2021; Ghasemi & Momtazi, 2021). These tasks can predominantly be abstracted as text semantic matching problems. Information retrieval finds documents that match a user query. Automatic question answering finds the most appropriate candidate answer based on its relevance to the question. Machine translation matches text across two languages based on relevance. A recommendation system matches items a user may be interested in against the user's behavior. Because the rich semantic information within text cannot yet be fully exploited, the similarity calculation of Chinese semantic text still faces great challenges.

Chinese text is abstract and complicated, and the requirements for representing Chinese text are correspondingly stricter. A CNN or recurrent neural network (RNN) is typically employed to encode text. Long short-term memory (LSTM) and gated recurrent unit (GRU) networks can effectively mitigate the vanishing-gradient problem. However, recurrent networks scale poorly because their recurrence is inherently sequential, and CNNs require a large amount of computation; neither balances computational cost against model capacity well.
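This scalability limitation is what motivates the SRU: in the formulation of Lei et al. (2018), all matrix multiplications depend only on the current input, so they can be batched across the whole sequence, leaving only a cheap element-wise recurrence in the sequential loop. The following is a minimal sketch of one SRU layer under that formulation; the fused-projection layout is an implementation assumption.

```python
# Minimal sketch of a simple recurrent unit (SRU) layer, following the
# formulation of Lei et al. (2018). The heavy projections are computed
# for all time steps at once; only the element-wise recurrence is serial,
# which is the source of the SRU's speed advantage over LSTM/GRU.
import torch
import torch.nn as nn

class SRULayerSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # One fused projection producing candidate, forget, and reset inputs.
        self.proj = nn.Linear(dim, 3 * dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        cand, f_in, r_in = self.proj(x).chunk(3, dim=-1)
        f = torch.sigmoid(f_in)           # forget gate, all steps at once
        r = torch.sigmoid(r_in)           # reset gate, all steps at once
        c = torch.zeros_like(x[:, 0])     # internal state
        outputs = []
        for t in range(x.size(1)):
            # Element-wise recurrence: no matrix multiply inside the loop.
            c = f[:, t] * c + (1 - f[:, t]) * cand[:, t]
            # Highway connection blends the state with the raw input.
            h = r[:, t] * torch.tanh(c) + (1 - r[:, t]) * x[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)
```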

Recently, numerous scholars have made major contributions to STS tasks; their methods can be grouped into three categories. The first category comprises conventional methods, mainly Jaccard distance and SimHash, that focus solely on the literal resemblance of elements such as words, string sequences, and phrases between texts, and therefore have significant limitations.
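A toy illustration of this first category (an assumption for exposition, not code from the paper): Jaccard similarity over character bigrams measures only literal overlap, so paraphrases that share no surface characters score zero even when their meaning is identical.

```python
# Jaccard similarity over character bigrams: |A ∩ B| / |A ∪ B|.
def char_bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a, b):
    x, y = char_bigrams(a), char_bigrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0

# Literal overlap scores high; semantically related paraphrases with
# different surface forms score low, which is the category's limitation.
print(jaccard("machine translation", "machine transcription"))  # high
print(jaccard("big data", "large-scale data"))                  # low
```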

The second category relies on machine learning techniques that represent text as vectors and measure semantic similarity with statistical methods, primarily the vector space model (VSM), latent semantic analysis (LSA), and others. However, these methods ignore word position and perform poorly on complex tasks.
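The following is a sketch of this second category using scikit-learn's TF-IDF vector space model with cosine similarity (applying TruncatedSVD to the TF-IDF matrix would yield an LSA variant). It also demonstrates the stated weakness: bag-of-words vectors discard word order.

```python
# TF-IDF vector space model with cosine similarity (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the mat sat on the cat"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Identical bags of words give cosine similarity 1.0 even though the two
# sentences describe different situations, because word position is ignored.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```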

The third category depends on deep learning approaches that use deep models to capture semantic information and interaction features from text. Three frameworks are primarily involved. The first is the representational framework, whose main idea is the “Siamese structure” (Bromley et al., 1993): two symmetric networks with shared parameters represent the two texts and then compute their similarity, at low complexity. Typical examples include the deep structured semantic model (DSSM) (Huang et al., 2013; Chen et al., 2020) and ARC-I (Hu et al., 2014). This framework lacks semantic interaction information during encoding and cannot measure the contextual importance of words. As a result, the interactive framework was proposed, with “matching aggregation” as its central concept (Wang & Jiang, 2016) and an attention mechanism used to strengthen textual interaction by collecting both interactive and semantic information. The third framework fine-tunes a pre-trained model (Devlin et al., 2019; Liu et al., 2019; Zhang et al., 2021) to complete specific matching tasks. Although its accuracy is higher, its order of magnitude, parameter size, and time cost all exceed those of the previous two frameworks, and it struggles to balance model capacity against accuracy.
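The following is a minimal sketch of the “Siamese structure” behind the representational framework: one shared encoder (weights tied by construction) maps each text to a fixed vector, and similarity is computed only at the end. All sizes and the mean-pooled LSTM encoder are illustrative assumptions. The absence of any cross-text interaction before the final score is exactly the weakness that the interactive framework addresses.

```python
# Minimal Siamese (representational) matching sketch; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSketch(nn.Module):
    def __init__(self, vocab, dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.LSTM(dim, hidden, batch_first=True)

    def represent(self, tokens):
        out, _ = self.enc(self.emb(tokens))
        return out.mean(dim=1)            # fixed-length text vector

    def forward(self, tokens_a, tokens_b):
        # The same parameters encode both inputs; the two texts never
        # interact until this final cosine score.
        va = self.represent(tokens_a)
        vb = self.represent(tokens_b)
        return F.cosine_similarity(va, vb, dim=-1)
```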
