Introduction
In the big data era, accurately finding the required data in massive amounts of text is an important problem (Mohamed et al., 2020). Advances in deep learning and big data (Wu et al., 2021; Liu, 2022) provide strong support for addressing it. Big data application research relies on the calculation of semantic text similarity (STS), and thanks to the rapid advancement of deep learning, the accuracy of STS calculation has been significantly enhanced.
STS calculation (Chen et al., 2021) is an essential topic in natural language processing (NLP). It determines the similarity of two pieces of text and underpins a variety of big data applications, such as information retrieval (Song et al., 2018; Chen et al., 2021), automatic question answering (Chen & Xu, 2021; Hu et al., 2021; Zheng et al., 2021), machine translation (Mistree et al., 2022; Niu et al., 2021; Cheng et al., 2021), and recommendation systems (Gong et al., 2021; Ghasemi & Momtazi, 2021). These tasks can predominantly be abstracted as text semantic matching problems: information retrieval finds documents that match a user query; automatic question answering selects the most appropriate candidate answer by its relevance to the question; machine translation matches text across two languages; and a recommendation system matches items a user may be interested in against the user's behavior. Because the rich semantic information within text cannot yet be fully exploited, Chinese semantic text similarity calculation still faces great challenges.
Chinese text is abstract and complicated, so the requirements for representing it are stricter. A convolutional neural network (CNN) or recurrent neural network (RNN) is typically employed to encode text. Long short-term memory (LSTM) and gated recurrent unit (GRU) networks can effectively mitigate the vanishing gradient problem. However, recurrent networks scale poorly, and CNNs demand a volume of computation that is hard to balance against model capacity.
Recently, numerous scholars have made major contributions to STS tasks; their methods can be grouped into three categories. The first category comprises conventional methods that focus solely on the literal resemblance of elements such as words, string sequences, and phrases between texts, mainly Jaccard distance and SimHash; these have significant limitations.
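As a concrete illustration of such literal methods, Jaccard similarity scores two texts purely by word-set overlap, so it misses paraphrases entirely. A minimal sketch follows; the whitespace tokenization is a simplifying assumption and would not suit unsegmented Chinese text, which requires a word segmenter first.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two words shared out of four distinct words overall:
print(jaccard_similarity("big data era", "big data application"))  # → 0.5
```

Because the score depends only on surface tokens, two sentences expressing the same meaning with different words receive a similarity of zero, which is the core limitation of this category.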
The second category relies on machine learning techniques that represent text as vectors and measure semantic similarity with statistical methods, primarily the vector space model (VSM), latent semantic analysis (LSA), and others. However, these methods ignore word order, and their performance on complex tasks is limited.
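The VSM idea can be sketched as cosine similarity between raw term-frequency vectors (a minimal sketch, without the weighting schemes such as TF-IDF that practical systems add). The example below also shows the word-order blindness noted above: two sentences with opposite meanings but identical bags of words score a perfect 1.0.

```python
import math
from collections import Counter

def cosine_similarity_vsm(text_a: str, text_b: str) -> float:
    """Cosine similarity between term-frequency vectors (a minimal VSM)."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    vocab = set(va) | set(vb)
    dot = sum(va[w] * vb[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same bag of words, opposite meaning, yet maximal similarity:
print(cosine_similarity_vsm("dog bites man", "man bites dog"))  # ≈ 1.0
```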
The third category depends on deep learning approaches, which use deep learning models to capture semantic information and interaction features from text. Three frameworks are primarily involved. The first is the representational framework, whose main idea is the “Siamese structure” (Bromley et al., 1993): two symmetric networks with shared parameters represent the two texts and then compute their similarity, at low complexity. Typical examples include the deep structured semantic model (DSSM) (Huang et al., 2013; Chen et al., 2020) and ARC-I (Hu et al., 2014). This framework lacks semantic interaction information during encoding and cannot measure the contextual importance of words. As a result, the interactive framework was proposed, with “matching aggregation” as its central concept (Wang & Jiang, 2016); it uses an attention mechanism to boost textual interaction by collecting both interactive and semantic information. The third framework fine-tunes a pretrained model (Devlin et al., 2019; Liu et al., 2019; Zhang et al., 2021) to complete the specific matching task. Although its accuracy is higher, its parameter count and time cost exceed those of the previous two frameworks by orders of magnitude, and it struggles to balance model capacity and accuracy.
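The representational (Siamese) idea above can be sketched in a few lines: one encoder is applied to both texts, so the two “branches” share all parameters, and the two resulting vectors are compared with cosine similarity. This is an illustrative toy, not any published model; the averaged random embeddings stand in for the trained encoder a real DSSM-style system would use.

```python
import math
import random

random.seed(0)
EMB_DIM = 8
_embeddings = {}  # toy embedding table; a real model uses trained vectors

def embed(word: str):
    if word not in _embeddings:
        _embeddings[word] = [random.gauss(0, 1) for _ in range(EMB_DIM)]
    return _embeddings[word]

def encode(text: str):
    # Shared encoder: average of word embeddings. Both texts reuse this
    # same function, which is exactly the Siamese parameter sharing.
    vecs = [embed(w) for w in text.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def siamese_similarity(text_a: str, text_b: str) -> float:
    # Encode each text independently, then compare the fixed-size vectors.
    return cosine(encode(text_a), encode(text_b))
```

Note that each text is encoded without seeing the other, which is why this framework misses cross-text interaction; the interactive and pretraining frameworks described above address exactly that gap.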