A BERT-Based Hybrid Short Text Classification Model Incorporating CNN and Attention-Based BiGRU

Tong Bao, Ni Ren, Rui Luo, Baojia Wang, Gengyu Shen, Ting Guo

Source Title: Journal of Organizational and End User Computing (JOEUC) 33(6)

DOI: 10.4018/JOEUC.294580

Article PDF Download Open access articles are freely available for download

Abstract

Short text classification is a research focus for natural language processing (NLP), which is widely used in news classification, sentiment analysis, mail filtering and other fields. In recent years, deep learning techniques are applied to text classification and has made some progress. Different from ordinary text classification, short text has the problem of less vocabulary and feature sparsity, which raise higher request for text semantic feature representation. To address this issue, this paper propose a feature fusion framework based on the Bidirectional Encoder Representations from Transformers (BERT). In this hybrid method, BERT is used to train word vector representation. Convolutional neural network (CNN) capture static features. As a supplement, a bi-gated recurrent neural network (BiGRU) is adopted to capture contextual features. Furthermore, an attention mechanism is introduced to assign the weight of salient words. The experimental results confirmed that the proposed model significantly outperforms the other state-of-the-art baseline methods.

Article Preview

Top

1. Introduction

Due to the development and widespread use of the internet and mobile devices, users are always encountering and processing massive amounts of text data, such as news insights, product reviews, and messages. These large amounts of text data contain information on human social attributes, content preferences, and psychology. The careful mining and scientific analysis of these text data can generate extremely high social value. As the most basic task in the process of text data mining and analysis, text classification has been widely used in various industry fields, such as topic tagging, public opinion analysis, mail filtering, and recommendation systems (Lin, Z., et al., 2016; Ren, Y. F., et al., 2016; Kiliroor, C. C., & Valliyammai, C. 2019; Sulthana, A. R., & Ramasamy, S. 2019). Generally, short text mainly includes news headlines, social issues, product reviews, etc. Most of these texts are unstructured with the characteristics of large size, sparseness, and irregularity. Therefore, extracting the features of short texts and correctly classifying them has become one of the current challenges in the field of natural language processing (NLP).

Deep learning is a branch of machine learning. Deep learning simulates the mechanism of the human brain by establishing a deep neural network and interprets and analyzes data, such as images, voices, and texts. In text classification, the most basic but critical part is to convert text into digital vectors that computers can understand, this process is called, “The Representation of Text.” The earliest technology of text representation was one-hot encoding where the dimension of the word index is set to 1 and all of the others are set to 0. However, this representation suffers from the problem of high sparsity and dimensional explosion; More importantly, it does not consider the weight of words to text. TF-IDF (Yu, C. T., & Salton, G. 1976) is an optimized one-hot model that evaluates the importance of a word in a document or corpus, but there are still problems of dimensionality, and the model cannot reflect the sequence of information. Therefore, follow-up work has focused on constructing distributed dense word vectors with low dimensions. Word2Vec (Mikolov, T., et al., 2013) is a kind of neural network language model that considers contextual semantic information while avoiding the problem of dimensionality, whics has significantly better effects than previous models. In addition, FastText (Joulin, A., et al., 2016) is a word vector calculation and text classification tool open-sourced by Facebook in 2016. while working on classification tasks, FastText can often achieve accuracy comparable to deep networks, but it is faster than deep neural networks in training time. However, both Word2vec and FastText are static models and cannot solve the problem of polysemous words. To address this issue, Pre-trained language models, such as Embedding from Language Models(ELMo) (Peters, M. E., et al., 2018), Generate Pretraining Model(GPT) (Radford, A., et al., 2018) and the Bidirectional Encoder Representations from Transformers model (BERT) (Devlin, J., et al., 2018), have replaced Word2Vec as the current trend of word representation. ELMo uses the bidirectional long short-term memory(BiLSTM) (Hochreiter, S., & Schmidhuber, J. 1997) structure to obtain a general semantic representation through pretraining, and migrates the representation as a feature to the specific task. In addition, BERT and GPT use the transformer structure for pretraining. The fine-tuning method can be applied to training downstream special tasks by reducing the pretrained parameters, which not only saves time and computing resources but also quickly achieves better results.

Complete Article List

Search this Journal:

Reset

Volume 36: 1 Issue (2024)

Volume 35: 3 Issues (2023)

Volume 34: 10 Issues (2022)

Volume 33: 6 Issues (2021)

Volume 32: 4 Issues (2020)

Volume 31: 4 Issues (2019)

Volume 30: 4 Issues (2018)

Volume 29: 4 Issues (2017)

Volume 28: 4 Issues (2016)

Volume 27: 4 Issues (2015)

Volume 26: 4 Issues (2014)

Volume 25: 4 Issues (2013)

Volume 24: 4 Issues (2012)

Volume 23: 4 Issues (2011)

Volume 22: 4 Issues (2010)

Volume 21: 4 Issues (2009)

Volume 20: 4 Issues (2008)

Volume 19: 4 Issues (2007)

Volume 18: 4 Issues (2006)

Volume 17: 4 Issues (2005)

Volume 16: 4 Issues (2004)

Volume 15: 4 Issues (2003)

Volume 14: 4 Issues (2002)

Volume 13: 4 Issues (2001)

Volume 12: 4 Issues (2000)

Volume 11: 4 Issues (1999)

Volume 10: 4 Issues (1998)

Volume 9: 4 Issues (1997)

Volume 8: 4 Issues (1996)

Volume 7: 4 Issues (1995)

Volume 6: 4 Issues (1994)

Volume 5: 4 Issues (1993)

Volume 4: 4 Issues (1992)

Volume 3: 4 Issues (1991)

Volume 2: 4 Issues (1990)

Volume 1: 3 Issues (1989)

View Complete Journal Contents Listing

MLA

APA

Chicago

Export Reference

A BERT-Based Hybrid Short Text Classification Model Incorporating CNN and Attention-Based BiGRU

Abstract

1. Introduction

Complete Article List