BERT Tokenization and Hybrid-Optimized Deep Recurrent Neural Network for Hindi Document Summarization

Sumalatha Bandari, Vishnu Vardhan Bulusu
Copyright: © 2022 | Pages: 28
DOI: 10.4018/IJFSA.313601

Abstract

Text summarization generates a concise summary of the available information by identifying the most relevant and important sentences in a document. In this paper, an effective approach is developed for summarizing Hindi documents. The deep learning-based Hindi document summarization system comprises several phases: input data acquisition, tokenization, feature extraction, score generation, and sentence extraction. A deep recurrent neural network (Deep RNN) generates the sentence scores from the significant features, and its weights and learning parameters are updated using the devised coot remora optimization (CRO) algorithm. The developed CRO-Deep RNN is evaluated using recall-oriented understudy for gisting evaluation (ROUGE), recall, precision, and F-measure, attaining values of 80.896%, 95.700%, 95.051%, and 95.374%, respectively.
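To make the evaluation metrics concrete, the following is a minimal sketch of unigram-overlap precision, recall, and F-measure between a generated summary and a reference summary, in the style of ROUGE-1. The abstract does not state which ROUGE variant or which matching unit the authors used, so the function name `rouge1` and the choice of unigrams are assumptions made for illustration only.

```python
from collections import Counter


def rouge1(candidate_tokens, reference_tokens):
    """Unigram-overlap precision, recall and F-measure (ROUGE-1 style).

    Assumption: both summaries are already tokenized into lists of words.
    """
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Clipped matches: each word counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Precision penalizes summary words that are absent from the reference, recall penalizes reference words that the summary misses, and the F-measure is their harmonic mean.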

1. Introduction

The emergence of the internet has led to an explosion of textual information and data that are easily accessible to all. This information is available as various types of documents or as images. Numerous studies have been carried out to devise techniques for reading and analyzing the data (Narang et al., 2020; Kumar et al., 2020). Natural Language Processing (NLP) approaches have been effective in analysing textual information to obtain the significant information it contains (Li et al., 2022). Text summarization is an application of NLP that selects the most illustrative sentences in a text to generate summaries. It aims to capture the most significant information in one or more documents and produce a condensed version (Joshi et al., 2021). Users can effectively understand the content of a document from its summary without reading it completely. Because a summary presents the information in a clutter-free format, readers can immediately judge the relevance of the text. Moreover, it acts as an effective tool for filtering out irrelevant web pages while browsing (Garg & Saini, 2019). Text summarization can be broadly categorized into two types: abstractive and extractive summarization. Abstractive summarization (Shi et al., 2021) aims at producing a summary based on a deep understanding of the sentences in the document (Ye & Li, 2021). An abstractive summary is produced by interpreting the text using diverse features of the language. Extractive summarization (Belwal et al., 2021), on the other hand, selects the most important sentences in the document by considering their linguistic and statistical features and then combines the selected sentences to generate the summary (Jain et al., 2022).
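The extractive approach described above can be sketched in a few lines. This is a minimal illustration, not the authors' method: it scores each sentence with a single statistical feature (the mean frequency of its words in the document) in place of a learned model, and it assumes whitespace tokenization and sentence boundaries marked by the Devanagari danda (।) or common Latin punctuation. The function names `split_sentences` and `summarize` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: sentences end with the Devanagari danda or with '.', '?', '!'.
    parts = re.split(r"[।?!.]", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    tokens = [s.split() for s in sentences]  # naive whitespace tokenization
    freq = Counter(word for sent in tokens for word in sent)
    # Statistical feature: mean document frequency of the sentence's words.
    scores = [sum(freq[w] for w in sent) / len(sent) for sent in tokens]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = "भारत एक विशाल देश है। भारत की राजधानी दिल्ली है। आज मौसम अच्छा है।"
    print(summarize(sample, k=2))
```

Selecting sentences by score and then re-sorting them by position is the step that distinguishes extraction from abstraction: the output contains only sentences that already exist in the source.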

Summarization can also be classified as single-document or multi-document summarization, based on whether it is applied to a single document or to a collection of documents (Gulati & Sawarkar, 2017). Compared to a single document, generating a summary from multiple documents is challenging owing to redundant information, the need to collect significant data from various documents, and so on (Verma & Verma, 2020; Yu et al., 2021). Though several approaches have been developed for summarizing resource-rich languages like English (Xu et al., 2022; Nan et al., 2021), only a few studies have been carried out on the summarization of Hindi documents. Hindi is the official language of India, and it is written using the Devanagari script. The script comprises a total of 10 numbers, 13 vowels, and 33 consonants. Further, it has a huge set of conjunctions that can be utilized in varied combinations, and it has an explicit, structured set of grammar rules, which makes it highly sophisticated and exceptional. Thus, analysing a document in the Hindi language to understand the basic idea requires huge effort (Puri & Singh, 2019; Zuo et al., 2020). Developing an automatic technique for Hindi document summarization is especially challenging for stories and novels, as a story has about 20,000 words on average, whereas a novel has around 40,000 words. Thus, the prevailing approaches are not sufficient for generating a summary (Joshi et al., 2021). Adding to this complexity, the compression ratio of long documents impacts the computational cost, thereby affecting the performance of the system (Wu et al., 2017; Yu et al., 2019). A document should be summarized in such a way that the generated summary meets three criteria: relevance, non-redundancy, and coverage. The summary should contain only relevant information, with no repeated content, and it should cover all the contents of the document (Narayan et al., 2020).
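One common way to trade relevance against redundancy when picking sentences is greedy maximal marginal relevance (MMR). The sketch below is an illustration of that general technique, not the selection rule used in this paper; the function names, the Jaccard word-overlap similarity, and the weight `lam` are all assumptions.

```python
def jaccard(a, b):
    """Word-set overlap between two tokenized sentences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sentences, scores, k, lam=0.7):
    """Greedily pick k sentence indices, balancing relevance and redundancy.

    sentences: list of token lists; scores: relevance score per sentence.
    lam close to 1 favors relevance, lam close to 0 favors non-redundancy.
    """
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy: highest similarity to any sentence already chosen.
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # indices in document order
```

In this sketch the relevance scores address the first criterion and the redundancy penalty addresses the second; coverage is handled only indirectly, because penalizing overlap pushes the selection towards sentences about different parts of the document.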
