1. Introduction
The emergence of the internet has led to an explosion of textual information that is easily accessible to all. This information is available as documents of various types or as images. Numerous studies have been carried out to devise techniques for reading and analyzing such data (Narang et al., 2020; Kumar et al., 2020). Natural Language Processing (NLP) approaches have been effective in analyzing textual information to extract the significant information it contains (Li et al., 2022). Text summarization is an application of NLP that selects the most representative sentences in a text to generate a summary. It aims to capture the most significant information present in one or more documents in a condensed version (Joshi et al., 2021). Users can understand the content of a document from its summary without reading it completely. Because a summary presents the information in a clutter-free format, readers can identify the relevance of the text immediately. Moreover, summarization acts as an effective tool for filtering out irrelevant web pages while browsing (Garg & Saini, 2019).
Text summarization can be broadly categorized into two types, namely abstractive and extractive summarization. Abstractive summarization (Shi et al., 2021) aims to produce a summary based on a deep understanding of the sentences in the document (Ye & Li, 2021). An abstractive summary is produced by interpreting the text using diverse features of the language. Extractive summarization (Belwal et al., 2021), on the other hand, selects the most important sentences in the document by considering their linguistic and statistical features and then combines the selected sentences to generate the summary (Jain et al., 2022).
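To make the extractive idea concrete, the following is a minimal sketch that uses a single statistical feature, word frequency, to score sentences. It is an illustration only, not the method of any of the works cited above. The function names, the punctuation set, and the optional stopword list are assumptions introduced for this example. Tokens are obtained by whitespace splitting rather than a regex word class, because Devanagari vowel signs are combining marks that a pattern such as `\w+` can split apart.

```python
import re
from collections import Counter

# Assumed punctuation set for this sketch; includes the Devanagari danda.
PUNCT = ".,!?;:\"'()।"


def split_sentences(text):
    """Split after '.', '!', '?' or the Devanagari danda when followed by whitespace."""
    parts = re.split(r"(?<=[.!?।])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Whitespace tokens, lowercased, with surrounding punctuation stripped."""
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]


def extractive_summary(text, num_sentences=2, stopwords=frozenset()):
    """Keep the sentences with the highest average word frequency, in original order."""
    sentences = split_sentences(text)
    tokens = [[w for w in tokenize(s) if w not in stopwords] for s in sentences]
    # Statistical feature: how often each word occurs in the whole document.
    freq = Counter(w for sent in tokens for w in sent)
    scores = [
        sum(freq[w] for w in sent) / len(sent) if sent else 0.0
        for sent in tokens
    ]
    # Stable sort: sentences with equal scores keep their document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sample = "Cats chase mice. Cats sleep often. Dogs bark."
    print(extractive_summary(sample, num_sentences=2))
```

A practical extractive system would typically combine several linguistic and statistical features (for example sentence position, length, or similarity to the title) rather than frequency alone.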
Summarization can also be classified as single-document or multi-document summarization, depending on whether it is applied to a single document or to a collection of documents (Gulati & Sawarkar, 2017). Compared with a single document, generating a summary from multiple documents is challenging owing to redundant information, the need to collect significant data from various documents, and so on (Verma & Verma, 2020; Yu et al., 2021). Though several approaches have been developed for summarizing resource-rich languages such as English (Xu et al., 2022; Nan et al., 2021), only a few studies have addressed the summarization of Hindi documents. Hindi is the official language of India, and it is written in the Devanagari script. The script comprises a total of 10 numerals, 13 vowels, and 33 consonants. Further, it has a huge set of conjunctions that can be used in varied combinations, and it possesses an explicit, structured set of grammar rules, which makes it highly sophisticated and exceptional. Thus, analyzing a document in Hindi to understand its basic idea requires considerable effort (Puri & Singh, 2019; Zuo et al., 2020). Developing an automatic technique for Hindi document summarization is especially challenging for stories and novels, as a story has about 20,000 words on average, whereas a novel has around 40,000 words. Thus, the prevailing approaches are not sufficient for generating a summary (Joshi et al., 2021). Adding to this complexity, the compression ratio of long documents impacts the computational cost, thereby affecting the performance of the system (Wu et al., 2017; Yu et al., 2019). A document should be summarized in such a way that the generated summary meets three criteria: relevance, non-redundancy, and coverage. The summary should contain only relevant information, with no repeated content, and it should cover all the contents of the document (Narayan et al., 2020).
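The compression ratio and two of the three criteria can be approximated with simple word-overlap measures. The sketch below is only illustrative and rests on assumptions that do not come from the cited works: compression ratio is taken as summary length divided by document length in words, redundancy as the highest Jaccard overlap between any two summary sentences, and coverage as the share of the document's distinct words that appear in the summary. Relevance is not measured here, since it needs a notion of what the reader is looking for.

```python
from itertools import combinations

# Assumed punctuation set for this sketch; includes the Devanagari danda.
PUNCT = ".,!?;:\"'()।"


def words(text):
    """Whitespace tokens, lowercased, with surrounding punctuation stripped."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def compression_ratio(document, summary):
    """Summary length over document length, in words (assumed definition)."""
    doc_len = len(words(document))
    return len(words(summary)) / doc_len if doc_len else 0.0


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundancy(summary_sentences):
    """Highest pairwise overlap among summary sentences; 0.0 means no shared words."""
    pairs = combinations(summary_sentences, 2)
    return max((jaccard(words(x), words(y)) for x, y in pairs), default=0.0)


def coverage(document, summary):
    """Fraction of the document's distinct words that also occur in the summary."""
    doc_vocab = set(words(document))
    if not doc_vocab:
        return 0.0
    return len(doc_vocab & set(words(summary))) / len(doc_vocab)


if __name__ == "__main__":
    doc = "Cats chase mice. Cats sleep often. Dogs bark."
    print(compression_ratio(doc, "Cats chase mice."))
    print(redundancy(["Cats chase mice.", "Cats sleep often."]))
    print(coverage(doc, "Cats chase mice."))
```

Word overlap is a coarse proxy. It ignores synonyms and inflected forms, which matter in a morphologically rich language such as Hindi, so these numbers are better read as rough indicators than as quality scores.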