1. Introduction
Text summarization methods can be classified as extractive or abstractive. An extractive method selects important sentences, paragraphs, etc. from the original document to produce a compressed form of the original text. The importance of each sentence is decided from its statistical and linguistic features. In contrast, an abstractive method involves understanding the original text and rephrasing it in a different, more compact form without changing the meaning it conveys. Compared with extractive summarization, abstractive summarization is the more difficult task, because it requires a semantic representation of the text, inference rules and natural language generation (Erkan & Radev 2004).
Extraction involves concatenating extracts taken from the corpus into a summary, whereas abstraction involves generating novel sentences from information extracted from the corpus. It has been observed that in the context of multi-document summarization of news articles, extraction may be inappropriate because it may produce summaries which are overly verbose or biased towards some sources (Barzilay et al., 1999). Extractive summarization (Gupta & Lehal 2010) selects important sentences, paragraphs, etc. from a document and combines them into a new paragraph called the summary. The choice of sentences depends on their statistical and linguistic features. Extractive summaries are formulated by weighting the sentences as a function of high-frequency words. Here, the most frequently occurring or the most favourably positioned text is considered to be the most important.
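The frequency-based weighting described above can be illustrated with a minimal sketch. It is not the method of any cited work: the function names, the tiny stop-word list, the naive sentence splitter and the choice to average word frequencies per sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def frequency_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Word frequencies are counted over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original order
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats like cats. Dogs bark."
    print(frequency_summary(doc, n_sentences=1))
```

A position-based variant would add a bonus to the score of sentences near the start of the document; that is left out here to keep the sketch short.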
Abstractive summarization (Khan & Salim 2014) involves understanding the main concepts and relevant information of the source text and then expressing that information in a short and clear form. Abstractive summarization techniques can be further classified into two categories: structure-based and semantic-based methods. Structure-based approaches determine the most important information in documents by using templates, extraction rules and other structures such as trees and ontologies. Semantic-based approaches determine the most important information through conceptual graphs, semantic networks, semantic graphs, etc. Abstractive summarization methods produce more coherent, less redundant and information-rich summaries. Generating an abstract with these methods is difficult, since it requires more semantic and linguistic analysis.
In general, the text summarization task is performed at various levels, such as the surface, entity and discourse levels (Hahn & Mani 2000). Surface-level approaches tend to represent information in terms of shallow parsers which can then be selectively combined to yield a selection function used to extract important information. Entity-level approaches (Mani & Maybury 1999) build an internal representation of the text, modeling text entities and their relationships. Text entities are units of text, such as words, phrases, sentences or even paragraphs. These approaches tend to represent patterns of connectivity in the text to help determine what is salient. Discourse-level approaches (Mann & Thompson 1988) model the structure of the text and its relation to communicative goals.
Summarization is also carried out using graph-based approaches, such as LexRank (Erkan & Radev 2004) and TextRank (Mihalcea & Tarau 2004). LexRank has been applied to multi-document summarization, whereas TextRank has been applied to single-document summarization and keyword extraction. Both approaches apply a random walk over a fully connected, undirected graph to redistribute the node weights; text units (i.e. sentences) are represented as nodes and the similarities between them as edges.
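The random walk over a sentence graph can be sketched as follows. This is a simplified illustration and not a reimplementation of LexRank or TextRank: the plain bag-of-words cosine similarity, the damping factor of 0.85, the fixed iteration count and the function names are assumptions chosen for brevity, and the published systems define their own similarity measures.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Word counts for one sentence (lowercased)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by a damped random walk on their similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [bag_of_words(s) for s in sentences]
    # Nodes are sentences; every pair of distinct sentences gets an
    # undirected edge weighted by similarity (possibly zero).
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the
            # weight of edge (j, i); isolated sentences pass on nothing.
            incoming = sum(scores[j] * weights[j][i] / out_sum[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


if __name__ == "__main__":
    sents = ["the cat sat on the mat",
             "the cat ate the fish",
             "stock markets fell sharply"]
    for sentence, s in zip(sents, rank_sentences(sents)):
        print(round(s, 3), sentence)
```

In this sketch the third sentence shares no words with the others, so it receives only the baseline weight, while the two related sentences reinforce each other. A summary would then be formed by taking the top-scoring sentences.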