Introduction
The World Wide Web (WWW) has evolved into an information hub on the Internet. It manages an enormous flood of information and presents it to Internet users (Ahmed & Mahmood, 2012). Text summarization tools can be an indispensable answer to the information overload problem: they are useful for generating a quick overview of an entire document and for supporting document indexing, question-answering systems, retrieval systems, and similar applications.
Depending on the size of the input to the summarization process, automatic text summarization is of two types: single-document summarization and multi-document summarization. When the input is only one document, the task is called single-document summarization; when the input is a set of related text documents, it is called multi-document summarization. A summary is produced in the form of an abstract or an extract. An extract is a summary consisting of a number of important textual units selected from the input(s) (Mani, 2001). An abstract is a summary that represents the subject matter of the input(s) using text units generated by reformulating the salient units selected from the input(s).
Multilingual summarization is defined by Mani (2001) as “processing several languages, with summary in the same language as input.” Since the summarization approach presented in this paper can process both English and Bengali documents and produces a summary in the same language as the input, it can be regarded as a multilingual summarization approach.
The earliest works on text summarization use features such as sentence position, word importance, cue phrases, title information, and sentence length for ranking sentences (Baxendale, 1958; Luhn, 1958; Edmundson, 1969; Sarkar, 2012a; Sarkar, 2012b). The approach presented in Hewahi and Abu Kwaik (2012) uses multi-word terms such as entity objects (names, places) and specialized terminology, along with some semantic features specific to the Arabic language, for Arabic text summarization.
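To make the feature-based ranking idea concrete, the following is a minimal sketch of an extractive scorer that combines the five features named above (sentence position, word importance, cue phrases, title information, and sentence length). It is not the method of any of the cited works: the cue-phrase list, the stopword list, the equal weighting of features, and the function names `score_sentences` and `extract` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; real systems use much larger ones.
CUE_PHRASES = ("in conclusion", "in summary", "significantly")
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for", "on"}


def tokenize(text):
    # Naive English-only tokenizer, adequate only for illustration.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def score_sentences(sentences, title, min_len=3):
    """Score each sentence as an unweighted sum of five simple features."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    title_words = set(tokenize(title))
    n = len(sentences)
    scores = []
    for i, s in enumerate(sentences):
        words = tokenize(s)
        if not words:
            scores.append(0.0)
            continue
        position = 1.0 - i / n  # earlier sentences score higher
        # Average word frequency, normalized by the most frequent word.
        importance = sum(freq[w] for w in words) / (len(words) * max(freq.values()))
        cue = 1.0 if any(p in s.lower() for p in CUE_PHRASES) else 0.0
        overlap = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        length_ok = 1.0 if len(words) >= min_len else 0.0  # penalize very short sentences
        scores.append(position + importance + cue + overlap + length_ok)
    return scores


def extract(sentences, title, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In practice the features would be weighted, with the weights tuned on a development set; the unweighted sum is used here only to keep the sketch short.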
The centroid-based summarization system (Radev, Jing, Styś, & Tam, 2004) assigns weights to sentences based on their similarity to a centroid, where the centroid is represented by a set of single-word terms whose weights exceed a predefined threshold. The centroid-based approach differs from ours in two respects: it does not use multi-word keyphrases for text summarization, and it does not use keyphrases to maximize diverse but relevant information in the summary.
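The centroid idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Radev et al. (2004): the term weight (term frequency multiplied by a smoothed inverse sentence frequency), the use of cosine similarity, and the function names `build_centroid` and `centroid_scores` are all choices made here for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive English-only tokenizer, adequate only for illustration.
    return re.findall(r"[a-z]+", text.lower())


def build_centroid(sentences, threshold):
    """Return {term: weight} for single-word terms whose weight exceeds threshold.

    Weight = term frequency * log(1 + N / sentence frequency). This weighting
    is an assumption of the sketch, not necessarily that of the cited system.
    """
    n = len(sentences)
    tf, df = Counter(), Counter()
    for s in sentences:
        tokens = tokenize(s)
        tf.update(tokens)
        df.update(set(tokens))
    weights = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
    return {t: w for t, w in weights.items() if w > threshold}


def cosine(vec, centroid):
    """Cosine similarity between a sentence count vector and the centroid."""
    dot = sum(count * centroid.get(t, 0.0) for t, count in vec.items())
    norm_v = math.sqrt(sum(c * c for c in vec.values()))
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    if norm_v == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_v * norm_c)


def centroid_scores(sentences, threshold):
    """Score every sentence by its similarity to the thresholded centroid."""
    centroid = build_centroid(sentences, threshold)
    return [cosine(Counter(tokenize(s)), centroid) for s in sentences]
```

Because the centroid contains only single-word terms, a multi-word keyphrase contributes to a sentence's score only through its individual words, which is the limitation noted above.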
Many research works on keyphrase extraction (Turney, 2000; Sarkar, Nasipuri, & Ghose, 2012) have identified text summarization, automatic indexing, etc. as application areas of keyphrase extraction. The work presented in Wu and Li (2008) treats keyphrases as document key concepts and incorporates these key concepts in search results. Unlike the above-mentioned works, which concentrated only on the keyphrase extraction task, our proposed work investigates the use of keyphrases (key concepts) in improving summarization performance.
D’Avanzo and Magnini (2005) used multi-word keyphrases for text summarization tasks. Our proposed summarization approach differs from theirs because their approach places less emphasis on the redundancy issue, although redundancy is crucial for text summarization tasks.