Language Independent Summarization Approaches

Firas Hmida (LINA Nantes-University, France)
DOI: 10.4018/978-1-4666-6042-7.ch023
In this chapter, the authors introduce monolingual and multilingual summarization and present the problem of the summarization process depending on language and linguistic knowledge. They then describe the most influential works and techniques in the field of automatic multilingual and language-independent summarization, which are presented as a solution to this problem. The authors present several language-independent approaches and the techniques they use. In the next section, they study the behavior of these methods by discussing their limitations and perspectives.
Automatic summarization synthesizes a compressed representation of an information source while preserving the important information of the original content. It is a very complicated task, yet people generally produce summaries efficiently. Work in this field has aimed to imitate the cognitive process of generating a summary. Research has long focused on scientific documents and on press reports. This work deals only with text summarization. We can distinguish two types of summaries: the single-document summary, where there is only one source document, and the multi-document summary, where the analyzed information may come from several documents. A summary can also serve different purposes: it is generic if it treats all the topics in a document with the same degree of importance, whereas if it deals only with the specific part of the information that is required, it is called an oriented summary.

One can think of an approach to summarization as being either an extraction method or an abstraction method, with rather different implications. Extraction consists of selecting textual units (words, sentences, etc.) that are supposed to contain important information from the document and then assembling those units to produce an “extract”. In other words, an extract is a part taken from a source document in order to provide an overview of its content (Boudin, 2008). Producing an “abstract” requires understanding the contents of a source document and reformulating them. An abstract is a gloss that describes those contents implicitly, which means that it does not have to use the same language as the original document. Lin and Hovy (2003) reported that nearly 65% of the sentences in manually created summaries are extracted from the source document without any modification.
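
To make the extraction idea concrete, the following is a minimal sketch of a sentence extractor that uses no language-specific resources. It is not a method described in this chapter: the word-frequency scoring, the naive punctuation-based sentence splitter, and the names split_sentences, tokenize and extract_summary are all assumptions chosen for illustration only.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # \w+ matches Unicode word characters, so no language-specific lexicon is used.
    return re.findall(r"\w+", sentence.lower())


def extract_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document act as a crude importance signal.
    freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favored by length alone.
        return sum(freq[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

For example, extract_summary("Cats chase mice. Cats like cats. Dogs bark.", 1) returns ["Cats like cats."], because that sentence has the highest average word frequency. The selected units are copied verbatim from the source, which is what distinguishes an extract from an abstract. A real system would need a more robust sentence splitter, since whitespace and punctuation conventions differ across writing systems.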

Multilingual summarization stems from monolingual automatic summarization: both have the same functionalities, but multilingual summarization adds a new dimension. Broadly, it is defined as a process that involves more than one language in the automatic text summarization process.

