Hybridization Between Scoring Technique and Similarity Technique for Automatic Summarization by Extraction

Mohamed Amine Boudia (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Amine Rahmani (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Mohamed Elhadi Rahmani (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Abdelatif Djebbar (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Hadj Ahmed Bouarara (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Fatima Kabli (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria) and Mohamed Guandouz (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria)
DOI: 10.4018/IJOCI.2016010101

Abstract

To generate a summary automatically, the literature offers three approaches: by classification, by understanding, and by extraction, the last being the most widely used and the easiest to implement. Within the extraction approach, three basic techniques exist: extraction by scoring, extraction by similarity, and, last but not least, extraction by prototype. In previous work, the authors always used a single technique and then proposed various ways to optimize the results, either through optimization algorithms or through bio-inspired methods, such as artificial ants or social spiders, to improve the performance of automatic summarizers. Each technique has its own strengths and weaknesses. In this work, the authors therefore propose to apply two techniques one after the other, so that the strengths of one compensate for the weaknesses of the other. The paper first gives a short state of the art that allows the strengths and weaknesses of each technique to be identified, and then presents the proposed hybrid approach.

1. Introduction and Problem Statement

Day by day, the body of electronic textual information grows, and it becomes increasingly difficult to find relevant information without specific tools that provide rapid and effective access to the content of texts. Software engineering has matured and hardware has advanced to the point where today's personal machines are powerful; the task that remains is to find suitable methods to access the content of texts.

A summary of a text is an effective way to represent its content and allows quick access to it. The purpose of automatic summarization is to produce a short text covering the essential content of the source text. "We cannot imagine our daily life without summary," says Inderjeet Mani (Mani, 2001).

Headlines, the first paragraph of a newspaper article, newsletters, weather reports, tables of results of sports competitions, and library catalogues are all summaries. Even in research, authors must accompany their scientific papers with abstracts written by themselves.

Automatic summaries can be used to reduce the time needed to find relevant documents, or to reduce the processing of large texts by identifying their key information. The suggested procedure relies on the principle that "high-frequency words in a document are important words" (Luhn, 1958).
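Luhn's frequency principle can be sketched as follows: score each sentence by the summed document frequency of its significant words. This is a minimal illustration, assuming a naive sentence splitter and a tiny stopword list, not the exact procedure of Luhn (1958) or of this paper.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "on"})

def luhn_scores(text):
    """Rank sentences by the summed corpus frequency of their non-stopword terms.

    A sketch of Luhn's high-frequency-words heuristic: words frequent in the
    document are assumed important, and sentences are scored accordingly.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Document-wide term frequencies over significant (non-stopword) words.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    scored = []
    for s in sentences:
        tokens = [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        scored.append((sum(freq[t] for t in tokens), s))
    return sorted(scored, reverse=True)  # highest-scored sentences first
```

A summary is then obtained by keeping the top-ranked sentences in their original document order.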

The current literature presents three approaches of automatic summarization:

  • Automatic summarization by extraction, with three essential techniques: by scoring, by similarity, or by prototype sentence (Edmundson, 1969; Van Dijk, 1985).

  • Automatic summarization by understanding, using methods of semantic analysis (Salton et al., 1997; Kintsch & Van Dijk, 1978).

  • Automatic summarization by automatic classification, using bi-classification methods (Litvak & Last, 2008).

In this paper we work on automatic summarization by extraction, because it is simple to implement and gives good results; in previous works, however, the summary was produced using a single technique at a time: scoring, similarity, or prototype sentence.

Scoring generally gives good results, but its weak point is its reduced ability to eliminate similar sentences: if a sentence X passes the scoring filter, a sentence Y that is similar to X will probably have a score that allows it to pass the filter as well, which produces repetition in the summary. Conversely, the similarity technique is strong at eliminating repetitive sentences, but it cannot guarantee that the sentences it keeps carry high weight: the longer a sentence is, the higher the probability that other sentences are similar to it and that it gets discarded, even though long sentences tend to carry more information.
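The redundancy problem described above can be made concrete with a similarity measure between sentences. The sketch below uses cosine similarity over bag-of-words vectors, a common choice but an assumption here, since the paper does not fix the measure at this point: two near-duplicate sentences receive a similarity close to 1, so a scoring-only filter would keep both.

```python
import math
import re
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two sentences on bag-of-words count vectors.

    Returns a value in [0, 1]: 1.0 for identical word distributions,
    0.0 for sentences sharing no words.
    """
    va = Counter(re.findall(r"[a-z']+", a.lower()))
    vb = Counter(re.findall(r"[a-z']+", b.lower()))
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two sentences such as "The cat sat on the mat" and "A cat sat on a mat" score above 0.8 under this measure, so a similarity filter would discard one of them while a pure scoring filter would not.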

This work aims to use the two techniques one after the other, so that each covers the weak point of the other and brings its strength to the overall approach. To measure the impact of this proposition, we experimented with our approach and compared it with the results obtained using a single technique.
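The chaining of the two techniques can be sketched as a greedy selection: rank sentences by score, then admit a sentence to the summary only if it is not too similar to any sentence already kept. The threshold value, the bag-of-words cosine measure, and the function names below are illustrative assumptions, not the exact parameters of the authors' system.

```python
import math
import re
from collections import Counter

def _vec(s):
    # Bag-of-words count vector for a sentence.
    return Counter(re.findall(r"[a-z']+", s.lower()))

def _cos(a, b):
    # Cosine similarity between two count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_summary(scored_sentences, k=2, threshold=0.7):
    """Score-then-similarity selection: scored_sentences is a list of
    (score, sentence) pairs, e.g. produced by a scoring pass.

    Keep the k highest-scored sentences, but discard any candidate whose
    cosine similarity to an already-kept sentence exceeds the threshold,
    so the scoring strength and the redundancy filtering combine.
    """
    kept = []
    for _, sent in sorted(scored_sentences, reverse=True):
        if all(_cos(_vec(sent), _vec(p)) <= threshold for p in kept):
            kept.append(sent)
        if len(kept) == k:
            break
    return kept
```

With this ordering, scoring guarantees that only heavy sentences are candidates, and the similarity pass removes the near-duplicates that scoring alone would let through.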

Automatic summarization appeared early as a field of research in computer science, within NLP (Natural Language Processing): Luhn (1958) proposed a first approach to producing automatic abstracts by extracting sentences.

In the early 1960s, H. P. Edmundson and other participants in the TRW (Thompson Ramo Wooldridge Inc.) project (Edmundson, 1963) proposed a new automatic summarization system that combined several criteria to assess the relevance of the sentences to extract.

These works identified the fundamental issues in automatic summarization, such as the problems caused by building summaries through extraction (redundancy, incompleteness, breaks in cohesion, etc.), the theoretical inadequacy of purely statistical methods, and the difficulty of understanding a text (through semantic analysis) in order to summarize it.
