A Hybrid Approach to Retrieve Knowledge from a Document

Deepak Sahoo (IIIT-Bhubaneswar, Bhubaneswar, India) and Rakesh Chandra Balabantaray (IIIT Bhubaneswar, Bhubaneswar, India)
Copyright: © 2020 | Pages: 18
DOI: 10.4018/IJKM.2020010104

Abstract

Retrieving the theme of a document and presenting it to the user in a form shorter than the original text is a challenging task. In this article, a hybrid approach to extract knowledge from a text document is presented, in which three key sentence-level relationships are used in association with the Markov clustering algorithm to cluster the sentences in the document. After clustering, the sentences in each cluster are ranked and the highest-ranked sentences from each cluster are merged. Finally, to obtain the theme of the document, the gradient boosting technique XGBoost is used to compress the newly generated sentence. The DUC-2002 data set is used to evaluate the proposed system, and its performance is observed to be better than that of other existing systems.

Introduction

Knowledge management (KM) is a method that originated in the business world for unifying the huge number of documents generated from meetings, proposals, presentations, analytic papers, and training materials (Bordoni et al., 2002). The documents created in an organization represent its potential knowledge: "potential" because only part of this data and information will prove helpful in creating organizational knowledge. In this view, one major challenge is selecting relevant information from vast numbers of documents and making it available for use and re-use by organization members. The objective of mainstream knowledge management is to ensure that the right information is delivered to the right person at the right time, so that the most appropriate decision can be taken. In this sense, KM is not aimed at managing knowledge per se, but at relating knowledge to its usage. Along this line, we focus on the extraction of relevant information to be delivered to a decision maker.

The knowledge pyramid has been used for several years to illustrate the hierarchical relationships between data, information, knowledge, and wisdom. The revised knowledge pyramid model proposed by Jennex (2013, 2017) includes knowledge management as extraction of reality with a focus on organizational learning.

To this end, a range of Text Mining (TM) and Natural Language Processing (NLP) techniques can be used as an effective Knowledge Management System (KMS) supporting the extraction of relevant information from large amounts of unstructured textual data and, thus, the creation of knowledge (Bordoni et al., 2002).

There has been an explosion in the amount of text data from a variety of sources. This volume of text is an invaluable source of information and knowledge, which needs to be effectively summarized to be useful. Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary containing only the main points of the document. Furthermore, text summarization reduces reading time, accelerates the process of searching for information, and increases the amount of information that can fit in a given space. Automatic text summarization methods are greatly needed to cope with the ever-growing amount of text data available online, both to help discover relevant information and to consume it faster.

Approaches to extracting knowledge from a document are broadly classified into two types: extraction-based and abstraction-based. The extractive technique pulls key sentences from the source document and combines them to make a summary. Sentences are selected according to a defined metric, without any changes to their text. The abstractive technique paraphrases and shortens parts of the source document. Abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text.
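To make the extractive idea concrete, the following is a minimal sketch, not the method proposed in this article. It assumes a simple word-frequency score as the "defined metric", a naive punctuation-based sentence splitter, and a tiny hypothetical stop-word list; the function names are illustrative only. Selected sentences are returned verbatim and in their original order.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
              "it", "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, unchanged, in document order."""
    sentences = split_sentences(text)
    # Document-wide frequency of each content word.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Assumed metric: average document frequency of the sentence's words.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore original sentence order
    return [sentences[i] for i in chosen]
```

Because the sentences are copied rather than rewritten, the output is always a subset of the source text; an abstractive system, by contrast, would generate new wording.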

The literature shows that a great deal of work has been done on extracting knowledge from documents since the field's inception with H.P. Luhn (1958). Knowledge extraction can also be divided into two categories: extracting knowledge from a single document, and extracting knowledge from multiple documents of a domain.

The fundamental method of extracting knowledge from a document can be viewed as a three-step process. In the first step, the important topics in the document are identified and the best sentences for each topic are extracted based on ranking. In the second step, the words that are necessary to convey the message are retained and the other words are removed from the sentence. Finally, in the third step, phrases and groups of words that can be replaced with single words without changing the meaning are identified.
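The second and third steps can be illustrated with a small rule-based sketch. This is a stand-in for illustration only and is not the learned compression model used in this article: the list of removable words and the phrase table below are hypothetical and far too small for real use, and the function names are assumptions.

```python
import re

# Hypothetical list of words assumed not to be needed to convey the message.
REMOVABLE_WORDS = {"very", "really", "basically", "actually", "quite"}

# Hypothetical table mapping multi-word phrases to single-word equivalents.
PHRASE_TO_WORD = {
    "in order to": "to",
    "a large number of": "many",
    "due to the fact that": "because",
}


def drop_removable_words(sentence):
    """Step two: keep only the words that are not in REMOVABLE_WORDS."""
    kept = [w for w in sentence.split()
            if w.lower().strip(".,;:") not in REMOVABLE_WORDS]
    return " ".join(kept)


def replace_phrases(sentence):
    """Step three: replace known phrases with a single word."""
    for phrase, word in PHRASE_TO_WORD.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        sentence = re.sub(pattern, word, sentence, flags=re.IGNORECASE)
    return sentence


def compress(sentence):
    """Apply step two, then step three, to one already-selected sentence."""
    return replace_phrases(drop_removable_words(sentence))


print(compress("We basically collected a large number of reports "
               "in order to train the model."))
# prints "We collected many reports to train the model."
```

A fixed word list cannot tell when a word is actually needed for the meaning, which is one reason a practical system would learn the keep-or-drop decision from data rather than hard-code it.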
