Complex Terminology Extraction Model from Unstructured Web Text Based Linguistic and Statistical Knowledge

Fethi Fkih (MARS Research Unit, Faculty of sciences of Monastir, University of Monastir, Monastir, Tunisia) and Mohamed Nazih Omri (MARS Research Unit, Faculty of sciences of Monastir, University of Monastir, Monastir, Tunisia)
Copyright: © 2012 | Pages: 18
DOI: 10.4018/ijirr.2012070101

Textual data remain the most interesting source of information on the web. In the authors’ research, they focus on a very specific kind of information, namely “complex terms”. Complex terms are defined as semantic units composed of several lexical units that can describe the text content in a relevant and exhaustive way. In this paper, they present a new model for complex terminology extraction (COTEM), which integrates linguistic and statistical knowledge. The authors focus on three main contributions. First, they show the possibility of using a linear Conditional Random Field (CRF) for complex terminology extraction from a specialized text corpus. Second, they prove the ability of a Conditional Random Field to model linguistic knowledge by incorporating grammatical observations in the CRF’s features. Finally, they present the benefits gained by the integration of statistical knowledge on the quality of the terminology extraction.
1. Introduction

Data on the web are very heterogeneous: we can find several types of information, such as text, images and video. Textual content remains the most interesting. As detailed in Kwok, Etzioni and Weld (2001) and Popescu and Etzioni (2005), unstructured Web text is characterized, compared to other types of information, by several features: a huge volume, difficulty of extraction, heterogeneity of knowledge, wealth of useful information, etc.

Textual information is the main basis of the information retrieval (IR) process, in which the indexing task is very important. Indeed, poor indexing of documents will necessarily lead to bad results. It is therefore important to improve the quality of index extraction in order to increase the efficiency of information retrieval on the Web.

We define the problem of indexing as follows: for a given document, how can its content be represented in an exhaustive and unambiguous way? Thus, the ultimate goal of indexing is to select semantic and meaningful tokens that can help with the semantic modelling of documents. In the literature of the Natural Language Processing (NLP) field, these semantic tokens are often called “terms”.

The manual extraction of meaningful terms from textual documents is very costly in time and resources. It is therefore necessary to develop methods for automatic term extraction.

Earlier work in the information extraction field focused on exploiting structured and semi-structured text (Chang, Hsu, & Lui, 2003; Zhai & Liu, 2006; Subhashini & Jawahar Senthil Kumar, 2011). More recently, several research works have been directed towards extraction from unstructured Web text. We cite, among others, the use of lexico-syntactic patterns (Hearst, 1992), the use of generic patterns and a bootstrap approach in order to learn semantic relations from text (Pennacchiotti & Pantel, 2006), and the use of a Relational Markov Network framework (Bunescu & Mooney, 2004).

In our research, we are interested in a specific type of information, namely terminology, which has its own linguistic, statistical and semantic characteristics (detailed in the remainder of this article).

In this context, we propose a new model for terminology extraction. This hybrid model combines linguistic and statistical knowledge; it is composed of two main modules: linguistic for extraction and statistical for filtering.

The linguistic module is based on Conditional Random Fields (CRF) enriched by shallow linguistic knowledge. Indeed, probabilistic models, and CRFs in particular, have proven their value in several application areas of Natural Language Processing (NLP) such as text chunking, morphosyntactic annotation (Lafferty, McCallum & Pereira, 2001) and Named Entity Recognition (NER) (Okanohara, Miyao, Tsuruoka, & Tsujii, 2006). CRFs have not yet been applied to terminology extraction from specialized text corpora. This may be due to the difficulty and complexity of extracting relevant terms, which stem from their linguistic nature and semantic specificity. It is therefore original to propose a CRF-based model that uses linguistic knowledge for complex terminology extraction.
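To make the idea of incorporating grammatical observations into CRF features more concrete, the following is a minimal sketch and not the feature set used in this paper. It assumes each sentence is already tokenized and tagged with part-of-speech labels; the function names and feature keys (`token_features`, `prev.pos`, and so on) are hypothetical and chosen only for illustration.

```python
def token_features(sentence, i):
    """Build a feature dict for the token at position i.

    `sentence` is a list of (word, pos_tag) pairs. The feature keys
    below are illustrative assumptions, not the paper's features.
    """
    word, pos = sentence[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "pos": pos,  # grammatical observation on the current token
    }
    if i > 0:
        feats["prev.pos"] = sentence[i - 1][1]  # tag of the left neighbour
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        feats["next.pos"] = sentence[i + 1][1]  # tag of the right neighbour
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sentence):
    """Return one feature dict per token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical tagged phrase used only to show the output shape.
example = [("acute", "JJ"), ("renal", "JJ"), ("failure", "NN")]
for feats in sentence_features(example):
    print(feats)
```

A sequence of such dictionaries, paired with one label per token marking whether the token belongs to a complex term, is the kind of input a linear-chain CRF trainer would typically consume.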

The statistical module is based on joint frequency calculations of tokens in a fixed-size window. The goal is to quantify the strength of the connection between lexical units. These statistical measures are considered good indicators for deciding whether the co-occurrence of two lexical units is significant or merely due to chance.
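As an illustration of joint frequency calculation in a fixed-size window, the sketch below counts ordered token pairs and scores them with pointwise mutual information (PMI). PMI is an assumption made here for the example: this introduction does not name the association measure actually used. The function names, the window size and the sample sentence are likewise hypothetical, and the probability estimates are deliberately simplified (both unigram and pair counts are divided by the number of tokens).

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count tokens and ordered pairs whose positions differ by less than `window`."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, left in enumerate(tokens):
        # With window=2 only the immediate right neighbour is considered.
        for right in tokens[i + 1 : i + window]:
            pairs[(left, right)] += 1
    return unigrams, pairs


def pmi(pair, unigrams, pairs, n_tokens):
    """Pointwise mutual information of an ordered pair (assumed measure)."""
    joint = pairs[pair] / n_tokens
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    p_left = unigrams[pair[0]] / n_tokens
    p_right = unigrams[pair[1]] / n_tokens
    return math.log2(joint / (p_left * p_right))


# Hypothetical toy text, used only to exercise the functions.
tokens = "acute renal failure is common and acute renal failure is severe".split()
unigrams, pairs = cooccurrence_counts(tokens, window=2)
print(pairs[("acute", "renal")], pmi(("acute", "renal"), unigrams, pairs, len(tokens)))
```

A high score suggests that two lexical units appear together more often than their individual frequencies would predict, which is the kind of evidence a statistical filter can use to keep or discard a candidate term.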

In our research, we focus on specialized corpora (medical, biology, chemistry, etc.). This kind of textual document is characterized by a terminology that reflects the specialized language of the field considered. Indeed, specialized language is rich in scientific and technical terms, which makes these terms more visible and accessible, so that no expert intervention is required to identify them.

The remainder of this paper is structured as follows. Section 2 presents the main approaches to term extraction from text documents. In Section 3, we introduce our approach to complex terminology extraction, present the features used to model different linguistic observations, and describe the theoretical principle of our statistical filter. Section 4 is devoted to the performance tests of our approach. Our experimental study was carried out on the standard test database MEDLARS and compared with other powerful models.
