Complex Terminology Extraction Model from Unstructured Web Text Based Linguistic and Statistical Knowledge

Fethi Fkih, Mohamed Nazih Omri

Source Title: International Journal of Information Retrieval Research (IJIRR) 2(3)

DOI: 10.4018/ijirr.2012070101

Abstract

Textual data remain the most interesting source of information on the web. In their research, the authors focus on a very specific kind of information, namely “complex terms”. Complex terms are defined as semantic units composed of several lexical units that can describe the text content in a relevant and exhaustive way. In this paper, they present a new model for complex terminology extraction (COTEM), which integrates linguistic and statistical knowledge. The authors focus on three main contributions. Firstly, they show the possibility of using a linear Conditional Random Field (CRF) for complex terminology extraction from a specialized text corpus. Secondly, they demonstrate the ability of a CRF to model linguistic knowledge by incorporating grammatical observations in the CRF’s features. Finally, they present the benefits that integrating statistical knowledge brings to the quality of the terminology extraction.

1. Introduction

Data on the web are very heterogeneous and include several types of information: text, images, video, etc. Textual content remains the most interesting. As detailed in Kwok, Etzioni and Weld (2001) and Popescu and Etzioni (2005), unstructured Web text is characterized, compared to other types of information, by several features: huge volume, difficulty of extraction, heterogeneity of knowledge, a wealth of useful information, etc.

Textual information is the main basis of the information retrieval (IR) process, in which the indexing task is very important. Indeed, poor indexing of documents will necessarily lead to poor results. It is therefore important to improve the quality of index extraction in order to increase the efficiency of information retrieval on the Web.

We define the problem of indexing as follows: for a given document, how can its content be represented in an exhaustive and unambiguous way? The ultimate goal of indexing is thus to select meaningful semantic tokens that can support the semantic modelling of documents. In the Natural Language Processing (NLP) literature, these semantic tokens are often called “terms”.

The manual extraction of meaningful terms from textual documents is very costly in time and resources. It is therefore necessary to develop methods for automatic term extraction.

Earlier work in the information extraction field focused on exploiting structured and semi-structured text (Chang, Hsu, & Lui, 2003; Zhai & Liu, 2006; Subhashini & Jawahar Senthil Kumar, 2011). More recently, several research efforts have been directed towards extraction from unstructured Web text. We cite, among others, the use of lexico-syntactic patterns (Hearst, 1992), the use of generic patterns and a bootstrap approach to learn semantic relations from text (Pennacchiotti & Pantel, 2006), and the use of a Relational Markov Network framework (Bunescu & Mooney, 2004).

In our research, we are interested in a specific type of information, namely terminology, which has its own linguistic, statistical and semantic characteristics (detailed in the remainder of this article).

In this context, we propose a new model for terminology extraction. This hybrid model combines linguistic and statistical knowledge and is composed of two main modules: a linguistic module for extraction and a statistical module for filtering.

The linguistic module is based on Conditional Random Fields (CRF) enriched with shallow linguistic knowledge. Indeed, probabilistic models, and CRFs in particular, have proven their value in several application areas of Natural Language Processing (NLP), such as text chunking, morphosyntactic annotation (Lafferty, McCallum & Pereira, 2001) and Named Entity Recognition (NER) (Okanohara, Miyao, Tsuruoka, & Tsujii, 2006). CRFs have not yet been applied to terminology extraction from specialized text corpora. This may be due to the difficulty and complexity of extracting relevant terms, given their linguistic nature and semantic specificity. It is therefore original to propose a CRF-based model that uses linguistic knowledge for complex terminology extraction.
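To make the idea of enriching a CRF with grammatical observations more concrete, the following minimal sketch shows how per-token feature dictionaries could be built from part-of-speech tags before being passed to a linear-chain CRF trainer. The feature names, the coarse tag set, the `B-TERM`/`I-TERM`/`O` labelling scheme and the example sentence are all assumptions made for illustration; they are not taken from the article and may differ from the features actually used in COTEM.

```python
from typing import Dict, List, Tuple

# A sentence is a list of (word, part-of-speech tag) pairs.
# The tag set here is a hypothetical coarse one, not the article's.
Sentence = List[Tuple[str, str]]


def token_features(sentence: Sentence, i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i.

    The grammatical observations are the token's own POS tag and the
    POS tags of its immediate neighbours (assumed feature design).
    """
    word, pos = sentence[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "pos": pos,
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
    }
    if i > 0:
        feats["prev.pos"] = sentence[i - 1][1]
    if i < len(sentence) - 1:
        feats["next.pos"] = sentence[i + 1][1]
    return feats


def sentence_features(sentence: Sentence) -> List[Dict[str, object]]:
    """Feature dictionaries for every token of a sentence, in order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical training example: one label per token, marking the
# beginning (B-TERM) and continuation (I-TERM) of a complex term.
example: Sentence = [
    ("acute", "ADJ"),
    ("renal", "ADJ"),
    ("failure", "NOUN"),
    ("occurs", "VERB"),
]
example_labels = ["B-TERM", "I-TERM", "I-TERM", "O"]
features = sentence_features(example)
```

A sequence of such dictionaries, paired with one label per token, is the usual input shape for linear-chain CRF toolkits; the CRF then learns weights that relate these observations to label transitions.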

The statistical module is based on joint frequency calculations of tokens within a fixed-size window. The goal is to quantify the strength of the connection between lexical units. These statistical measures are considered good indicators for deciding whether the co-occurrence of two lexical units is significant or merely due to chance.
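As a rough illustration of this kind of filter, the sketch below counts how often two tokens co-occur inside a fixed-size window and scores each pair with pointwise mutual information (PMI). PMI is used here only as one common association measure; the passage above does not name the measure used in the model, so this choice, the function names and the toy token list are assumptions. Probabilities are estimated by dividing counts by the total number of tokens, which is a simplification.

```python
import math
from collections import Counter
from typing import List, Tuple


def cooccurrence_counts(
    tokens: List[str], window: int
) -> Tuple[Counter, Counter]:
    """Count single tokens and ordered token pairs within a window.

    `window` is the window size in tokens, including the left token,
    so window=2 counts only adjacent pairs.
    """
    unigrams: Counter = Counter(tokens)
    pairs: Counter = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            pairs[(left, right)] += 1
    return unigrams, pairs


def pmi(
    pair: Tuple[str, str], unigrams: Counter, pairs: Counter, total: int
) -> float:
    """Pointwise mutual information of a pair (assumed measure).

    Values well above zero suggest the two units co-occur more often
    than chance would predict; unseen pairs score minus infinity.
    """
    joint = pairs[pair]
    if joint == 0:
        return float("-inf")
    p_xy = joint / total
    p_x = unigrams[pair[0]] / total
    p_y = unigrams[pair[1]] / total
    return math.log2(p_xy / (p_x * p_y))


# Toy token stream (hypothetical data).
tokens = ["renal", "failure", "is", "a", "renal", "failure", "risk"]
unigrams, pairs = cooccurrence_counts(tokens, window=2)
score = pmi(("renal", "failure"), unigrams, pairs, total=len(tokens))
```

In a filtering step, candidate terms whose score falls below a chosen threshold would be discarded; the threshold itself would have to be tuned on the corpus.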

In our research, we focus on specialized corpora (medical, biology, chemistry, etc.). This kind of textual document is characterized by a terminology that reflects the specialized language of the field considered. Specialized language is rich in scientific and technical terms, which makes these terms more visible and accessible, so that no expert intervention is required to identify them.

The remainder of this paper is structured as follows. Section 2 presents the main approaches to term extraction from text documents. In Section 3, we introduce our approach to complex terminology extraction, present the features used to model different linguistic observations, and describe the theoretical principle of our statistical filter. Section 4 is devoted to the performance tests of our approach. Our experimental study was carried out on the standard test database MEDLARS, and the results were compared with those of other powerful models.
