Machine Learning Approach for Multi-Layered Detection of Chemical Named Entities in Text

Usha B. Biradar (Molecular Connections Pvt Ltd., Bangalore, India), Harsha Gurulingappa (Molecular Connections Pvt Ltd., Bangalore, India), Lokanath Khamari (Molecular Connections Pvt Ltd., Bangalore, India) and Shashikala Giriyan (Molecular Connections Pvt Ltd., Bangalore, India)
DOI: 10.4018/IJSSCI.2016010101


Identifying chemical named entities in text and linking that information to biological events is of immense value in meeting the knowledge needs of pharmaceutical and chemical R&D. Over the past decade, a significant amount of research has addressed the identification of chemical named entities at the morphological level. However, a barrier remains in the value this offers to scientists at the chemistry level. The work described here therefore aims to overcome this information barrier by adapting a Conditional Random Fields-based approach to identify chemical named entities at several levels, namely the generic chemical level, the morphological level, and the chemistry level. Substantial effort has been invested in generating suitable multi-level annotated corpora. Recommended machine learning practices, such as active learning-based training corpus generation and feature optimization, have been applied systematically. Evaluation of system performance and benchmarking against other state-of-the-art approaches showed improved results.

In today’s era of big data, the scientific discovery process depends largely on the integration, management and extraction of useful data from the available literature (Borkum & Frey, 2014). The information extracted by text mining in the chemical literature consists mainly of named entities. Mining chemical named entities aims to extract information on unique chemicals, identify the extracted chemicals by indexing them to databases and bibliographic sources, and assign and verify relationships between chemical entities and biological processes, diseases, etc. (Eltyeb & Salim, 2014; Banville, 2006; Batchelor & Corbett, 2007).

Machine Learning (ML), the automation of processes attributed to human intelligence, in particular learning, making decisions and solving problems based on learning outcomes (Russell et al., 1995; Bottou, 2014), provides tailor-made solutions for the task of named entity recognition. Recently, Conditional Random Fields (CRFs), a class of probabilistic ML methods, have contributed to major successes in Chemical Named Entity Recognition (CNER) (Klinger et al., 2008). Ambiguity in the representation of chemical entities is perhaps the most prevalent limitation of text mining applications in the chemical literature, alongside others such as the limited availability of open text corpora and the growing number of chemicals (Townsend et al., 2005; Gurulingappa et al., 2013). Figure 1 illustrates why named entity recognition is a necessary first step in enabling knowledge discovery in the chemical scientific literature.

Figure 1. Different representations of the chemical named entity ‘ethanol’

Despite the extensive work on applying various approaches to chemical named entity recognition, most efforts have concentrated on identifying chemical names at the generic level (e.g. chemical versus non-chemical) or the morphological level (e.g. trivial name, IUPAC, abbreviation, formula or chemical class). To the best of the authors’ knowledge, no effort has been made to identify chemical names at the chemistry level, such as organic, inorganic, organometallic, drug, macromolecule and so forth. The primary reason is that generating annotated corpora is extremely labor-intensive, and annotating corpora with multi-level information, including chemistry information, requires additional effort from domain experts.
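The three annotation levels can be pictured as parallel tag sequences over the same tokens. The following is a minimal sketch only, assuming a BIO tagging scheme (the preview does not state which scheme the authors used). The class name ChemicalMention, the function bio_layers, the label spellings, and the sample sentence with its label assignments are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

# Label inventories follow the examples named in the text; the exact spellings
# are assumptions and not the authors' annotation schema.
MORPHOLOGICAL = {"TRIVIAL", "IUPAC", "ABBREVIATION", "FORMULA", "CLASS"}
CHEMISTRY = {"ORGANIC", "INORGANIC", "ORGANOMETALLIC", "DRUG", "MACROMOLECULE"}


@dataclass(frozen=True)
class ChemicalMention:
    """One annotated mention (hypothetical structure)."""
    start: int           # index of the first token of the mention
    end: int             # index one past the last token of the mention
    morphological: str   # morphological-level label
    chemistry: str       # chemistry-level label

    def __post_init__(self):
        if self.morphological not in MORPHOLOGICAL:
            raise ValueError(f"unknown morphological label: {self.morphological}")
        if self.chemistry not in CHEMISTRY:
            raise ValueError(f"unknown chemistry label: {self.chemistry}")


def bio_layers(tokens, mentions):
    """Return one BIO tag sequence per annotation level."""
    names = ("generic", "morphological", "chemistry")
    layers = {name: ["O"] * len(tokens) for name in names}
    for m in mentions:
        labels = {
            "generic": "CHEMICAL",  # generic level: chemical vs. non-chemical
            "morphological": m.morphological,
            "chemistry": m.chemistry,
        }
        for name, label in labels.items():
            for i in range(m.start, m.end):
                prefix = "B" if i == m.start else "I"
                layers[name][i] = f"{prefix}-{label}"
    return layers


# Illustrative sentence; the label assignments are assumptions.
tokens = ["Aspirin", "and", "sodium", "chloride", "were", "dissolved", "."]
mentions = [
    ChemicalMention(0, 1, "TRIVIAL", "DRUG"),
    ChemicalMention(2, 4, "IUPAC", "INORGANIC"),
]
layers = bio_layers(tokens, mentions)
for name, tags in layers.items():
    print(name, tags)
```

Keeping the levels as separate sequences over shared token offsets is one possible design; a sequence labeller could then be trained per level, or on combined labels, though which option the authors chose is not stated in this preview.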

This work combines the efforts of chemistry experts in generating a suitable multi-level labelled corpus and of machine learning experts in designing and developing a CRF-based system. The following sections describe the methods used for corpus generation and annotation, the training and evaluation of a classification model, and the benchmarking of results against other state-of-the-art approaches.
