Serialized Co-Training-Based Recognition of Medicine Names for Patent Mining and Retrieval

Na Deng, Caiquan Xiong
Copyright: © 2020 |Pages: 21
DOI: 10.4018/IJDWM.2020070105

Abstract

In the retrieval and mining of traditional Chinese medicine (TCM) patents, a key step is Chinese word segmentation and named entity recognition. However, the alias phenomenon of traditional Chinese medicines poses great challenges for Chinese word segmentation and named entity recognition in TCM patents, which directly degrades the effectiveness of patent mining. Because no comprehensive thesaurus of Chinese herbal medicine names exists, traditional thesaurus-based Chinese word segmentation and named entity recognition are not suitable for medicine identification in TCM patents. To address this, a modified and serialized co-training method is proposed that uses the linguistic and structural characteristics of TCM patent texts to recognize medicine names in TCM patent abstracts. Experiments show that this method can maintain high accuracy at relatively low time complexity. In addition, the method can be extended to the recognition of other named entities in TCM patents, such as disease names and preparation methods.

Introduction

In China, the culture of Traditional Chinese Medicine (TCM) has a history of thousands of years. The mild properties and powerful effects of traditional Chinese medicines have made people around the world increasingly inclined to use them to treat chronic and miscellaneous diseases. Because TCM patents are the carrier that records the most advanced inventions in natural medicine, the number of TCM patent applications is increasing constantly.

In the era of big data, the accumulated TCM patents contain abundant economic, legal and medical information. The retrieval, analysis and mining of TCM patents can help to exploit this latent information and provide important decision support for medicine researchers and pharmaceutical enterprises. For example, retrieval of TCM patents can inspire researchers, reveal technology gaps and help determine the R&D objectives of medicine innovation; similarity analysis of TCM patents can help researchers avoid infringement risks; hot technology forecasting and value analysis of TCM patents can guide pharmaceutical enterprises in purchasing valuable patents.

Analysis, mining and retrieval of TCM patents is a subtopic of natural language processing. A key step is Chinese word segmentation and named entity recognition. Medicine names are a common and important kind of named entity in TCM patent texts, but identifying them involves the following difficulties:

  1. The alias phenomenon of traditional Chinese medicines is especially common. Because of the great variety of traditional Chinese medicines, their complex sources and their extensive production areas, and under the influence of transcription errors, regional dialects and usage habits over thousands of years of inheritance, one herbal medicine often has several different names, and one name often corresponds to several different medicines. For example, Ginseng has more than 10 aliases, such as “Jilinshen”, “Yishanshen”, “Shizhushen”, “Shencao”, “Dijing”, “Tujing”, etc.; the herbal medicines “Sanqi”, “Tudahuang”, “Hutoujiao” and “Diburong” share the common alias “Jinbuhuan”. At present, there is still no comprehensive thesaurus of Chinese herbal medicine names;

  2. Because no comprehensive thesaurus of Chinese herbal medicine names is available, mainstream Chinese word segmentation systems often segment Chinese herbal medicine names incorrectly. For example, “Niuxixi” (the Chinese Pinyin of a herbal name; the same convention is used hereafter) is mistakenly segmented into “Niu\xixi”, and “Maihu” is mistakenly segmented into “Mai\hu”.
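The segmentation failure described in the second difficulty can be illustrated with a minimal sketch. It assumes a dictionary-based forward maximum matching segmenter, which is only one classic approach and not necessarily what any particular mainstream system uses. Pinyin strings stand in for Chinese characters, and the function name `forward_max_match` and the toy lexicons are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy longest-match segmentation against a lexicon.

    At each position the longest lexicon entry is taken; if nothing
    matches, a single character is emitted as its own token.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy general-purpose lexicon (assumption): the herbal name is missing.
general_lexicon = {"niu", "xixi"}
# Same lexicon after adding the medicine name (assumption).
extended_lexicon = general_lexicon | {"niuxixi"}

print(forward_max_match("niuxixi", general_lexicon))
print(forward_max_match("niuxixi", extended_lexicon))
```

Without the herbal name in the lexicon, the sketch splits the string into two shorter entries, mirroring the “Niu\xixi” error; once the name is added, it is kept whole. This is why the absence of a comprehensive thesaurus limits dictionary-driven segmentation.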

For TCM patent retrieval, recall rate is more important than accuracy in most cases. This is because overlooking relevant patents during a search can lead to patent infringement, which may not only cause high litigation costs, but also lead to duplicate research and a waste of human and material resources. For example, if we can correctly identify traditional Chinese medicine names such as “Ginseng”, “Jilinshen”, “Yishanshen”, “Shizhushen”, “Shencao”, “Dijing”, “Tujing”, etc., and capture the alias relationships between them, then when the search term is “Ginseng”, the retrieval system can also return the patents containing those aliases, which will greatly improve the recall rate and reduce the risk of missed patents.
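The alias-aware retrieval described above can be sketched as simple query expansion. This is a minimal illustration under stated assumptions, not the method of this paper: the alias table is a partial, hand-written list taken from the examples in the text, the patent snippets are invented, and the names `ALIASES`, `build_reverse_index`, `expand_query` and `search` are hypothetical.

```python
from collections import defaultdict

# Partial alias table built from the examples in the text (assumption:
# the first name in each entry is treated as the canonical name).
ALIASES = {
    "Ginseng": ["Jilinshen", "Yishanshen", "Shizhushen",
                "Shencao", "Dijing", "Tujing"],
    "Sanqi": ["Jinbuhuan"],
    "Tudahuang": ["Jinbuhuan"],
    "Hutoujiao": ["Jinbuhuan"],
    "Diburong": ["Jinbuhuan"],
}


def build_reverse_index(aliases):
    """Map every name (canonical or alias) to the canonical names it denotes."""
    name_to_canonical = defaultdict(set)
    for canonical, names in aliases.items():
        name_to_canonical[canonical].add(canonical)
        for name in names:
            name_to_canonical[name].add(canonical)
    return name_to_canonical


def expand_query(term, aliases, name_to_canonical):
    """Return the term plus every name of every medicine it may denote."""
    expanded = {term}
    for canonical in name_to_canonical.get(term, ()):
        expanded.add(canonical)
        expanded.update(aliases[canonical])
    return sorted(expanded)


def search(term, documents, aliases, name_to_canonical):
    """Return ids of documents that mention the term or any expanded name."""
    terms = expand_query(term, aliases, name_to_canonical)
    return [doc_id for doc_id, text in documents.items()
            if any(t in text for t in terms)]


# Invented patent snippets, for illustration only.
documents = {
    "P1": "A preparation containing Jilinshen and honey",
    "P2": "A decoction of Shencao",
    "P3": "A tablet with Sanqi",
}

index = build_reverse_index(ALIASES)
print(search("Ginseng", documents, ALIASES, index))  # ['P1', 'P2']
print(expand_query("Sanqi", ALIASES, index))         # ['Jinbuhuan', 'Sanqi']
```

A literal search for “Ginseng” would match none of the three snippets, whereas the expanded query finds the two that use aliases. Because “Jinbuhuan” maps to several distinct medicines, expanding it returns all of them, which raises recall at the cost of possible false matches.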

Similarly, for similarity calculation, recommendation, classification and clustering of TCM patents, if all the Chinese medicine names are identified and their alias relationships are considered, the accuracy of these various patent mining tasks will be greatly improved.

Because no comprehensive thesaurus of Chinese herbal medicine names exists, traditional Chinese word segmentation and named entity recognition are not suitable for TCM patent texts. To address this, this paper proposes a method for identifying medicine names in the abstract texts of TCM patents. The method uses the linguistic and structural characteristics of TCM patent texts and modifies the co-training method from machine learning, with the aim of improving the accuracy of patent mining and retrieval.

In view of the rich economic, legal and technological information hidden in them, patents have gradually become the objects of retrieval, analysis and mining in recent years.
