Extract Clinical Lab Tests From Electronic Hospital Records Through Featured Transformer Model

Lucy M. Lu, Richard S. Segall
DOI: 10.4018/IJPHIMT.336529

Abstract

Natural language, as a rich source of information, has served as the foundation for product reviews, demographic trend analysis, and domain-specific knowledge bases. The challenge in extracting entities from text is that free text is so sparse that missing features are common, which leaves the training process incomplete. Building on the attention mechanism in deep learning architectures, the authors propose a featured transformer model (FTM) that adds category information to the inputs to overcome the missing-feature issue. When the attention mechanism performs Markov-like updates in the deep learning architecture, the importance of a category reflects how frequently it connects to other entities and categories, and it is compatible with the importance of the entity in decision-making. The authors evaluate the performance of FTM and compare it with several other machine learning models; FTM overcomes the missing-feature issue and outperforms the other models.

1. Introduction

Natural language is the most common means of conveying information among different people, organizations, and subjects; text has therefore become a rich source of information. However, with the development of information technology, the volume of documentation has grown beyond what humans can handle manually. We need to exploit the capacity of computers to automatically understand documents and to group them into categories, topics, and subjects so that the total amount of information becomes manageable for humans.

Text mining can be applied to question answering, spam detection, semantic analysis, news categorization, and content classification, to name a few uses. Depending on the application, text data can come from different sources, such as web pages, emails, chats, social media, tickets, product descriptions, invoices, insurance claims, user reviews, and so on. Due to the unstructured nature of the medium, it is challenging and time-consuming to extract information from text.

However, many open issues and challenges remain in text mining, such as synonyms, long-range dependencies, and multiple interacting features. Beyond these challenges from the perspective of semantic analysis, there are also issues of data quality, such as missing features, scarce samples, and imbalanced classes. When such issues can be solved only through data modeling, not through additional data collection, problem-specific data processing must be built into the algorithms to overcome the limits of data quality. Another challenge is that the integration of domain knowledge can play an important role in text mining: domain knowledge can help speed up text processing and increase the precision of the results. Domain-specific knowledge extraction requires semantic analysis to extract the associations between objects or concepts in the documentation, and making such semantic analysis efficient and scalable remains challenging.

Feature engineering is an important step in machine learning. Typical text classification uses machine learning to perform natural language processing (NLP) and to assign labels or tags to textual units such as terms, sentences, paragraphs, documents, and queries. Machine learning-based methods normally perform classification in two steps: first selecting the features of interest, then feeding those features into a classifier to make predictions. Because the feature set acts as a shortcut to the context in the training set, it needs to be complete so that the trained model can retrieve results from new data; when the feature set is incomplete, the trained model can retrieve only partial results. Unlike traditional machine learning, deep learning trains word embeddings as the starting point of classification, and feature engineering happens during model training as feature weights are adjusted. However, the feature set still needs to be complete.
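The two-step pipeline can be made concrete with a short sketch. The example below uses scikit-learn's TfidfVectorizer and LogisticRegression as generic stand-ins for the feature selector and the classifier; the toy clinical snippets and labels are hypothetical, not data from this study. Note how a document containing only unseen words reduces to an empty feature vector, which is exactly the incomplete-feature problem described above.

```python
# A minimal sketch of two-step text classification: features first, classifier second.
# The documents and labels are hypothetical toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "hemoglobin a1c measured at 6.1 percent",
    "patient scheduled for follow-up visit",
    "serum creatinine within normal range",
    "discharge summary dictated by attending",
]
labels = [1, 0, 1, 0]  # 1 = mentions a lab test, 0 = does not

# Step 1: feature engineering -- map raw text onto a fixed feature set.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Step 2: feed the features into a classifier.
clf = LogisticRegression().fit(X, labels)

# Words never seen in training are silently dropped at transform time,
# so this vector is all zeros: the trained model can give only a partial answer.
X_new = vectorizer.transform(["troponin level elevated"])
print(clf.predict(X_new))
```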

In terms of data modeling, text classification methods fall into two categories: those based on maximum likelihood and those based on minimum energy. Maximum likelihood-based methods include Naive Bayes and Support Vector Machines (SVM); energy-based methods include the Hidden Markov Model (HMM) and Conditional Random Fields (CRF). The two categories differ not only in technical detail but also in how many language patterns they can model. Maximum likelihood-based methods normally treat words as independent tokens and use a Bag of Words (BoW) representation to build the sample set. Minimum energy-based methods can fit models not only to individual words but also to the associations between words, which makes semantic analysis possible. Deep learning is a separate architecture that trains word embeddings, with the classification layer as the last layer of the architecture.
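A small illustration of the BoW independence assumption, using scikit-learn's CountVectorizer with two made-up sentences: because BoW discards word order, the two sentences below, which report opposite findings, receive identical feature vectors, so a maximum likelihood classifier such as Naive Bayes cannot distinguish them. Sequence models such as HMM or CRF, which score associations between neighboring words, can.

```python
# BoW treats words as independent tokens, so word order is lost.
# The two example sentences are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([
    "glucose high sodium normal",
    "sodium high glucose normal",
])

# Same multiset of tokens -> identical BoW vectors, despite opposite meanings.
print(np.array_equal(X[0].toarray(), X[1].toarray()))  # True
```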

Deep learning architectures are built upon neural networks. Neural approaches have the advantage of overcoming the limitations of manual feature engineering. Word embeddings convert input texts into an importance vector in which some words carry higher significance and others lower; words with higher significance can thus contribute more to the classification process, and words with lower significance contribute less. Reducing the number of dimensions is optional, but once the word embeddings are built, the feature engineering is done.
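As a rough sketch of how such an importance vector can arise, the snippet below implements plain scaled dot-product self-attention over random embeddings in NumPy. It illustrates the general mechanism only, not the FTM architecture itself; all shapes and values are arbitrary.

```python
# A minimal sketch: importance weights reweighting word embeddings.
# Random embeddings stand in for trained ones; this is not the FTM model.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))   # 5 words, 8-dimensional embeddings

# Scaled dot-product self-attention scores, softmax-normalized per row.
scores = embeddings @ embeddings.T / np.sqrt(embeddings.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each word's new representation is an importance-weighted mixture:
# higher-weight words contribute more to the contextual representation.
contextual = weights @ embeddings
print(contextual.shape)  # (5, 8)
```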
