Patient Data De-Identification: A Conditional Random-Field-Based Supervised Approach

Shweta Yadav (Indian Institute of Technology Patna, India), Asif Ekbal (Indian Institute of Technology Patna, India), Sriparna Saha (Indian Institute of Technology Patna, India), Parth S. Pathak (ezDI, LLC, India) and Pushpak Bhattacharyya (Indian Institute of Technology Patna, India)
Copyright: © 2017 | Pages: 20
DOI: 10.4018/978-1-5225-2498-4.ch011


With the rapid growth of clinical text, de-identification of patient Protected Health Information (PHI) has drawn significant attention in the recent past. The task aims to automatically identify and remove patient PHI from medical records. This chapter proposes a supervised machine learning technique for patient data de-identification. We provide insight into the de-identification task, its major challenges, techniques to address those challenges, a detailed analysis of the results, and directions for future improvement. We extract several features by studying the properties of the datasets and the domain. We build our model on the data of the 2014 i2b2 (Informatics for Integrating Biology to the Bedside) de-identification challenge. Experiments show that the proposed system is highly accurate in de-identifying medical records. The system achieves a final recall, precision, and F-score of 95.69%, 99.31%, and 97.46%, respectively.
Chapter Preview


With the start of a golden era in medical interpretation, the amount of information in the clinical domain is increasing at a rapid rate. In the past decade, the development of health information technology and health data documentation has brought progress in how health care is delivered (Berner et al., 2005).

With the widespread use of health information technology, the fast adoption of electronic clinical records, and the conversion of narrative data to electronic form, the volume of clinical data has grown rapidly. This information can be improved further by minimizing medical errors, which requires sophisticated tools for Medical Language Processing (MLP). Most medical records are narrative text produced by transcription of dictations, direct entry by providers, or speech recognition applications. However, organizations and researchers cannot freely use records in this form, because they contain a substantial amount of personal health information, or protected health information (PHI). According to the Health Insurance Portability and Accountability Act (HIPAA) of 1996, PHI terms must be concealed and protected. This has led to the de-identification problem. Paragraph 164.514 of the Administrative Simplification Regulations promulgated under HIPAA states that for data to be treated as de-identified, it must clear one of two hurdles (HIPAA Act, 1996):

  1. An expert must determine and document “that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.”

  2. Or, the data must be purged of a specified list of seventeen categories of possible identifiers relating to the patient or relatives, household members and employers, and any other information that may make it possible to identify the individual.

Studies have shown that requesting patient consent significantly reduces the participation rate, and obtaining consent is infeasible for a large population. Even when a patient grants permission, documents must be tracked to prevent unauthorized disclosure. These problems of consent, waiver, and tracking can be handled effectively if patient personal health information is properly de-identified, which in turn facilitates clinical NLP research (Wolf & Bennett, 2006).

The de-identification task is defined more specifically as the step in which private information is removed or replaced while the rest of the record is kept as it is (Stubbs et al., 2015). De-identification is a form of the traditional named entity recognition (NER) problem, in which each term must be classified as a PHI type or as non-PHI. As pointed out earlier, the main aim of the de-identification challenge is to remove PHI terms while maintaining data integrity as much as possible. Every record is enclosed in RECORD_ tags and is assigned a unique, randomly generated ID. A TEXT_ tag encloses the text of each record. Each PHI instance is enclosed within PHI_ tags, and the PHI TYPE represents the category of the PHI term. Figure 1 shows a sample discharge summary excerpt from the training dataset, where the goal is to identify the PHI (protected health information) terms. In this summary, some of the PHI terms are the doctor's name (“Dr. Do Little”), the patient's name (“John Doe”), and hospital names (“ABHG”, “SBHG”).

Figure 1. Sample discharge summary excerpt
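As a rough illustration of the annotation scheme described above, the following Python sketch extracts annotated PHI spans and replaces each one with a placeholder naming its category. Because the figure is not reproduced here, the exact markup is an assumption: the sketch supposes XML-like tags of the form `<PHI TYPE="...">...</PHI>`, and the type names `PATIENT`, `DOCTOR`, and `HOSPITAL`, the function names, and the sample sentence are all hypothetical. It works from gold annotations and is not the CRF-based system proposed in the chapter.

```python
import re

# Assumed markup for illustration only: <PHI TYPE="CATEGORY">surface text</PHI>.
PHI_PATTERN = re.compile(r'<PHI TYPE="([^"]+)">(.*?)</PHI>', re.DOTALL)


def extract_phi(text):
    """Return (phi_type, surface_text) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in PHI_PATTERN.finditer(text)]


def deidentify(text):
    """Replace each annotated PHI span with a placeholder naming its category."""
    return PHI_PATTERN.sub(lambda m: f"[{m.group(1)}]", text)


# Hypothetical sentence built from the names mentioned in the text.
sample = (
    'Patient <PHI TYPE="PATIENT">John Doe</PHI> was seen by '
    '<PHI TYPE="DOCTOR">Dr. Do Little</PHI> at <PHI TYPE="HOSPITAL">ABHG</PHI>.'
)

print(extract_phi(sample))
print(deidentify(sample))
```

Replacing a span with its category label, rather than deleting it, keeps the surrounding sentence readable, which is one simple way to preserve data integrity in the sense discussed above.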


Key Terms in this Chapter

Health Insurance Portability and Accountability Act (HIPAA): An act passed in 1996 governing the secure electronic transmission and use of private health information.

Support Vector Machine (SVM): A supervised learning model used for classification that learns from the data a decision boundary maximizing the margin between classes.

Machine Learning: A field of computer science that develops algorithms which learn from data and make predictions on the basis of that learning.

Medical Natural Language Processing: A domain of natural language processing that focuses on text mining of medical records such as clinical texts.

Conditional Random Field (CRF): An undirected probabilistic graphical model, a statistical modeling method used for structured prediction in machine learning and pattern recognition.

Classification: In machine learning, a supervised learning technique in which the problem is to identify the class of a new observation using a model built from previously labeled observations.

Natural Language Processing (NLP): A domain of computer science, computational linguistics, and artificial intelligence that focuses on enabling computers to understand human (natural) languages.

Named Entity Recognition (NER): An entity extraction task that aims to identify spans of text carrying relevant information and classify them into predefined categories.
