Unstructured Healthcare Data Archiving and Retrieval Using Hadoop and Drill

Hang Yue (Johns Hopkins Healthcare LLC, Baltimore, USA)
Copyright: © 2018 | Pages: 17
DOI: 10.4018/IJBDAH.2018070103


This article analyzes a hybrid Hadoop ecosystem for archiving unstructured healthcare data. The ecosystem is composed of components such as Pig, Hive, Sqoop, ZooKeeper, the Hadoop Distributed File System (HDFS), MapReduce and HBase. Apache Drill is applied for unstructured healthcare data retrieval. The article discusses the combination of Hadoop and Drill for data analysis applications. Based on the analysis of the Hadoop components (including HBase design) and case studies of Drill query design for different kinds of unstructured healthcare data, the Hadoop ecosystem and Drill are shown to be valid tools for integrating and accessing voluminous, complex healthcare data. They can improve healthcare systems, achieve savings on patient care costs, optimize the healthcare supply chain and infer useful knowledge from noisy and heterogeneous healthcare data sources.

1. Introduction

Today’s data volume doubles every two years, and semi-structured or self-describing data accounts for more than 80% of the data collected by organizations, including the healthcare industry (Holzinger & Pasi, Eds., 2013). Healthcare data comes from rich sources in diverse types, and new data is generated continuously and rapidly. In particular, the generation of semi-structured or self-describing data has grown exponentially since 2010. Moreover, Electronic Health Records (EHRs) are often integrated from multiple data sources or different Health Information Technology (HIT) systems. These data sources include biochips, genomic data, biometric data, clinical data, claims, geo-based biosensor information, environmental factors or risks, and so on.

Due to the rapid worldwide development of communication and information technology, the digital revolution has created an E-Health Era for communication between doctors and patients (Jonathan P Weiner, 2012). In the E-Health Era, electronic devices mediate the interaction between physicians and patients in healthcare delivery, so the face-to-face contact that traditionally characterized the doctor/patient relationship becomes less common. The digital practice milieu surrounds both the provider and the patient in advanced healthcare systems. Devices and information systems that support the doctor/patient interaction include claim management and health IT systems, as well as biometric, telemedicine and remote patient monitoring systems. These systems also store different types of communication data, including EHRs and Personal Health Records (PHRs).

In the past, a Relational Database Management System (RDBMS) or an Enterprise Data Warehouse (EDW) was the default choice for data analysis. However, the use of non-relational (NoSQL) databases, such as HBase and MongoDB, is growing rapidly because they can overcome the cost and time challenges of big data analysis. Compared with an RDBMS or EDW, NoSQL databases handle structured, semi-structured and unstructured data at terabyte or petabyte scale much better, and within short release cycles (Tugdual Grallv, 2016).

Big data has the 4-V features (volume, variety, velocity and veracity), and big data analysis has the 5-M features (measure, mapping, method, meaning and matching). First, big data has a large volume (e.g. digital healthcare data is expected to reach 25,000 petabytes in 2020). Second, big data covers a variety of data types (e.g. medical sensor data, clinician notes, epidemiology and behavior data, and social networks are all useful healthcare data sources). Third, velocity is determined by the timeliness of data creation (e.g. 2 million claims are produced daily, or HIT systems archive claims into databases in real time). Finally, veracity means that a valid big data source should have good accuracy and completeness. Data sources with large biases, noise or errors should not be viewed as acceptable, even if they are large and contain various data types (Gandomi & Haider, 2015; Courtney, 2013).
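The veracity criterion can be made concrete with a small completeness check. The following Python sketch is illustrative only and is not taken from the article: the function names, the sample claim records and the 0.9 acceptance threshold are all hypothetical assumptions. It scores a batch of self-describing records by the fraction of required fields that are actually filled in.

```python
from typing import Any


def completeness(records: list[dict[str, Any]], required: list[str]) -> float:
    """Fraction of required fields that are present and non-empty across all records."""
    if not records or not required:
        return 0.0
    filled = sum(
        1
        for rec in records
        for field in required
        # A field counts as filled only if it exists and is neither None nor "".
        if rec.get(field) not in (None, "")
    )
    return filled / (len(records) * len(required))


def is_acceptable(
    records: list[dict[str, Any]], required: list[str], threshold: float = 0.9
) -> bool:
    """Accept a data source only if its completeness meets the (assumed) threshold."""
    return completeness(records, required) >= threshold


# Hypothetical claim records with missing and empty fields.
claims = [
    {"claim_id": "C1", "patient_id": "P1", "amount": 120.0},
    {"claim_id": "C2", "patient_id": "", "amount": 75.5},
    {"claim_id": "C3", "amount": None},
]
required_fields = ["claim_id", "patient_id", "amount"]
score = completeness(claims, required_fields)
accepted = is_acceptable(claims, required_fields)
```

Completeness is only one facet of veracity. Accuracy and bias would need separate checks against a trusted reference, which this sketch does not attempt.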

Among the 5-M features, data measure refers to the 4-V features above, and data mapping means data integration and interoperability in data analysis. Analytic methods for big data include data exploration, prediction, modeling, simulation and visualization. Big data analysis is also a process of knowledge discovery that derives meaning or implications from the analytic results (Kharrazi, 2013). For instance, data analyses support evidence-based medical research, and their conclusions help shape clinical practice guidelines. Finally, the outcomes of big data analyses should match health goals such as the Triple Aim and the Affordable Care Act.
