1. Introduction
Personalized medicine (PM) refers to the individualization of medical treatments based on the unique dataset of each patient. It generates and exploits stored patient data, which are often captured digitally in an “Electronic Health Record (EHR)” comprising the profiles of many different patients. An EHR is a long-standing, comprehensive health database that stores and manages all patient data files digitally under the custody of a licensed health entity. More specifically, it provides a digital view of the patient’s demographics, clinical and medication history, diagnostic trajectory, social and economic environment, any geographical relocation, and the patient’s genetic data, where these exist (Jensen et al., 2012).
Taken together, the massive data resource available via the EHR often includes homogeneous and heterogeneous data in structured, semi-structured and unstructured forms, as well as both temporal and non-temporal data. Mixed into this large pool of patient data are many recorded medical events for individual patients, such as body temperature measurements, blood pressure readings and other time-series information, in many different forms. Following Ghazi (2015), we consider time-series data to include all observational sequences captured for a patient in relation to a medical event. Moreover, the EHR data resource contains a great deal of hidden information and knowledge waiting to be mined and discovered. Reporting, evaluation, and medical decision-making based on EHR data involve extracting relevant information and knowledge via specialized methods known as data mining techniques. The quality of information processing and knowledge discovery is thus directly linked to the availability, accessibility, type and form of the data to be extracted and aggregated for analysis.
The objective of our work is to produce a high-fidelity model for the representation of PM structured data. This is a challenging problem, and our proposed model addresses several important scientific gaps: data heterogeneity, loss of data during data transformation, and interpretability of the representation over the course of a data mining process. To accomplish this non-trivial task, we represent the data in two parts. The first is dedicated to the representation of numeric data with clustering techniques, whereas the second considers the representation of nominal data with respect to its dispersion. These two bodies of information are then joined into a single global representation table.
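To make the two-part idea concrete, the following minimal Python sketch shows one way such a representation could be assembled. It is not the model detailed in Section 3. Several choices are assumptions made purely for illustration: a tiny one-dimensional k-means stands in for the "clustering techniques", normalized Shannon entropy stands in for "dispersion", and the function names (`kmeans_1d`, `nominal_dispersion`, `build_table`), the record layout and the sample values are all hypothetical.

```python
from collections import Counter
from math import log2
from statistics import mean


def kmeans_1d(values, k=2, iters=50):
    """Cluster scalar values with a tiny 1-D k-means (illustrative choice)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    if k == 1:
        centroids = [mean(ordered)]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    def assign(cs):
        # Index of the nearest centroid for every value.
        return [min(range(len(cs)), key=lambda c: abs(v - cs[c])) for v in values]

    for _ in range(iters):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    centroids = sorted(centroids)  # label 0 is always the lowest level
    return assign(centroids), centroids


def nominal_dispersion(symbols):
    """Normalized Shannon entropy in [0, 1]; an assumed dispersion measure."""
    counts = Counter(symbols)
    if len(counts) < 2:
        return 0.0  # zero or one category: no dispersion
    n = len(symbols)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return entropy / log2(len(counts))


def build_table(records, k=2):
    """Join the numeric and nominal parts into one row per patient."""
    pooled = [v for rec in records.values() for v in rec["numeric"]]
    labels, centroids = kmeans_1d(pooled, k)
    table, start = [], 0
    for patient_id, rec in records.items():
        end = start + len(rec["numeric"])
        table.append({
            "patient": patient_id,
            "numeric_clusters": labels[start:end],  # series as cluster labels
            "nominal_dispersion": nominal_dispersion(rec["nominal"]),
        })
        start = end
    return table, centroids


# Hypothetical sample data: body temperatures (numeric) and drug names (nominal).
records = {
    "p1": {"numeric": [36.6, 36.8, 39.1, 39.4],
           "nominal": ["aspirin", "aspirin", "ibuprofen", "ibuprofen"]},
    "p2": {"numeric": [36.5, 36.7, 36.9],
           "nominal": ["aspirin", "aspirin", "aspirin"]},
}
table, centroids = build_table(records, k=2)
for row in table:
    print(row)
```

In this sketch the numeric values of all patients are pooled before clustering so that a given cluster label means the same level for every patient. Clustering each series separately would be an equally plausible reading of the text.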
Thanks to the simplicity of the obtained representation, healthcare specialists will be able to identify in the dataset both the key patient events and the variations in the information conveyed by the data series. The representation is also intended for use within automated medical decision-making processes such as disease prevention and/or the prediction of adverse drug events. Importantly, this paper emphasizes the need to explore the EHR data mining process that informs and challenges PM, which will ultimately enhance the ability of physicians and other care professionals to deliver personalized, high-quality care to affected individuals.
The rest of the paper is organized as follows. Section 2 explains the limitations of the time-series representation process. Section 3 details a novel data representation model proposed for PM, and Section 4 discusses the experimentation, the evaluation of the proposed model, and the analysis of the results. Finally, Section 5 provides concluding remarks and offers insights into practical implications and potential future work.