An Ensemble Random Forest Algorithm for Privacy Preserving Distributed Medical Data Mining

Musavir Hassan, Muheet Ahmed Butt, Majid Zaman
Copyright: © 2021 | Pages: 23
DOI: 10.4018/IJEHMC.20211101.oa8

Abstract

The widespread proliferation of Electronic Health Records (EHRs) is generating voluminous amounts of electronic data. Medical facilities have great potential to discern patterns in this data and use them to diagnose specific diseases or predict the outbreak of an epidemic. However, the discovered patterns may reveal sensitive information about individuals, and this information is vulnerable to misuse. Sharing such sensitive data is therefore challenging, because it compromises the privacy of patients. In this paper, a random forest-based distributed data mining approach is proposed. The performance of the proposed model is evaluated using accuracy, F-measure, and kappa statistics. Experimental results indicate that the proposed model is efficient and scalable in both performance and accuracy on imbalanced data, and that it maintains privacy by sharing only useful healthcare knowledge in the form of local models, without revealing or sharing the sensitive data itself.

1. Introduction

The age of big data has enabled many organizations to gather extensive volumes of information. In many real-world applications, the data required for crucial data mining tasks is distributed among several parties. To find useful patterns and discover knowledge that cannot be mined from the data of a single party, these parties must share data. Centralizing the data from all participating parties is infeasible because of high communication and computation costs, central storage requirements, security risks and, most importantly, privacy concerns. To overcome the drawbacks of a centralized system, efficient global models can be constructed collaboratively by the participants. Such collaboration is challenging, however, because it ordinarily requires participants to share data, which raises privacy concerns. Thus, various distributed data mining algorithms have been proposed in the literature to mine patterns from data held by different participants without revealing the original data.

Data distributed among different participants may have the same attributes at each participant location; such data is said to be horizontally partitioned. For example, medical data of patients who suffer from a common disease will have the same attributes at each medical facility. On the other hand, data belonging to a specific entity may be spread across participants such that each participant stores different attributes of the same entity. Such data is said to be vertically partitioned. For example, the medical data of a patient may be stored by a medical facility, whereas the medical bills, health cover information, etc. of the same patient may be stored by an insurance company. Various distributed privacy-preserving approaches, based on different machine learning algorithms, have been proposed in the literature to mine horizontally and vertically partitioned data. One such approach is to perform local data mining at the participant locations in parallel, producing local data models while the disjoint datasets remain at their respective locations. These local models are then transmitted to a central site that combines them into a global model (Myneni and Patel, 1999; Chawla et al., 2004; Tsoumakas, 2003). A second approach is to sub-sample the original data at each local site and accumulate the samples at a central site to form a global subset (Chawla et al., 2004). Another approach is to perturb the local data of the participants with the help of a third-party coordinator in order to preserve privacy. The perturbed data from each participant can then be published as a centralized database on which different data mining tasks are performed, as done by Sheela and Vijayalakshmi (2017). Distributed data mining algorithms that work in a fully decentralized manner have also been proposed in the literature. In these algorithms, the participants mine the shared data by using a message-passing mechanism.
Such algorithms are characterized by the distribution of data across participant sites and by asynchronous communication, which enables learning even when some participants are not available at a given time. Such algorithms should also be scalable, so that they continue to work as more participants, and therefore more data, are added to the system at a later time. An important consideration when using decentralized distributed data mining algorithms is to preserve the privacy of the data local to each participant. The above-mentioned techniques have potential weaknesses that may put the privacy of the data at risk. Moreover, the privacy-preserving methods used in these techniques have certain limitations, which are discussed in Hassan et al. (2017).
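
The first approach described above, in which local models are built at each site and combined at a central site, can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the paper: it assumes horizontally partitioned toy data, uses one-split trees (stumps) trained on bootstrap samples as a stand-in for full random forest trees, and combines the local models by simple majority voting. All names and values (`site_a`, `site_b`, `train_local_forest`, `global_predict`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy records, horizontally partitioned: both sites store the
# same two numeric attributes per patient plus a binary diagnosis label.
# The rows themselves never leave their site.
site_a = ([(1, 1), (2, 3), (3, 2), (7, 8), (8, 7), (9, 9)], [0, 0, 0, 1, 1, 1])
site_b = ([(2, 1), (1, 2), (3, 3), (8, 9), (9, 7), (7, 7)], [0, 0, 0, 1, 1, 1])


def train_stump(rows, labels, rng):
    """Fit a one-split tree (a stump) on a bootstrap sample of local data."""
    sample = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
    feature = rng.randrange(len(rows[0]))              # random attribute choice
    best = None
    for i in sample:
        threshold = rows[i][feature]
        for label_above in (0, 1):
            # Rule: predict label_above when value > threshold, else the other class.
            errors = sum(
                (label_above if rows[j][feature] > threshold else 1 - label_above)
                != labels[j]
                for j in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_above)
    return feature, best[1], best[2]


def predict_stump(stump, row):
    feature, threshold, label_above = stump
    return label_above if row[feature] > threshold else 1 - label_above


def train_local_forest(rows, labels, n_trees, seed):
    """Runs at a participant site; only the returned model is transmitted."""
    rng = random.Random(seed)
    return [train_stump(rows, labels, rng) for _ in range(n_trees)]


def global_predict(local_forests, row):
    """Runs at the central site: majority vote over every tree of every site."""
    votes = Counter(
        predict_stump(stump, row) for forest in local_forests for stump in forest
    )
    return votes.most_common(1)[0][0]


local_forests = [
    train_local_forest(*site_a, n_trees=5, seed=1),
    train_local_forest(*site_b, n_trees=5, seed=2),
]
print(global_predict(local_forests, (0, 0)), global_predict(local_forests, (10, 10)))
```

In this sketch the central site receives only `(feature, threshold, label)` triples, never patient rows. Note, however, that each threshold is itself a value drawn from local data, so even a shared model can reveal something about the underlying records; this is one reason the privacy properties of model-sharing schemes need careful analysis.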
