Privacy Preserving Feature Selection for Vertically Distributed Medical Data based on Genetic Algorithms and Naïve Bayes

Boudheb Tarik (EEDIS Laboratory, Djillali Liabes University, Sidi Bel Abbès, Algeria) and Elberrichi Zakaria (EEDIS Laboratory, Djillali Liabes University, Sidi Bel Abbès, Algeria)
Copyright: © 2018 | Pages: 22
DOI: 10.4018/IJISMD.2018070101


Machine learning is a powerful tool for mining useful knowledge from vast databases. Many establishments in the medical area, such as hospitals and laboratories, want to join their efforts in order to extract more accurate models. However, this approach faces two problems. First, because of laws protecting patient privacy and other similar concerns, parties are reluctant to share their data. Second, in vast amounts of data, which attributes are useful and pertinent for constructing accurate data mining models? In this article, the researchers address these challenges for vertically distributed medical data. They propose an original secure wrapper solution that performs feature selection based on genetic algorithms and distributed Naïve Bayes. Contrary to previous solutions, the original data is not perturbed, so data utility and performance are preserved. They prove that the proposed solution selects relevant attributes that increase performance while preserving patient privacy.
1. Introduction

Medicine has a special status in science, philosophy, and daily life. The outcomes of medical care can be a matter of life or death, and they apply to everybody. Medicine is a necessity, not merely an optional luxury, pleasure, or convenience. The only justification for collecting medical data is to benefit the individual patient (Cios & Moore, 2002). One of the major challenges in the medical domain today is how to exploit the vast amount of data that this field generates. Machine-learning approaches are required (Anguera, Barreiro, Lara & Lizcano, 2016), because they are capable of discovering useful knowledge for decision making in the medical field. Data mining holds great potential for the healthcare area. Some experts believe the opportunities to improve care and reduce costs concurrently could apply to as much as 30% of overall healthcare spending (Eliason & Crockett, 2017).

Nowadays, owing to progress in network and storage technologies, a wide variety of patient health records, such as diagnoses, blood analyses and radiology results, are collected. The data can be stored at different sites, since patients may visit several hospitals or laboratories during their lifetime. With the aim of building more accurate models, some organizations would like to collaborate and enhance their data mining process by using additional external information. With vertically distributed data, each party holds different attributes about the same patients, so a party can draw on external attributes held elsewhere. For example, a hospital that treats breast cancer may use external information about its patients, such as blood analysis, biopsy, radiology and MRI scan results.
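
The layout of vertically distributed data can be illustrated with a minimal sketch. The site names, patient identifiers, attribute names and values below are hypothetical and serve only to show that each site holds different columns for the same patients. The sketch builds the combined view in the clear purely for illustration; a privacy-preserving protocol would aim to avoid materializing such a view at any single site.

```python
# Hypothetical vertically partitioned records: every site keys its rows
# by the same patient ID but stores a different set of attributes.
hospital = {
    "p1": {"age": 54, "tumor_size_mm": 12},
    "p2": {"age": 61, "tumor_size_mm": 25},
}
laboratory = {
    "p1": {"hemoglobin": 13.1},
    "p2": {"hemoglobin": 11.4},
    "p3": {"hemoglobin": 14.0},  # patient unknown to the hospital
}


def joined_view(*sites):
    """Combine attribute columns for the patients common to all sites.

    Illustration only: this is the view that collaboration would give
    the miner, shown here without any privacy protection.
    """
    shared_ids = set.intersection(*(set(site) for site in sites))
    return {
        pid: {name: value for site in sites for name, value in site[pid].items()}
        for pid in sorted(shared_ids)
    }


combined = joined_view(hospital, laboratory)
```

Here `combined` contains only the patients known to both sites, each described by the union of the attributes the two sites hold.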

There is no guarantee that external information will enhance the data mining process; non-pertinent data can even decrease the performance of the model. Feature selection is a technique for selecting the relevant attributes in order to build more accurate data mining models. It reduces dimensionality, speeds up learning and improves model interpretability. There are three categories of feature selection methods: wrapper, filter and embedded methods. Wrapper methods treat the selection of a set of features as a search problem, in which different combinations are prepared, evaluated and compared with one another. A predictive model is used to evaluate each combination of features and assign it a score based on model accuracy (Brownlee, 2014). Filter methods apply a statistical measure to assign a score to each attribute. The features are ranked by score and then either kept or removed from the dataset. Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection (Kaushik, 2016).
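
The wrapper idea can be sketched in a few lines. The example below is a simplified, centralized illustration and not the secure distributed method proposed in the article: it scores each candidate feature subset with a small categorical Naïve Bayes classifier under leave-one-out evaluation, and it uses exhaustive search over subsets in place of a genetic algorithm to keep the example short and deterministic. The toy dataset and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

# Hypothetical toy data: (binary features, label). Feature 0 determines
# the label; features 1 and 2 carry no information about it.
ROWS = [
    ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 1), 1),
    ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 0, 0), 0), ((0, 1, 1), 0),
]


def train_nb(rows, subset):
    """Count class frequencies and per-class feature values for `subset`."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature, label) -> value counts
    for x, label in rows:
        for f in subset:
            value_counts[(f, label)][x[f]] += 1
    return class_counts, value_counts


def predict_nb(model, x, subset):
    """Return the label with the highest log posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        n = class_counts[label]
        score = math.log(n / total)
        for f in subset:
            # Laplace smoothing, assuming each feature has 2 possible values.
            score += math.log((value_counts[(f, label)][x[f]] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(rows, subset):
    """Leave-one-out accuracy of Naive Bayes restricted to `subset`."""
    hits = 0
    for i, (x, label) in enumerate(rows):
        model = train_nb(rows[:i] + rows[i + 1:], subset)
        hits += predict_nb(model, x, subset) == label
    return hits / len(rows)


def wrapper_select(rows, n_features):
    """Evaluate every non-empty subset and keep the best-scoring one.

    Smaller subsets are tried first and ties keep the earlier subset.
    """
    best_subset, best_acc = (), -1.0
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            acc = loo_accuracy(rows, subset)
            if acc > best_acc:
                best_subset, best_acc = subset, acc
    return best_subset, best_acc


best_subset, best_acc = wrapper_select(ROWS, 3)
```

Exhaustive search evaluates every non-empty subset, which grows exponentially with the number of attributes; this is the reason wrapper methods are usually paired with a heuristic search such as a genetic algorithm.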

Because of laws and other concerns that prohibit the disclosure of private information about individuals, organizations such as hospitals, clinics, and laboratories are reluctant to share their local data. However, the data mining process needs complete access to the data to construct accurate models. Thus, in the last decade, privacy preservation for sensitive data has become an important topic, and it must be incorporated into every data mining process. Privacy preserving feature selection has received great attention. Many solutions have been proposed for distributed data, but few of them use wrapper methods without perturbing the original data. The challenge with perturbation techniques is to find a good tradeoff between privacy and accuracy (Zhong & Wright, 2005). The more patients’ private information is protected, the less accurate the results the miner obtains; conversely, the more accurate the results, the less privacy patients have.
