1. Introduction
Medicine has a special status in science, philosophy, and daily life. The outcomes of medical care are life-or-death, and they apply to everybody. Medicine is a necessity, not merely an optional luxury, pleasure, or convenience. The only justification for collecting medical data is to benefit the individual patient (Cios & Moore, 2002). One of the major challenges in the medical domain today is how to exploit the vast amount of data that this field generates. Machine-learning approaches are required (Anguera, Barreiro, Lara & Lizcano, 2016): they can discover useful knowledge for decision making in the medical field. Data mining holds great potential for healthcare. Some experts believe the opportunities to improve care and reduce costs concurrently could apply to as much as 30% of overall healthcare spending (Eliason & Crockett, 2017).
Nowadays, thanks to progress in network and storage technologies, various patient health records, such as diagnoses, blood analyses, and radiology results, are collected. These data can be stored at different sites, since patients may visit different hospitals, laboratories, and other facilities during their lifetime. To build more accurate models, some organizations would like to collaborate, enhancing their data mining process with additional external information. In vertically distributed data, they use external attributes about the same patients. For example, a hospital that treats breast cancer may use external information about its patients, such as blood analyses, biopsies, radiology results, and MRI scans.
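The vertical partitioning described above can be sketched in a few lines of Python. All names and values below are hypothetical (they do not come from the article): each site holds a different set of attributes for the same patients, and records are aligned on a shared patient identifier.

```python
# Hypothetical data: the hospital holds diagnosis attributes ...
hospital = {
    "p1": {"tumor_size_mm": 14, "diagnosis": "benign"},
    "p2": {"tumor_size_mm": 31, "diagnosis": "malignant"},
}
# ... while an external laboratory holds blood-analysis attributes
# for the same patient identifiers (vertically partitioned data).
laboratory = {
    "p1": {"wbc_count": 6.1},
    "p2": {"wbc_count": 11.4},
}

def merge_vertical(*sites):
    """Join the attribute sets from several sites on the patient ID."""
    merged = {}
    for site in sites:
        for pid, attrs in site.items():
            merged.setdefault(pid, {}).update(attrs)
    return merged

combined = merge_vertical(hospital, laboratory)
print(combined["p2"])
# {'tumor_size_mm': 31, 'diagnosis': 'malignant', 'wbc_count': 11.4}
```

In practice this join is exactly what privacy constraints forbid: the sites cannot simply pool their records, which motivates the privacy-preserving techniques discussed below.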
There is no guarantee that external information will enhance the data mining process: non-pertinent data can degrade the model's performance. Feature selection is a technique for selecting relevant attributes to build more accurate data mining models. It reduces dimensionality, speeds up learning, and improves model interpretability. There are three categories of feature selection methods: wrapper, filter, and embedded methods. Wrapper methods treat the selection of a set of features as a search problem, in which different combinations are prepared, evaluated, and compared; a predictive model evaluates each combination of features and assigns it a score based on model accuracy (Brownlee, 2014). Filter methods apply a statistical measure to assign a score to each attribute; the features are ranked by score and either kept or removed from the dataset. Embedded methods combine the qualities of filter and wrapper methods and are implemented by algorithms that have their own built-in feature selection mechanisms (Kaushik, 2016).
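As a minimal illustration of the filter and wrapper categories, the sketch below uses a toy dataset and simple stand-ins for the components the text describes: the filter's statistical measure is the absolute difference of class means, and the wrapper's predictive model is a nearest-centroid classifier scored on the training data (a real wrapper would use cross-validation). All names and values are hypothetical.

```python
from itertools import combinations

# Toy dataset (hypothetical values): rows are patients, columns are
# features; y holds class labels (0 = benign, 1 = malignant).
# Feature 0 separates the classes; features 1 and 2 are noise.
X = [
    [1.0, 0.2, 5.0],
    [1.1, 0.9, 5.2],
    [3.0, 0.3, 5.1],
    [3.2, 0.8, 4.9],
]
y = [0, 0, 1, 1]

def filter_scores(X, y):
    """Filter method: score each feature with a statistical measure,
    here the absolute difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        mean0 = sum(r[j] for r, c in zip(X, y) if c == 0) / y.count(0)
        mean1 = sum(r[j] for r, c in zip(X, y) if c == 1) / y.count(1)
        scores.append(abs(mean1 - mean0))
    return scores

def centroid_acc(subset):
    """Stand-in predictive model: nearest-centroid classifier,
    scored by resubstitution accuracy on the training data."""
    cen = {}
    for c in set(y):
        rows = [[r[j] for j in subset] for r, lab in zip(X, y) if lab == c]
        cen[c] = [sum(col) / len(col) for col in zip(*rows)]
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    hits = 0
    for r, lab in zip(X, y):
        point = [r[j] for j in subset]
        pred = min(cen, key=lambda c: dist(point, cen[c]))
        hits += pred == lab
    return hits / len(y)

def wrapper_search(X, y, evaluate):
    """Wrapper method: treat selection as a search problem, scoring
    every feature combination with the model's accuracy."""
    best, best_acc = None, -1.0
    for k in range(1, len(X[0]) + 1):
        for subset in combinations(range(len(X[0])), k):
            acc = evaluate(subset)
            if acc > best_acc:
                best, best_acc = subset, acc
    return best

print(filter_scores(X, y))                 # feature 0 scores highest
print(wrapper_search(X, y, centroid_acc))  # (0,)
```

Both routes single out feature 0 here, but the wrapper pays for its model-driven search with exponentially many subset evaluations, which is why the choice of category matters at scale.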
Due to laws and other concerns that prohibit the disclosure of private information about individuals, organizations such as hospitals, clinics, and laboratories are reluctant to share their local data. However, the data mining process needs complete access to the data to construct accurate models. Thus, in the last decade, privacy preservation of sensitive data has become an important topic; it must be incorporated into the entire data mining process. Privacy-preserving feature selection has received great attention. Many solutions have been proposed for distributed data, but few of them use wrapper methods without perturbing the original data. The challenge with perturbation techniques is to find a good tradeoff between privacy and accuracy (Zhong & Wright, 2005): the more patients' private information is protected, the less accurate the results the miner obtains; conversely, the more accurate the results, the less privacy for the patients.
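The privacy/accuracy tradeoff of perturbation techniques can be made concrete with a small, hypothetical sketch (not the article's method): each record is released with additive Laplace noise, and a larger noise scale yields more privacy at the cost of a less accurate estimate of the statistic the miner wants.

```python
import math
import random

random.seed(0)  # reproducible run

# Hypothetical patient ages; the mean is the statistic a miner wants.
ages = [34, 51, 47, 62, 29, 55, 41, 68, 38, 59]
true_mean = sum(ages) / len(ages)

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def perturbed_mean(values, scale):
    """Perturbation technique: release each record with additive
    noise, then average. Larger scale = more privacy, less accuracy."""
    noisy = [v + laplace_noise(scale) for v in values]
    return sum(noisy) / len(noisy)

for scale in (0.1, 1.0, 10.0):
    err = abs(perturbed_mean(ages, scale) - true_mean)
    print(f"noise scale {scale:>4}: error in the mean = {err:.2f}")
```

Running the loop shows the estimation error growing with the noise scale, which is precisely the tension the text describes: wrapper-based approaches that avoid perturbing the original data sidestep this tradeoff.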