Introduction
Many real-life applications of data mining face problems in preserving the privacy of the data (Anderson, 2010; Acs & Castelluccia, 2011; Dansana, 2012; van Dijk et al., 2010; Chowdhuri, 2014; Sarkar et al., 2017). First, certain attributes of the data, or combinations of attributes, might leak personally identifiable information. Second, the data may be split across multiple nodes, either horizontally or vertically, and the nodes may not be allowed to transfer their data to one another. Finally, the use of the resulting data model might be restricted, and some rules may lead to legal violations when they are used for individual profiling. Privacy-preserving data mining (PPDM) (Agrawal, 1994) has arisen to address these issues. The majority of PPDM techniques are modified versions of standard data mining algorithms, where the modification adds cryptographic mechanisms that guarantee privacy for the application. In many cases, the constraints on PPDM are preserving data accuracy and retaining the performance of the mining process while satisfying the privacy restrictions. The many methodologies used in PPDM can be summarized along the following dimensions:
- Data Distribution: This dimension concerns how the data is distributed. Approaches adopt either centralized or decentralized data distribution. Generally, data distribution can be categorized as horizontal or vertical. Horizontal data splitting is discussed in detail in the forthcoming sections; in vertical distribution, the values of different attributes are stored at different places.
- Data Alteration: This changes the actual data into another form before it is released to the public, in order to achieve data privacy. Data modification mechanisms include perturbation, blocking, aggregation (Chen et al., 2014; Won et al., 2014), swapping, and sampling.
- Privacy Preservation: This assures the delivery of data to the intended data miner by applying data alteration before delivery. The data is distributed among more than one node without revealing the data held at any individual site. In the classification phase, the results are given to a designated node, which performs the classification and checks for the occurrence of certain rules without disclosing them.
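The data distribution and data alteration dimensions can be illustrated with a minimal Python sketch. The toy table, the attribute names, and the helper functions `split_horizontal`, `split_vertical`, `perturb`, and `swap` are all hypothetical and chosen only for illustration; they are not taken from any of the cited works.

```python
import random

# Hypothetical toy table: each record describes one individual.
records = [
    {"id": 1, "age": 34, "zip": "10001", "income": 52000},
    {"id": 2, "age": 45, "zip": "10002", "income": 61000},
    {"id": 3, "age": 29, "zip": "10003", "income": 48000},
    {"id": 4, "age": 52, "zip": "10004", "income": 75000},
]

def split_horizontal(rows, n_sites):
    """Horizontal distribution: each site holds a subset of the records,
    and every record keeps all of its attributes."""
    return [rows[i::n_sites] for i in range(n_sites)]

def split_vertical(rows, attribute_groups, key="id"):
    """Vertical distribution: each site holds every record, but only its
    own group of attributes (plus a shared key for joining)."""
    return [
        [{name: row[name] for name in (key, *attrs)} for row in rows]
        for attrs in attribute_groups
    ]

def perturb(rows, attr, scale, rng):
    """Perturbation: add zero-mean Gaussian noise to one numeric attribute.
    Returns modified copies; the input rows are left unchanged."""
    return [{**row, attr: row[attr] + rng.gauss(0, scale)} for row in rows]

def swap(rows, attr, rng):
    """Swapping: shuffle the values of one attribute across records, so the
    column's overall distribution is kept but row linkage is broken."""
    values = [row[attr] for row in rows]
    rng.shuffle(values)
    return [{**row, attr: value} for row, value in zip(rows, values)]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
horizontal_sites = split_horizontal(records, n_sites=2)
vertical_sites = split_vertical(records, [("age", "zip"), ("income",)])
noisy = perturb(records, "income", scale=1000.0, rng=rng)
swapped = swap(records, "age", rng=rng)
```

In this sketch the horizontal sites each see complete records for different individuals, whereas the vertical sites see every individual but only part of each record. The noise scale and the choice of attribute to swap are arbitrary; in practice they would be set by the privacy and accuracy requirements of the application.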
Many authors have proposed techniques to provide confidentiality in data mining (Aggarwal, 2010; Oliveira, 2004; Rawat, 2015; Fouad & Hassan, 2016). Elaine Shi et al. (2011) proposed a time-series-based aggregation mechanism to attain data privacy, in which group participants periodically upload encrypted values to a group aggregator (GA). In every time period, the aggregator computes the sum of the participants' values without knowledge of the individual values. This technique achieves strong privacy.
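The idea that an aggregator can compute a sum from encoded values can be sketched with simple additive masking. This is a simplified analogue for illustration only, not the cryptographic construction of Shi et al.: it assumes a trusted setup that hands each participant a random mask, with all masks summing to zero modulo a public modulus, and it assumes fresh masks are issued for every time period. The names `MODULUS`, `make_masks`, `encode`, and `aggregate` are hypothetical.

```python
import secrets

# Assumed public modulus; it must exceed the largest possible true sum.
MODULUS = 2**61 - 1

def make_masks(n_participants, modulus=MODULUS):
    """Trusted setup (an assumption of this sketch): draw random masks
    that sum to zero modulo `modulus`, one per participant."""
    masks = [secrets.randbelow(modulus) for _ in range(n_participants - 1)]
    masks.append(-sum(masks) % modulus)
    return masks

def encode(value, mask, modulus=MODULUS):
    """Participant side: hide the value behind the participant's mask."""
    return (value + mask) % modulus

def aggregate(encoded_values, modulus=MODULUS):
    """Aggregator side: the masks cancel in the sum, so only the total of
    the original values is recovered, not any individual value."""
    return sum(encoded_values) % modulus

# One time period with four participants and hypothetical readings.
readings = [12, 7, 30, 5]
masks = make_masks(len(readings))
encoded = [encode(value, mask) for value, mask in zip(readings, masks)]
total = aggregate(encoded)
assert total == sum(readings)
```

Each encoded value on its own looks uniformly random to the aggregator, yet the sum over all participants equals the true total. Reusing the same masks across periods would leak differences between a participant's consecutive values, which is why this sketch assumes new masks per period.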