Data encryption, differential privacy, k-anonymity, and many other techniques have been proposed to protect users' data privacy. The idea of k-anonymity was proposed by Samarati and Sweeney (Samarati and Sweeney, 1998). The key idea of k-anonymity is to make individuals indistinguishable in a released table: each tuple representing an individual must be identical, on the identifying attributes, to at least k-1 other tuples. This method has been widely used because of its simplicity.
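The definition above can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries; the function name `is_k_anonymous`, the parameter `quasi_identifiers` (standing for the identifying attributes), and the sample rows are illustrative choices made here, not taken from the paper.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of identifying values occurs in at least k rows."""
    # Count how many rows share each combination of identifying attribute values.
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical released table: age and zip are identifying, disease is sensitive.
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))
print(is_k_anonymous(table, ["age", "zip"], 3))
```

Each tuple in this sample table shares its identifying values with exactly one other tuple, so the table satisfies the condition for k = 2 but not for k = 3.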
Algorithms for k-anonymity can be divided into three types: global recoding, multidimensional recoding, and local recoding. Global recoding algorithms, such as Datafly (Sweeney, 2002b), Incognito (LeFevre et al., 2005), TopDown (Fung et al., 2005), and BottomUp (Wang et al., 2004), require that, for each attribute, the values in all tuples of the dataset have the same generalization form. Although these algorithms have low computational complexity, they may cause over-generalization. Multidimensional recoding, such as Mondrian, maps a set of values to another set of values, some of which are more general than the corresponding pre-mapping values, but this model does not consider attribute hierarchies. Local recoding algorithms allow values of the same attribute to fall in different generalization domains. The information loss of local recoding algorithms is low, but their execution time is longer than that of global recoding algorithms, and this model does not consider attribute hierarchies either. Typical local recoding algorithms are KACA (Li et al., 2006), MDAV (Torra, 2004), and its L-diversity model (Jianmin et al., 2008). Finding an optimal k-anonymization is NP-hard, so existing research uses heuristic strategies to obtain approximately optimal solutions.
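To illustrate why global recoding can over-generalize, here is a minimal sketch for a single numerical attribute. It is not Datafly or any of the cited algorithms: the functions `generalize_age` and `global_recode`, the list of candidate interval widths, and the sample ages are all assumptions made for this example. The one property it shares with global recoding is that the same generalization level is applied to every value.

```python
from collections import Counter

def generalize_age(age, width):
    """Map an age to the fixed-width interval that contains it."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def global_recode(ages, k, widths=(1, 5, 10, 20, 50, 100)):
    """Apply one interval width to all values, widening until every interval holds >= k values.

    Assumes `ages` is a non-empty list of non-negative integers.
    """
    for width in widths:
        recoded = [generalize_age(a, width) for a in ages]
        if min(Counter(recoded).values()) >= k:
            return width, recoded
    # No candidate width works: suppress the attribute entirely.
    return None, ["*"] * len(ages)

ages = [21, 23, 24, 36, 38, 62, 79]
width, recoded = global_recode(ages, k=2)
print(width, recoded)
```

In this sample, the two oldest values only fall into a shared interval at width 20, so every other value is also coarsened to width 20, even though the five younger values would already form groups of at least two at width 5. A local recoding scheme could keep the finer intervals for those values.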
The k-anonymity model alone is not sufficient to protect privacy, because there are attacks it cannot resist, such as the homogeneity attack, the similarity attack, and the probability attack. Many models have been proposed to resist these attacks, such as p-sensitive k-anonymity (Truta & Vinay, 2006), (alpha,k)-anonymity (Wong et al., 2006), L-diversity (Machanavajjhala et al., 2007), (a,d)-diversity (Wang & Shi, 2009), t-closeness (Li et al., 2007), (ω; γ,k)-anonymity (Huang et al., 2014), and (l,t)-closeness anonymization (Yang et al., 2015).
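The homogeneity attack can be made concrete with a small check. The sketch below computes the simplest ("distinct") form of L-diversity: the smallest number of different sensitive values found in any group of tuples that share identifying values. The function name `distinct_l_diversity` and the sample table are hypothetical, and the cited L-diversity work also defines stricter variants that this sketch does not cover.

```python
from collections import defaultdict

def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Return the minimum number of distinct sensitive values over all groups.

    Assumes `rows` is a non-empty list of dictionaries.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical 2-anonymous table: every group has two tuples.
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

print(distinct_l_diversity(table, ["age", "zip"], "disease"))
```

The result for this table is 1: both tuples in the second group carry the same sensitive value, so an attacker who knows that a person falls in that group learns the disease despite the table being 2-anonymous.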
In this paper, we analyze the defects of global recoding and propose a new algorithm, Divide-Datafly. Through experiments, we compare the proposed algorithm with Datafly, Incognito, and KACA. The experimental results on three different datasets show that Divide-Datafly is suitable for datasets with numerical attributes: it improves the speed of anonymization and reduces information loss. We also put forward an L-diversity model of the proposed algorithm, based on a clustering method, and present experiments that analyze its execution time and information loss.