1. Introduction
Privacy can be preserved by many methods, such as securing private information with cryptographic or access control techniques. However, these techniques (Narayan S, 2010; Qian H, 2015; Wang J. H, 2010) do not provide data utility. Data anonymization techniques provide data utility while preserving privacy. Hence, there is considerable scope for anonymization techniques that allow data to be used for research, statistical analysis for decision making, and forecasting (Fung, 2010). Personal data, also called Personally Identifiable Information (PII), is either information that relates to an identifiable living individual or pieces of information that, collected together, can lead to the identification of a particular person. Examples of the former are a name and surname, home address, personal email address, identification card number, Internet Protocol (IP) address, cookie ID, and location data. Examples of the latter are data held by a hospital or doctor, census data, and data provided at the workplace. Data anonymization is the process of information sanitization whose main aim is to preserve the privacy of personally identifiable information present in a dataset (Graham, 2009; L. Willenborgand, 1996). Anonymized data becomes meaningless if data utility is not considered: raw data has full utility but no privacy, whereas completely anonymized data has perfect privacy but no utility.
Privacy Preserving Data Publishing (PPDP) is a set of methods and tools whose objective is to publish data that remains practically useful while preserving individual privacy (C M Fung, 2010; Sattar, et al., 2013; Xu, Feng, et al., 2019). It is an important research issue in data privacy and security, network security, cyber-physical systems, information security, and related domains. Most data today is published in the form of microdata (Winkelmann R, 2006). A microdata file contains n records, and each record may contain m variables, also called attributes, of the individual about whom the information is collected. Let T be the original microdata table. At the basic level of PPDP, the identifiers are removed from the published microdata and anonymization methods are applied to the quasi-identifiers; the resulting table has the form T' (Quasi-Identifiers, Sensitive Attributes). Following the literature (Samarati P, 2001), the attributes in the microdata file can be classified and defined as follows:
1. Quasi-identifiers (QIDs): attributes that can be used to identify individuals, but not uniquely, for example a person's age or zip code.
2. Confidential/sensitive attributes (SA): a person's sensitive information that needs to be secured, for example diagnosis report, community, disease, salary, occupation, investments, or opinion polls.
The microdata file is published after removing the identifiers that directly identify an individual. However, the Quasi-Identifiers (QIDs) and Sensitive Attributes (SA) are retained in the published data, as they contain values useful for analytics, study, and research. These values are subjected to anonymization to ensure that the published microdata is safe from possible attacks such as the background knowledge attack, homogeneity attack, attribute linkage attack, skewness attack, and similarity attack. A complete list of privacy attacks on published data is provided in (Sowmyarani C N, 2017).
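The basic publishing step described above, which removes identifiers and anonymizes the quasi-identifiers to obtain T' from T, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the column names (`name`, `age`, `zip_code`, `disease`), their assignment to identifier, quasi-identifier, and sensitive roles, and the use of simple generalization (age ranges, masked zip codes) as the anonymization method are all hypothetical and not taken from the article.

```python
# Hypothetical microdata table T: one dict per record.
# Column names and roles are assumptions made for this illustration.
IDENTIFIERS = {"name"}                   # directly identify an individual
QUASI_IDENTIFIERS = {"age", "zip_code"}  # identify, but not uniquely
SENSITIVE = {"disease"}                  # must be protected

T = [
    {"name": "Alice", "age": 34, "zip_code": "560001", "disease": "Flu"},
    {"name": "Bob", "age": 37, "zip_code": "560004", "disease": "Diabetes"},
]


def generalize_age(age, width=10):
    """Replace an exact age by a range of the given width, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a zip code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def publish(table):
    """Build T': drop identifiers, generalize QIDs, keep sensitive attributes."""
    published = []
    for record in table:
        row = {k: v for k, v in record.items() if k not in IDENTIFIERS}
        row["age"] = generalize_age(row["age"])
        row["zip_code"] = generalize_zip(row["zip_code"])
        published.append(row)
    return published


T_prime = publish(T)
for row in T_prime:
    print(row)
```

Generalization is used here only because it is easy to read; any other anonymization method applied to the quasi-identifiers would fit the same T to T' structure.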
Table 1.
Job | Gender | Age | Salary |
Engineer | Male | 30 | 50,000 |
Engineer | Male | 32 | 50,000 |
Doctor | Female | 35 | 60,000 |
Choreographer | Female | 45 | 35,000 |
Dancer | Male | 40 | 35,000 |
Dancer | Male | 42 | 35,000 |
Doctor | Male | 38 | 60,000 |
Choreographer | Male | 48 | 35,000 |
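To make the attacks listed earlier more concrete, the following minimal sketch checks the rows of Table 1 for groups that would be exposed to a homogeneity attack, in which an adversary who knows a target's quasi-identifier values learns the sensitive value because every record in the matching group shares it. The role assignment is an assumption for illustration: Job and Gender are treated as quasi-identifiers and Salary as the sensitive attribute, which this preview does not state. The function names are likewise hypothetical.

```python
from collections import defaultdict

# Rows of Table 1: (job, gender, age, salary).
TABLE_1 = [
    ("Engineer", "Male", 30, 50000),
    ("Engineer", "Male", 32, 50000),
    ("Doctor", "Female", 35, 60000),
    ("Choreographer", "Female", 45, 35000),
    ("Dancer", "Male", 40, 35000),
    ("Dancer", "Male", 42, 35000),
    ("Doctor", "Male", 38, 60000),
    ("Choreographer", "Male", 48, 35000),
]


def group_by_qid(rows):
    """Group salaries by the assumed quasi-identifier pair (job, gender)."""
    groups = defaultdict(list)
    for job, gender, _age, salary in rows:
        groups[(job, gender)].append(salary)
    return dict(groups)


def homogeneous_groups(rows, min_size=2):
    """Return groups of at least `min_size` records that share one salary."""
    return {
        qid: salaries[0]
        for qid, salaries in group_by_qid(rows).items()
        if len(salaries) >= min_size and len(set(salaries)) == 1
    }


exposed = homogeneous_groups(TABLE_1)
for (job, gender), salary in exposed.items():
    print(f"{job}/{gender}: every record has salary {salary}")
```

Under these assumptions, the groups of size one are not reported because a single record is already uniquely identified by its quasi-identifiers, which is a separate problem from homogeneity.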