Scalable l-Diversity: An Extension to Scalable k-Anonymity for Privacy Preserving Big Data Publishing

Udai Pratap Rao (Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology, Surat, India), Brijesh B. Mehta (Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology, Surat, India) and Nikhil Kumar (Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology, Surat, India)
DOI: 10.4018/IJITWE.2019040102

Abstract

Privacy-preserving data publishing has been one of the most active research areas in recent years. Billions of devices are now capable of collecting data from various sources. To preserve privacy while publishing data, this article proposes algorithms for equivalence class generation and scalable anonymization with k-anonymity and l-diversity using the MapReduce programming paradigm. The equivalence class generation algorithms divide the dataset into equivalence classes for Scalable k-Anonymity (SKA) and Scalable l-Diversity (SLD) separately. These equivalence classes are then fed to the anonymization algorithm, which calculates the Gross Cost Penalty (GCP) for the complete dataset. The value of GCP indicates the information loss in the input dataset after anonymization.

Introduction

The success or failure of any organization depends heavily on the analysis of its business/transaction data. However, such data is massive in size and cannot be analyzed by traditional analytical methods. Analyzing and handling it requires a distributed environment such as the MapReduce framework (Dean & Ghemawat, 2008), in which the large volume of data is distributed over many machines for processing and analysis. Organizations commonly make their business data public for use by researchers. This business/transaction data contains private information about their customers, so organizations need to anonymize it before publishing.

Preserving privacy while keeping the utility of the data high is a major challenge in big data publishing, because data are collected from different sources, which may lead to privacy issues (Wu, Zhu, Wu & Ding, 2014; Mehta & Rao, 2016). Anonymization reduces the utility of the underlying data, so publishing big data with high utility while preserving the privacy of individuals remains an open research problem. In this paper, we discuss two privacy models, k-anonymity and l-diversity, and propose scalable algorithms for both. The results of SKA and SLD are also compared for different values of k and l on a large dataset.

Privacy Models

Mehta, Rao, Kumar & Gadekula (2016) discussed different privacy models and concluded that k-anonymity and l-diversity are better suited to preserving privacy in big data. As l-diversity is an extension of k-anonymity, a dataset must first be k-anonymized before it can be l-diversified. In both approaches, the attributes of a dataset are categorized into four types: Personal Information Identifier (PII), Quasi Identifier (QID), Sensitive Attribute (SA), and Non-sensitive Attribute. A PII uniquely identifies an individual, so it is removed from the published table. A QID is a set of one or more attributes that cannot identify the data owner on its own but, when combined with a publicly available dataset, may reveal the identity and sensitive value of an individual. An attribute that the data owner does not want to disclose is a sensitive attribute. All attributes other than PII, QID, and SA are non-sensitive attributes. k-anonymity and l-diversity are discussed in turn below. Table 1 shows an example of published patient data. In this dataset, UID is the PII; Sex, ZIP Code, and Age are the QIDs; and Disease is the SA.

Table 1. Original patient data

S#  UID           Sex  ZIP Code  Age  Disease
1   728953467896  M    852219    34   HIV
2   786545678901  M    852227    32   Flu
3   456732190876  M    855007    43   Flu
4   678904523679  M    855010    49   Malaria
5   890567432673  F    853457    54   Cancer
6   976543097645  F    853401    51   Cancer
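
To make the attribute roles concrete, the following Python sketch groups the rows of Table 1 into equivalence classes by their generalized QID values and checks the two privacy models. It is an illustration only, not the article's SKA or SLD algorithm. Several things are assumptions made for the example: each row is read as a six-digit ZIP code followed by a two-digit age, the generalization rule (keep the first three ZIP digits, bucket age by decade) is hypothetical, the function names are invented here, and l-diversity is checked in its simplest "distinct values" form.

```python
from collections import defaultdict

# Rows of Table 1 with the PII column (UID) already removed.
# Layout: (Sex, ZIP Code, Age, Disease).
ROWS = [
    ("M", "852219", 34, "HIV"),
    ("M", "852227", 32, "Flu"),
    ("M", "855007", 43, "Flu"),
    ("M", "855010", 49, "Malaria"),
    ("F", "853457", 54, "Cancer"),
    ("F", "853401", 51, "Cancer"),
]


def generalize(row):
    """Hypothetical generalization: 3-digit ZIP prefix, age bucketed by decade."""
    sex, zip_code, age, disease = row
    decade = age // 10 * 10
    qid = (sex, zip_code[:3] + "***", f"{decade}-{decade + 9}")
    return qid, disease


def equivalence_classes(rows):
    """Group sensitive values by generalized QID tuple."""
    classes = defaultdict(list)
    for row in rows:
        qid, sensitive = generalize(row)
        classes[qid].append(sensitive)
    return dict(classes)


def is_k_anonymous(classes, k):
    """Every equivalence class must contain at least k records."""
    return all(len(values) >= k for values in classes.values())


def is_l_diverse(classes, l):
    """Every class must contain at least l distinct sensitive values."""
    return all(len(set(values)) >= l for values in classes.values())


if __name__ == "__main__":
    classes = equivalence_classes(ROWS)
    for qid, diseases in classes.items():
        print(qid, diseases)
    print("2-anonymous:", is_k_anonymous(classes, 2))
    print("2-diverse:", is_l_diverse(classes, 2))
```

Under this generalization the six records fall into three classes of two records each, so the table is 2-anonymous. It is not 2-diverse, because both records in the female class carry the same disease, which is the kind of disclosure l-diversity is meant to prevent.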
