De-Identification of Unstructured Textual Data using Artificial Immune System for Privacy Preserving


Amine Rahmani (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Abdelmalek Amine (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Reda Mohamed Hamou (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria), Mohamed Amine Boudia (Department of Computer Science, Dr. Tahar Moulay University of Saida, Saida, Algeria) and Hadj Ahmed Bouarara (GeCoDe laboratory, Department of Computer Sciences, Dr. Tahar Moulay University of Saida, Algeria)
Copyright: © 2016 |Pages: 16
DOI: 10.4018/IJDSST.2016100103


The development of new technologies has brought the world to a tipping point. One of these technologies is big data, which has revolutionized computer science. Big data has come with new challenges, which can be summarized as the aim of creating scalable and efficient services that can process huge amounts of heterogeneous data in a short time while preserving users' privacy. Textual data occupy a large share of the internet. These data may contain information that can lead to the identification of users. For that reason, the development of approaches that can detect and remove any identifiable information has become a critical research area known as de-identification. This paper tackles the problem of privacy in textual data. The authors' proposed approach uses artificial immune systems and MapReduce to detect and hide identifiable words, regardless of their variants, using the personal information from the user's profile. After many experiments, the system shows high efficiency in terms of the number of detected words, the way they are hidden, and execution time.
Article Preview


“Big Data” is one of the terms that has created a buzz in the last few years. The term has spread to describe the explosion of data over the web. However, there is no exact definition of the concept of Big Data. Some experts define it as more data than can fit on a personal computer. Others go further by defining it as not only the massive amounts of data but also the tools that reveal the patterns within it. Still others have chosen to be more metaphorical, defining Big Data as the process of helping the planet grow a nervous system in which humans are just another type of sensor. Rick Smolan, writer and editor of the book “The Human Face of Big Data”, wrote an essay in that book entitled “A Planetary Nervous System” in which he defined Big Data as: “… an extraordinary knowledge revolution that is sweeping, almost, invisibly through business, academia, government, health care, and everyday life…” (Smolan, 2013).

One of the advantages of big data services is the ability to share and publish data over the network. These data can be sorted into two major categories: normal data, such as books and other textual documents, and sensitive information, such as names, medical records, and social information in general. The latter requires a high level of protection because of its importance and sensitivity: if the pieces are linked together, they offer at a certain point a complete view of a person and lead in many cases to a unique identification, even if the data do not contain any explicit identifiers. The aggregation of this information can form an identity as unique as a fingerprint. In addition, once data are stored on the web, they become accessible to and processable by a third party and, therefore, by other people who share the same resources, which makes privacy an essential goal. This has given birth to a new domain known as Privacy Preserving Data Publishing (PPDP) (Vassilios, 2004; Evfimievski, 2009), which offers a set of methods and techniques for protecting users' privacy. Much work has been carried out in this area and many approaches have been published; they can be grouped into three main categories:

  • Heuristic-based approaches, in which data mining algorithms are used to adaptively modify selected data. These rest on the fact that selective data modification is an NP-hard problem, so this group of methods targets complex problems.

  • Cryptography-based approaches, represented by secure multiparty computation, in which privacy is guaranteed on the basis of a probabilistic function so that, at the end of the multiparty computation, no party knows anything except its own input and the final result of the computation.

  • Perturbation and reconstruction of data, in which the proposed approaches protect data by randomizing it and reconstructing its distribution at an aggregate level.
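The third group can be illustrated with randomized response, a classic perturbation scheme. The following is only a minimal sketch under stated assumptions: the function names `randomize` and `estimate_proportion` and the parameter `p_truth` are hypothetical, and the scheme is chosen for illustration rather than taken from the works cited above. Each user reports a perturbed bit, and only the aggregate proportion is reconstructed.

```python
import random


def randomize(bit, p_truth, rng):
    """Return the true bit with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return bit
    return 1 if rng.random() < 0.5 else 0


def estimate_proportion(reports, p_truth):
    """Reconstruct the share of 1s in the original data from perturbed reports.

    The expected observed share is p_truth * true_share + (1 - p_truth) * 0.5,
    so the true share is recovered by inverting that relation.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```

No individual report reveals the user's true value with certainty, yet the estimate approaches the true proportion as the number of reports grows. A lower `p_truth` gives stronger protection at the cost of a noisier estimate.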

One of the techniques of PPDP is de-identification, in which a system detects and removes any information that could reveal the identity of a user through his or her own data. The Privacy Technical Assistance Centre published a report in 2013 (PTAC, 2013) in which it defined de-identification as: “… the process of removing or obscuring any personally identifiable information from student records in a way that minimizes the risk of unintended disclosure of the identity of individuals and information about them…”. In this work, we propose a new approach based on an artificial immune system to ensure privacy by detecting and modifying the information that leads to the identity of users. The rest of the paper begins with a presentation of basic concepts such as PPDP and its techniques, focusing on de-identification and the modification technique. We then present our idea and its results. Finally, we discuss the results and give the final conclusion.
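To make the idea of hiding profile-derived words and their variants concrete, a minimal sketch follows. It is not the artificial immune system or MapReduce pipeline proposed in this paper: it swaps in plain string similarity from Python's standard `difflib` module, and the names `is_variant` and `deidentify`, the `[REDACTED]` mask, and the 0.8 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def is_variant(token, profile_terms, threshold=0.8):
    """Return True if the token is close enough to any term from the user's profile."""
    lowered = token.lower()
    return any(
        SequenceMatcher(None, lowered, term.lower()).ratio() >= threshold
        for term in profile_terms
    )


def deidentify(text, profile_terms, mask="[REDACTED]", threshold=0.8):
    """Replace every word that matches a profile term, or a near variant, with a mask."""

    def replace(match):
        word = match.group(0)
        return mask if is_variant(word, profile_terms, threshold) else word

    # Only word characters are examined, so punctuation and spacing are preserved.
    return re.sub(r"\w+", replace, text)
```

For example, `deidentify("Contact Amin today", ["amine"])` masks the misspelled name because its similarity to the profile term exceeds the threshold. A fixed threshold is a crude stand-in for variant detection and will produce both false matches and misses on real text.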
