A Semantic Approach for Semi-Automatic Detection of Sensitive Data

Jacky Akoka (CEDRIC-CNAM & Institut Mines-Telecom TEM, Paris, France), Isabelle Comyn-Wattiau (CEDRIC-CNAM & ESSEC Business School, Paris, France), Cédric Du Mouza (CEDRIC-CNAM, Paris, France), Hammou Fadili (Maison des Sciences de l'Homme, Paris, France), Nadira Lammari (CEDRIC-CNAM, Paris, France), Elisabeth Metais (CEDRIC-CNAM, Paris, France) and Samira Si-Said Cherfi (CEDRIC-CNAM, Paris, France)
Copyright: © 2014 | Pages: 22
DOI: 10.4018/irmj.2014100102

This article proposes an innovative approach, implemented as an expert system, for the semi-automatic detection of candidate attributes for scrambling sensitive data. The approach is based on semantic rules that determine which concepts have to be scrambled, and on a linguistic component that retrieves the attributes that semantically correspond to these concepts. Because attributes cannot be considered independently of each other, it also addresses the challenging problem of propagating the scrambling process through the entire database. One main contribution of the approach is to provide a semi-automatic process for the detection of sensitive data. The underlying knowledge is made available through production rules that operationalize the detection of sensitive data. The approach is validated on four different databases.

In an ever-changing competitive environment, organizations are under increasing pressure to protect sensitive data about both individuals and organizations. Companies face the challenge of creating and/or updating non-production environments for testing purposes. In general, they create a copy of the production system, which may include the data repository and the administrative settings, and use it as a test environment for improving the application delivery process. However, test environments that are open to external consultants carry many risks, and in some cases testing can become a liability. Thus, there is a need to provide test data that are realistic yet do not expose sensitive information, which requires masking techniques. With masked data, users see only a representation of the data and have no access to the sensitive values. Data related to customers, products, materials, and financial accounts may be sensitive and should be masked or anonymized. Sensitive information such as addresses, telephone numbers, and contact information has to be de-identified. Scrambling substitutes sensitive information on customers, orders, products, order profitability, etc., with fictitious but still consistent data, preserving the overall structure and semantics of the test database.

Data masking, also referred to as data obfuscation, data de-identification, data depersonalization, or data scrubbing, among other terms, is a solution for protecting data from both internal and external security threats. It enables the creation and/or updating of data in non-production environments without the risk of exposing sensitive information to unauthorized users, such as external consultants in environments like ERP systems. Unlike encrypted data, masked data remain usable in testing environments. Data masking encompasses several techniques, such as generalization, mutation algorithms, customization, etc. Shuffling techniques can protect names. A related technique called linked shuffling can de-identify addresses. Phone numbers can be scrambled using random number generators. A date transformer obfuscates dates. Finally, an account generator de-personalizes account numbers. Data masking thus provides several benefits, such as supplying realistic data for off-site and offshore software testing.

Even though techniques and tools are available, scrambling huge databases is a tedious process. Surveys reveal that companies often neglect to scramble their datasets, which generates business and financial risks. ERP systems rely on thousands of tables, each composed of more than twenty columns. Deciding which columns are sensitive and have to be scrambled is an enormous task.
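The masking techniques named above (shuffling, linked shuffling, random-digit scrambling of phone numbers, and date transformation) can be illustrated with a minimal Python sketch. It is not taken from the article: the table layout, the column names (`name`, `street`, `city`, `phone`, `birth_date`), the function names, and the 30-day shift window are all assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta

def shuffle_column(rows, column, rng):
    """Shuffling: permute the values of one column across the rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

def linked_shuffle(rows, columns, rng):
    """Linked shuffling: permute several columns together, so that each
    (street, city) pair stays internally consistent after masking."""
    groups = [tuple(row[c] for c in columns) for row in rows]
    rng.shuffle(groups)
    return [{**row, **dict(zip(columns, group))} for row, group in zip(rows, groups)]

def scramble_phone(phone, rng):
    """Replace every digit with a random digit; keep separators and length."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in phone)

def shift_date(value, rng, max_days=30):
    """Date transformation: shift by a random offset within +/- max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))

def mask_customers(rows, seed=0):
    """Apply the four techniques to a hypothetical customer table."""
    rng = random.Random(seed)  # seeded so that a masking run is reproducible
    rows = shuffle_column(rows, "name", rng)
    rows = linked_shuffle(rows, ("street", "city"), rng)
    return [
        {
            **row,
            "phone": scramble_phone(row["phone"], rng),
            "birth_date": shift_date(row["birth_date"], rng),
        }
        for row in rows
    ]

# Hypothetical sample data, invented for this sketch.
CUSTOMERS = [
    {"name": "Alice Martin", "street": "1 Rue A", "city": "Paris",
     "phone": "+33 1 23 45 67 89", "birth_date": date(1980, 5, 17)},
    {"name": "Bruno Leroy", "street": "2 Rue B", "city": "Lyon",
     "phone": "+33 4 11 22 33 44", "birth_date": date(1975, 11, 2)},
    {"name": "Chloe Petit", "street": "3 Rue C", "city": "Lille",
     "phone": "+33 3 55 66 77 88", "birth_date": date(1992, 1, 30)},
]

if __name__ == "__main__":
    for masked_row in mask_customers(CUSTOMERS, seed=42):
        print(masked_row)
```

The masked rows keep the original set of names and of (street, city) pairs, the layout of each phone number, and plausible dates, which is what makes them usable as test data. Real masking tools apply further constraints (uniqueness, referential integrity) that this sketch leaves out.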

Protecting data privacy using scrambling is composed of three steps. The first step deals with the choice of data to be hidden, notably anonymized, randomized, swapped or, more generally, obfuscated (Bakken, Parameswaran, Blough, Franz, & Palmer, 2004). The second step consists of choosing, for each sensitive part of the database, the adequate scrambling technique, particularly but not exclusively among those mentioned above (Fung, Wang, Chen, & Yu, 2010). The third step is the application of data sanitization to the entire dataset while preserving data integrity. To the best of our knowledge, the literature and industrial solutions concentrate on this third step (Ravikumar et al., 2011; Askari, Safavi-Naini, & Barker, 2012). This paper contributes to the first step by proposing an innovative technique that automates the detection of the sensitive attributes. By semantically modeling the different data, we enable the semi-automatic detection of data sensitivity. This technique encompasses two functionalities: (1) automatic detection of the values to be scrambled, which have to be validated by a domain expert, and (2) automatic propagation to other semantically linked values.
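The two functionalities can be sketched in Python under strong simplifying assumptions. This is not the authors' expert system: the concept lexicon stands in for the linguistic component, the attribute names and links are invented, and propagation is reduced to a breadth-first traversal of an undirected link graph.

```python
import re
from collections import deque

# Hypothetical rule base: sensitive concept -> terms that denote it.
SENSITIVE_CONCEPTS = {
    "name": {"name", "surname", "lastname"},
    "address": {"address", "street", "city", "zip"},
    "phone": {"phone", "telephone", "tel", "mobile"},
}

def tokens(attribute):
    """Split 'Customer.last_name' or 'custPhone' into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", attribute)
    return set(re.findall(r"[a-z]+", spaced.lower()))

def detect_candidates(attributes):
    """Functionality (1): map each matching attribute to a sensitive concept."""
    candidates = {}
    for attribute in attributes:
        attribute_tokens = tokens(attribute)
        for concept, terms in SENSITIVE_CONCEPTS.items():
            if attribute_tokens & terms:
                candidates[attribute] = concept
                break
    return candidates

def propagate(validated, links):
    """Functionality (2): spread sensitivity along links between attributes."""
    neighbours = {}
    for left, right in links:
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)
    seen = set(validated)
    queue = deque(validated)
    while queue:
        current = queue.popleft()
        for linked in neighbours.get(current, ()):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return seen

# Hypothetical schema and links, invented for this sketch.
ATTRIBUTES = ["Customer.last_name", "Customer.phone", "Product.name",
              "Order.amount", "Invoice.recipient"]
LINKS = [("Customer.last_name", "Invoice.recipient")]

if __name__ == "__main__":
    candidates = detect_candidates(ATTRIBUTES)
    # A domain expert validates the candidates; here the expert is assumed
    # to reject Product.name, a false positive of the naive term matching.
    validated = set(candidates) - {"Product.name"}
    print(sorted(propagate(validated, LINKS)))
```

The `Product.name` false positive shows why the detection is only semi-automatic: term matching proposes candidates, and an expert confirms or rejects them before propagation. In this example, `Invoice.recipient` matches no term on its own and becomes sensitive only through its link to `Customer.last_name`.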

Our contribution is original in the sense that it encapsulates general and domain knowledge into rules. We propose a rule-based approach implemented as an expert system architecture. The rules are devoted to selecting sensitive data with regard to their semantics. Furthermore, we present a deduction mechanism, modeled by a semantic graph, that ensures the propagation of sensitivity to neighboring values and the consistency with the other relations. Moreover, we propose a prototype with a set of interfaces designed to capture the rules. Let us mention three important aspects of our approach:
