Privacy Protection in Enterprise Social Networks Using a Hybrid De-Identification System

Mohamed Abdou Souidi, Noria Taghezout
Copyright: © 2021 | Pages: 15
DOI: 10.4018/IJISP.2021010107

Abstract

Enterprise social networks (ESNs) have been widely used within organizations as a communication infrastructure that allows employees to collaborate with each other and share files and documents. The shared documents may contain a large amount of sensitive information that affects the privacy of persons, such as phone numbers, and that must be protected against any kind of disclosure or unauthorized access. In this study, the authors propose a hybrid de-identification system that extracts sensitive information from textual documents shared in ESNs. The system is based on both machine learning and rule-based classifiers. The gradient boosted trees (GBTs) algorithm is used as the machine learning classifier. Experiments run on a modified CoNLL 2003 dataset show that the GBTs algorithm achieves a very high F1-score (95%). Additionally, the rule-based classifier consists of regular expressions and gazetteers in order to complement the machine learning classifier. Thereafter, the sensitive information extracted by the two classifiers is merged and encrypted using a Format Preserving Encryption method.

Introduction

Social networks (SNs) have become an indispensable tool in the daily life of people. In 2018, out of 4 billion users of the internet around the world, more than 3 billion users were active on social networks (We are social, 2018).

Enterprises, for their part, are adopting social networks for the development of collaboration and information sharing among employees. An enterprise social network (ESN) is a system based on exchanges within collaborative environments in a professional background. The last decade has seen a broad emergence of platforms dedicated to this new dimension of social networks, and many ESNs have emerged.

However, employees tend to share different types of documents, bills and records in the enterprise social network. Among the shared files, we find a considerable amount of sensitive information and regulated data such as credit card numbers, Social Security Numbers, drivers' license information, names and nationalities.

Sensitive information is data that requires protection against any unauthorized disclosure or access. Several types of sensitive information exist, such as Protected Health Information (HIPAA Journal, 2019), Personal Information (Identity Theft Protection Act, 2005), and Customer record information (Privacy of Consumer Financial Information, 2016).

In addition, shared files containing unprotected sensitive information can raise several privacy concerns. Indeed, personal data can be traced back to an individual, which could eventually result in identity theft as well as the disclosure of information that individuals desire to remain private. Hence, the de-identification approach has emerged to protect the privacy of individuals and companies.

De-identification refers to the process of removing personally identifiable information from shared, generated, or archived data so that the remaining data becomes highly difficult to trace back to an individual. However, de-identification is far from being just a simple method; instead, it is a set of tools, techniques and algorithms applied to different types of data. Overall, it serves to protect the privacy of individuals and organizations while also minimizing the risk of data exposure. De-identifying data can thereby help organizations to use information more effectively than before (Garfinkel, 2015).

Therefore, de-identification represents a powerful privacy protection tool that covers a variety of areas such as big data, data mining, communication and social networks, and it is particularly relevant to textual data. There are two main groups of methodologies employed in existing text de-identification applications: pattern matching and machine learning (Meystre et al., 2014), and some works combine both. Pattern matching applications usually depend on human-defined patterns (regular expressions and gazetteers) and are easy to implement, tune and use. Moreover, they do not require any training data (tagged data).
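As a rough illustration of the pattern matching approach, the following Python sketch combines a few regular expressions with a small gazetteer and replaces each match with a placeholder. The patterns, the gazetteer entries and the function names (find_sensitive, redact) are assumptions chosen for this example; they are not the rules used by the authors, and a real system would need much broader, locale-aware patterns.

```python
import re

# Hypothetical regular expressions for a few sensitive-data formats.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Hypothetical gazetteer: a lookup list of known terms per category.
GAZETTEER = {
    "NATIONALITY": {"algerian", "french", "canadian"},
}


def find_sensitive(text):
    """Return a sorted list of (start, end, label, matched_text) spans."""
    spans = []
    # Regular-expression matches.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    # Gazetteer matches: case-insensitive lookup of each alphabetic word.
    for m in re.finditer(r"[A-Za-z]+", text):
        for label, entries in GAZETTEER.items():
            if m.group().lower() in entries:
                spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)


def redact(text):
    """Replace every detected span with a [LABEL] placeholder."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label, _ in reversed(find_sensitive(text)):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


sample = ("Call 555-123-4567 or mail alice@example.com; "
          "the Algerian client's SSN is 123-45-6789.")
print(redact(sample))
```

No training data is involved here: coverage depends entirely on how complete the hand-written patterns and lists are, which is the trade-off the paragraph above describes.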

On the other hand, machine learning (ML) applications rely mainly on training a classifier over labelled data (a dataset) to obtain a model that classifies each word in a given text or document as either sensitive or non-sensitive. Machine learning applications qualify as Named Entity Recognition (NER) applications, since many of the de-identified words fall into a named entity type such as names, places and organizations. However, they require a large, well-annotated corpus to perform well.
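To make the word-level classification setup more concrete, the sketch below shows one way a sentence tagged in a CoNLL-style scheme could be turned into (features, label) training examples. The feature set and the helper names (token_features, to_training_examples) are assumptions made for illustration; the features actually used by the authors are not given in this excerpt. A classifier such as gradient boosted trees would then be trained on examples of this shape.

```python
def token_features(tokens, i):
    """Build a simple, hypothetical feature dictionary for token i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        # Neighbouring words give the classifier some context.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def to_training_examples(tagged_sentence):
    """Map (token, NER tag) pairs to (features, sensitive/non-sensitive) pairs.

    Assumption: any token carrying a named-entity tag (anything other than
    the CoNLL outside tag "O") is treated as sensitive.
    """
    tokens = [tok for tok, _ in tagged_sentence]
    examples = []
    for i, (_, tag) in enumerate(tagged_sentence):
        label = "non-sensitive" if tag == "O" else "sensitive"
        examples.append((token_features(tokens, i), label))
    return examples


sentence = [("Noria", "B-PER"), ("works", "O"), ("in", "O"),
            ("Oran", "B-LOC"), (".", "O")]
examples = to_training_examples(sentence)
for features, label in examples:
    print(features["lower"], label)
```

The quality of such a model depends on how many annotated sentences of this kind are available, which is why large tagged corpora matter for this family of methods.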

Recently, many annotated corpora have appeared. These include the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task dataset (Tjong Kim Sang & De Meulder, 2003); the Informatics for Integrating Biology and the Bedside (i2b2) 2009 dataset (Uzuner, Solti, & Cadag, 2010) and 2014 track 1 dataset (Stubbs, Kotfila, & Uzuner, 2015); the ShARe/CLEF eHealth Evaluation Lab 2013 dataset (Suominen et al., 2013); and the Semantic Evaluation 2014 task 7 (Pradhan, Elhadad, Chapman, Manandhar, & Savova, 2015) and 2016 task 12 datasets (Bethard et al., 2016). Such corpora have contributed to the evolution and development of text de-identification systems.
