Introduction
Social networks (SNs) have become an indispensable part of people's daily lives. In 2018, of the 4 billion internet users worldwide, more than 3 billion were active on social networks (We are social, 2018).
Enterprises, for their part, are adopting social networks to foster collaboration and information sharing among employees. An enterprise social network (ESN) is a system based on exchanges within collaborative environments in a professional setting. The last decade has seen the broad emergence of platforms dedicated to this new dimension of social networks, and many ESNs have appeared.
However, employees tend to share different types of documents, bills and records in the enterprise social network. The shared files contain a considerable amount of sensitive information and regulated data, such as credit card numbers, Social Security numbers, driver's license information, names and nationalities.
Sensitive information is data that requires protection against any unauthorized disclosure or access. Several types of sensitive information exist, such as Protected Health Information (HIPAA Journal, 2019), Personal Information (Identity Theft Protection Act, 2005), and Customer Record Information (Privacy of Consumer Financial Information, 2016).
In addition, shared files containing unprotected sensitive information can raise several privacy concerns. Indeed, personal data can be traced back to an individual, which could eventually result in identity theft as well as the disclosure of information that individuals wish to keep private. Hence, the de-identification approach has emerged to protect the privacy of individuals and companies.
De-identification refers to the process of removing personally identifiable information from shared, generated, or archived data so that the remaining data becomes very difficult to trace back to an individual. However, de-identification is not a single method; it is a set of tools, techniques and algorithms applied to different types of data. Overall, it serves to protect the privacy of individuals and organizations while minimizing the risk of data exposure. De-identifying data can thereby help organizations to use information more effectively than before (Garfinkel, 2015).
Therefore, de-identification represents a powerful privacy protection tool that applies to a variety of areas such as big data, data mining, communication and social networks, and particularly to textual data. Existing text de-identification applications employ two main groups of methodologies: pattern matching and machine learning (Meystre et al., 2014), and some works combine both. Pattern matching applications usually depend on human-defined patterns (regular expressions and gazetteers) and are easy to implement, tune and use. Moreover, they do not require any training (tagged) data.
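To make the pattern matching approach concrete, the following is a minimal sketch that combines regular expressions with a small gazetteer. The patterns, the placeholder labels, the names in the gazetteer and the function name deidentify are hypothetical choices made for illustration; they are not taken from any of the cited systems, and production patterns would need to be considerably more robust.

```python
import re

# Hypothetical, deliberately simple patterns (illustrative assumptions only).
PATTERNS = {
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical gazetteer: a hand-built list of known personal names.
NAME_GAZETTEER = {"Alice Martin", "Bob Stone"}


def deidentify(text):
    """Replace every pattern or gazetteer match with a bracketed label."""
    # Step 1: regular expressions for structured identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Step 2: gazetteer lookup, longest entries first so that a longer
    # name is never partially masked by a shorter one.
    for name in sorted(NAME_GAZETTEER, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text


message = "Alice Martin paid with 4111 1111 1111 1111, SSN 123-45-6789."
print(deidentify(message))
# prints: [NAME] paid with [CREDIT_CARD], SSN [SSN].
```

No training data is involved: coverage depends entirely on how complete the hand-written patterns and the gazetteer are, which is why such systems are easy to tune but miss anything their authors did not anticipate.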
On the other hand, machine learning (ML) applications rely mainly on training a classifier over a labelled dataset to obtain a model that classifies each word in a given text or document as either sensitive or non-sensitive. Such applications qualify as Named Entity Recognition (NER) applications, since many of the de-identified words belong to named entity types such as names, places and organizations. However, they require a large, well-annotated corpus of text to perform well.
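The idea of learning a sensitive/non-sensitive word classifier from labelled data can be illustrated with a toy example. The sketch below trains a simple perceptron on six hand-labelled tokens; the perceptron, the three features, the training tokens and all function names are assumptions chosen to keep the example self-contained, and they stand in for the far richer models and corpora that real de-identification systems use.

```python
from collections import defaultdict


def features(token):
    """Return the names of the (hypothetical) features active for a token."""
    feats = ["bias"]
    if token[:1].isupper():
        feats.append("capitalized")
    if any(ch.isdigit() for ch in token):
        feats.append("has_digit")
    return feats


def score(weights, token):
    """Sum the weights of the features active for this token."""
    return sum(weights[f] for f in features(token))


def train(examples, epochs=5):
    """Perceptron training: adjust weights only when a prediction is wrong."""
    weights = defaultdict(int)
    for _ in range(epochs):
        for token, label in examples:
            predicted = 1 if score(weights, token) > 0 else 0
            if predicted != label:
                for f in features(token):
                    weights[f] += label - predicted
    return weights


def predict(weights, tokens):
    """Label each token 1 (sensitive) or 0 (non-sensitive)."""
    return [1 if score(weights, t) > 0 else 0 for t in tokens]


# Toy labelled data: 1 = sensitive, 0 = non-sensitive (illustrative only).
TRAINING = [
    ("Alice", 1), ("called", 0), ("from", 0),
    ("Paris", 1), ("at", 0), ("555-0199", 1),
]

model = train(TRAINING)
print(predict(model, ["Bob", "emailed", "42", "reports"]))
# prints: [1, 0, 1, 0]
```

With so little data the model only learns that capitalized tokens and tokens containing digits tend to be sensitive, so it would wrongly flag any sentence-initial word. This is the practical reason a large annotated corpus matters for the ML approach.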
Recently, many annotated corpora have appeared. These include the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task dataset (Tjong Kim Sang & De Meulder, 2003); the Informatics for Integrating Biology and the Bedside (i2b2) 2009 dataset (Uzuner, Solti, & Cadag, 2010) and 2014 Track 1 dataset (Stubbs, Kotfila, & Uzuner, 2015); the ShARe/CLEF eHealth Evaluation Lab 2013 dataset (Suominen et al., 2013); and the Semantic Evaluation (SemEval) 2014 Task 7 (Pradhan, Elhadad, Chapman, Manandhar, & Savova, 2015) and 2016 Task 12 datasets (Bethard et al., 2016). Such corpora have contributed to the evolution and development of text de-identification systems.