Anonymity and Pseudonymity in Data-Driven Science

Heidelinde Hobel (SBA Research, Austria), Sebastian Schrittwieser (St. Poelten University of Applied Sciences, Austria), Peter Kieseberg (SBA Research, Austria) and Edgar Weippl (Vienna University of Technology and SBA Research, Austria)
Copyright: © 2014 | Pages: 7
DOI: 10.4018/978-1-4666-5202-6.ch013

Abstract

Data mining, business intelligence, and other empirical approaches hold great potential for many fields of research, given the improvements in data processing and retrieval. When research follows an empirical methodology, the underlying data set should be available, at least for the review process, in order to prevent fraud and ensure the quality of research. However, disclosing research data raises considerable privacy concerns, because scholars are obliged to protect the privacy of their volunteers and to adhere to the privacy policies of protected data. It is therefore important to know the strengths and weaknesses of existing anonymization and pseudonymization approaches, as well as of attacks such as inference attacks. This chapter provides an overview.
Chapter Preview

Background

According to recent publications, science is becoming more data-driven (Bonneau, 2012; Chia, Yamamoto, & Asokan, 2012; Dey, Zubin, & Ross, 2012; Siersdorfer, Chelaru, Nejdl, & Pedro, 2010; West & Leskovec, 2012; Zang & Bolot, 2011). The term “Big Data” originally emerged from the IT sector, where large data samples had to be analyzed. In many publications, large data sets are used to evaluate proposed prototypes or algorithms, especially with regard to practical applicability and performance. They also serve as the research foundation for new empirical findings that can be derived by analyzing the data for general trends and characteristics. Publishing such data carries privacy risks, however, and privacy-preserving health data publishing, where sensitive patient data has to be protected from leaking to the public, offers a lesson here (Sweeney, 2002a; Sweeney, 2002b). That work showed that not only direct identifiers, such as social security numbers, contribute to the threat of a privacy breach. Quasi-identifiers (QI), e.g., the triple of ZIP code, birth date, and gender, can also lead to the identification of a person, so that private data such as a patient's diseases can be inferred for malicious purposes (Sweeney, 2002a; Sweeney, 2002b). Health care is not the only field in which huge amounts of data could lead to new findings or evidence for assumptions. For instance, Dey et al. (2012) analyzed 1,400,000 Facebook account settings in order to infer privacy trends for several personal attributes. Although they used only public accounts, their results combined with their measured and recorded data are highly sensitive and should not be published without appropriate anonymization or pseudonymization techniques. The effort of building relationships between sensitive published data and data that is public or easily accessible to attackers is denoted as data linkage (Fung, Wang, Chen, & Yu, 2010).
Altogether, because we depend on volunteers disclosing their data, we are responsible for preserving their privacy. This includes ensuring the unlinkability of sensitive data, so that data records can be published to facilitate the validation of research, collaboration between research groups, and the reader's own learning by repeating the proposed scientific experiment.
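
The data linkage described above can be illustrated with a minimal sketch. It assumes a hypothetical released data set from which direct identifiers were removed but which still carries the quasi-identifier triple, and a hypothetical public register that an attacker could obtain. All records, field names, and the helper functions `qi_key` and `link` are invented for illustration and are not taken from the chapter.

```python
from collections import Counter

# Hypothetical "de-identified" release: no names, but the quasi-identifier
# triple (ZIP, birth date, gender) is still present next to sensitive data.
released = [
    {"zip": "1040", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "1040", "birth_date": "1975-11-02", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "3100", "birth_date": "1980-03-14", "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical public register that is easily accessible to an attacker.
public = [
    {"name": "Alice Example", "zip": "1040", "birth_date": "1980-03-14", "gender": "F"},
    {"name": "Bob Example", "zip": "1040", "birth_date": "1975-11-02", "gender": "M"},
    {"name": "Carol Example", "zip": "3100", "birth_date": "1990-07-30", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def qi_key(record):
    """Return the quasi-identifier values of a record as a tuple."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)


def link(released_records, public_records):
    """Join both data sets on the quasi-identifiers.

    Only unambiguous matches are reported: the quasi-identifier combination
    must occur exactly once in the release and exactly once in the register.
    """
    release_counts = Counter(qi_key(r) for r in released_records)
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(qi_key(person), []).append(person["name"])

    matches = []
    for record in released_records:
        key = qi_key(record)
        names = names_by_key.get(key, [])
        if len(names) == 1 and release_counts[key] == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(released, public)
```

In this toy data, two of the three released records are linked to a name even though the release contains no direct identifier. The third record stays unlinked only because no register entry shares its quasi-identifier values.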

Key Terms in this Chapter

Pseudonymization: The reversible obfuscation of the record owners’ identities.

Publish or Perish: A term describing the pressure on scholars to publish new work frequently.

Privacy: The state in which an individual is able to withhold information so that it does not leak to the public.

Big Data: A buzzword for huge amounts of data.

Data-Driven Science: An empirical research method that aims to draw conclusions from huge amounts of data.

Inference Attack: The process by which sensitive information is inferred for a malicious purpose.

Anonymization: The obfuscation of the record owners’ identities.
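
The difference between the two definitions above, reversible versus non-reversible obfuscation, can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from the chapter: the function names `pseudonymize`, `reidentify`, and `anonymize`, the field name `name`, and the sample records are all hypothetical. Random pseudonyms with a separately stored lookup table are only one possible pseudonymization technique.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace each identity with a random pseudonym.

    Returns the pseudonymized records and a lookup table
    (pseudonym -> identity). Whoever holds the table can reverse the mapping,
    so it has to be stored separately from the published data.
    """
    pseudonym_of = {}   # identity -> pseudonym, keeps the mapping consistent
    lookup_table = {}   # pseudonym -> identity, the secret needed to reverse
    result = []
    for record in records:
        identity = record[id_field]
        if identity not in pseudonym_of:
            pseudonym = "P-" + secrets.token_hex(8)
            pseudonym_of[identity] = pseudonym
            lookup_table[pseudonym] = identity
        result.append({**record, id_field: pseudonym_of[identity]})
    return result, lookup_table


def reidentify(pseudonymized, lookup_table, id_field="name"):
    """Reverse the pseudonymization using the lookup table."""
    return [{**r, id_field: lookup_table[r[id_field]]} for r in pseudonymized]


def anonymize(records, id_field="name"):
    """Drop the identifier and keep no mapping back to it.

    Removing the direct identifier alone does not protect against linkage
    via quasi-identifiers that remain in the records.
    """
    return [{k: v for k, v in r.items() if k != id_field} for r in records]


# Hypothetical survey answers from volunteers.
records = [
    {"name": "Alice Example", "answer": 3},
    {"name": "Bob Example", "answer": 5},
    {"name": "Alice Example", "answer": 4},
]

pseudo, table = pseudonymize(records)
restored = reidentify(pseudo, table)
anon = anonymize(records)
```

In this sketch the pseudonymized records still show that the first and third answers belong to the same person, and the table allows the original identities to be restored. The anonymized records retain neither property.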
