Introduction
Proper anonymization, particularly for protecting highly sensitive or personal data, is a matter of great concern to many parties (Aggarwal et al., 2006; Diaz et al., 2002). This concern is amplified in modern times, when new technologies allow such information to flow seamlessly across the globe with very little delay. To compound matters, a new wave of socially enabled technologies has arisen, setting a standard for the open dissemination of user information, whether it be reviews of a recent meal, purchase details, daily thoughts and activities, demographic information, or just about anything else. Users are clearly willing to provide this information, as evidenced by the explosive growth of these services and their active user counts and utilization statistics.
Concerns for anonymity aside, publishing these sorts of data can be a fruitful endeavor. It allows information to be explored and utilized in new and interesting ways, leading to further innovations and technological or methodological breakthroughs that illuminate the true potential of the data being worked with. For organizations choosing to publish such data, however, there is a risk that information they do not want to reveal will be made public and potentially traced back to them or their users. This, in turn, can dissuade organizations from making user data available to third parties (Kelly et al., 2008).
To address this concern, several methods for anonymizing large data sets have been described in the academic literature, some rudimentary and others more nuanced. The underlying problem with each of the known approaches, however, is that there is no perfect way to assess how much useful information is being (perhaps unnecessarily) lost in the anonymization process. The result is a fundamental trade-off: as anonymity increases, the utility of the data decreases.
In this article, we offer a new architecture for deploying existing algorithmic anonymization approaches, which we refer to as intelligent data brokerages. These brokerages are intermediary services that operate between a data provider and the third parties requesting information from it. In a traditional context, users simply download aggregate raw data sets and work with them to achieve their desired outcome. With data brokerages, users instead model their goals in a predefined formal language, and the brokerage returns output relevant to their specific request, bypassing the bulk release of data altogether. The brokerage can further apply various anonymization policies to govern how results are disclosed and to control the unintended release of sensitive information. When a user's request demands a more expansive view of the data, the brokerage may employ traditional anonymization techniques to protect the privacy of data subsets prior to disclosure. In this way, intelligent data brokerages represent a general-purpose architecture for enforcing data privacy constraints, in contrast with other privacy-oriented broker models, which tend to operate within a specific domain such as digital advertising (see, for example, Narayanaswami et al., 2008; Guha et al., 2011) and which have themselves received limited attention in the existing literature.
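To make the brokerage flow concrete, the following sketch illustrates the idea of a client submitting a structured query to an intermediary rather than downloading raw data, with the broker applying a disclosure policy before releasing results. All names here (Broker, Query, the minimum group size threshold) are illustrative assumptions, not part of any specific system described in this article; the suppression rule is a simple k-anonymity-style policy standing in for the richer policies a real brokerage would support.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Query:
    """Minimal stand-in for a request expressed in the broker's formal language."""
    group_by: str            # attribute to aggregate over
    min_group_size: int = 3  # suppress groups smaller than this threshold


class Broker:
    """Intermediary between the data provider and the requesting third party."""

    def __init__(self, records):
        self.records = records  # raw records never leave the broker

    def answer(self, query):
        counts = Counter(r[query.group_by] for r in self.records)
        # Disclosure policy: release only aggregates large enough to
        # resist re-identification; smaller groups are suppressed.
        return {value: n for value, n in counts.items()
                if n >= query.min_group_size}


records = [
    {"zip": "12345", "age": 34},
    {"zip": "12345", "age": 41},
    {"zip": "12345", "age": 29},
    {"zip": "67890", "age": 52},  # a group of one: suppressed by policy
]
broker = Broker(records)
print(broker.answer(Query(group_by="zip")))  # {'12345': 3}
```

The client never sees individual records, only the policy-filtered aggregate; swapping in a different `answer` policy (e.g., generalization or noise addition) would not change the client-facing interface.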
The remainder of this paper is structured as follows: we begin with a review of current anonymity methods and motivators. We then formally introduce the idea of the intelligent data brokerage, its structure, subsystems, and capabilities. After this, we consider a sample use case for the intelligent data brokerage, as well as provide an experimental demonstration of a rudimentary prototype, before concluding with the contributions and takeaways of our work.