Graph Representation and Anonymization in Large Survey Rating Data

Xiaoxun Sun (Australian Council for Educational Research, Australia) and Min Li (University of Southern Queensland, Australia)
Copyright: © 2012 | Pages: 19
DOI: 10.4018/978-1-61350-053-8.ch014

This chapter studies the challenge of protecting the privacy of individuals in large public survey rating data. A recent study shows that individuals can be re-identified from supposedly anonymous movie rating records. Survey rating data usually contains ratings of both sensitive and non-sensitive issues, and the ratings of sensitive issues involve personal privacy. Even when survey participants do not reveal any of their ratings, their survey records are potentially identifiable using information from other public sources. None of the existing anonymisation principles can effectively prevent such breaches in large survey rating data sets. We tackle the problem by defining a principle called the (k, e)-anonymity model. Intuitively, the principle requires that, for each transaction t in the given survey rating data T, at least (k - 1) other transactions in T have ratings similar to those of t, where the similarity is controlled by e. The (k, e)-anonymity model is formulated through its graphical representation, and a specific graph-anonymisation problem is studied using graph modification techniques from graph theory. Various cases are analyzed, and methods are developed to make the modified graph meet the (k, e) requirements. The methods are applied to two real-life data sets to demonstrate their efficiency and practical utility.
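To make the (k, e) requirement concrete, the following minimal Python sketch builds a similarity graph over transactions and checks that every node has at least (k - 1) neighbours. It is an illustration under stated assumptions, not the chapter's method: the abstract does not define the similarity measure, so the sketch assumes two transactions are similar when they rate the same issues and each pair of ratings differs by at most e. The function names and the toy data are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set

# None stands for "null" (the participant did not rate the issue).
Transaction = Sequence[Optional[int]]


def similar(a: Transaction, b: Transaction, e: int) -> bool:
    """Assumed similarity test: same issues rated, ratings within e of each other."""
    for x, y in zip(a, b):
        if x is None and y is None:
            continue  # neither participant rated this issue
        if x is None or y is None:
            return False  # assumption: a rating never matches a null
        if abs(x - y) > e:
            return False
    return True


def similarity_graph(T: Sequence[Transaction], e: int) -> Dict[int, Set[int]]:
    """Nodes are transaction indices; an edge joins two similar transactions."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(T))}
    for i in range(len(T)):
        for j in range(i + 1, len(T)):
            if similar(T[i], T[j], e):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def is_k_e_anonymous(T: Sequence[Transaction], k: int, e: int) -> bool:
    """True if every transaction has at least k - 1 similar transactions in T."""
    adj = similarity_graph(T, e)
    return all(len(neighbours) >= k - 1 for neighbours in adj.values())


# Hypothetical toy data: four participants, three issues.
T = [(5, 4, None), (4, 4, None), (5, 3, None), (1, None, 2)]
print(is_k_e_anonymous(T[:3], k=3, e=1))  # first three records only
print(is_k_e_anonymous(T, k=2, e=1))      # all four records
```

The pairwise comparison is quadratic in the number of transactions, which is adequate for illustration but would likely need indexing or blocking for large data sets. A graph-modification approach would then alter the graph until every node meets the degree requirement.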
Chapter Preview


The structure of large survey rating data differs from that of relational data, since it has no fixed set of personally identifiable attributes. The lack of a clear set of such attributes makes anonymisation challenging (Ghinita et al. 2008, Xu et al. 2008, Zhou et al. 2008). In addition, survey rating data contains many attributes, each corresponding to the response to a survey question, but participants need not rate every issue (or answer every question), so many cells in a data set are empty. For instance, Figure 1(a) is a published survey rating data set containing participants' ratings on both sensitive and non-sensitive issues. The higher the rating, the more the participant prefers the issue. “null” means the participant did not rate the issue. Figure 1(b) contains comments by some survey participants on non-sensitive issues, which might be obtained from public information sources such as personal weblogs or social networks.
