Data mining has evolved from a need to make sense of the enormous amounts of data generated by organizations. But data mining comes with its own cost, including possible threats to the confidentiality and privacy of individuals. This chapter presents a background on privacy-preserving data mining (PPDM) and the related field of statistical disclosure limitation (SDL). We then focus on privacy-preserving estimation (PPE) and the need for a data-centric approach (DCA) to PPDM. The chapter concludes by presenting some possible future trends.
The maturity of information, telecommunications, storage, and database technologies has facilitated the collection, transmission, and storage of huge amounts of raw data, unimagined until a few years ago. For raw data to be utilized, they must be processed and transformed into information and knowledge that have added value, such as helping to accomplish tasks more effectively and efficiently. Data mining techniques and algorithms attempt to aid decision making by analyzing stored data to find useful patterns and to build decision-support models. These extracted patterns and models help to reduce the uncertainty in decision-making environments.
Frequently, data may have sensitive information about previously surveyed human subjects. This raises many questions about the privacy and confidentiality of individuals (Grupe, Kuechler, & Sweeney, 2002). Sometimes these concerns result in people refusing to share personal information, or worse, providing wrong data.
Many laws emphasize the importance of privacy and define the limits of legal uses of collected data. In the healthcare domain, for example, the U.S. Department of Health and Human Services (DHHS) added new standards and regulations to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) to protect “the privacy of certain individually identifiable health data” (HIPAA, 2003). Grupe et al. (2002, Exhibit 1, p. 65) listed a dozen privacy-related legislative acts issued between 1970 and 2000 in the United States.
On the other hand, these acts and concerns limit, legally or ethically, the release of datasets for legitimate research or for competitive advantage in the business domain. Statistical offices face a dilemma of legal conflict, or what can be called a “war of acts.” While they must protect the privacy of individuals in their datasets, they are also legally required to disseminate these datasets. The conflicting objectives of the Privacy Act of 1974 and the Freedom of Information Act are just one example of this dilemma (Fienberg, 1994). This dilemma has driven the development of the field of statistical disclosure limitation (SDL), also known as statistical disclosure control (SDC).
SDL methods attempt to find a balance between data utility (valid analytical results) and data security (privacy and confidentiality of individuals). In general, these methods try to either (a) limit the access to the values of sensitive attributes (mainly at the individual level), or (b) mask the values of confidential attributes in datasets while maintaining the general statistical characteristics of the datasets (such as mean, standard deviation, and covariance matrix). Data perturbation methods for microdata are one class of masking methods (Willenborg & Waal, 2001).
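As an illustration of masking by data perturbation, the following minimal sketch adds zero-mean, correlated random noise to numeric microdata. It is one simple perturbation scheme chosen for illustration, not necessarily a method this chapter advocates. The function name `mask_additive_noise`, the noise level `d`, and the synthetic two-attribute dataset are all assumptions introduced here. Under this scheme the means and correlations are approximately preserved, while the covariance matrix is inflated by a known factor of roughly (1 + d).

```python
import numpy as np

def mask_additive_noise(X, d=0.1, seed=0):
    """Return X plus zero-mean noise with covariance d * (sample covariance of X).

    Hypothetical helper for illustration; `d` controls the noise level.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sigma = np.cov(X, rowvar=False)  # sample covariance of the original data
    noise = rng.multivariate_normal(
        mean=np.zeros(X.shape[1]), cov=d * sigma, size=X.shape[0]
    )
    return X + noise

# Synthetic stand-in for two correlated confidential attributes (assumed data).
rng = np.random.default_rng(42)
original = rng.multivariate_normal(
    mean=[50.0, 30.0], cov=[[100.0, 40.0], [40.0, 64.0]], size=5000
)
masked = mask_additive_noise(original, d=0.1)

# Compare the summary statistics that SDL methods aim to maintain.
print("means (original):", original.mean(axis=0))
print("means (masked):  ", masked.mean(axis=0))
print("corr (original): ", np.corrcoef(original, rowvar=False)[0, 1])
print("corr (masked):   ", np.corrcoef(masked, rowvar=False)[0, 1])
```

Individual values in `masked` differ from those in `original`, so no record is released exactly, yet aggregate analyses remain close. The choice of `d` reflects the utility-security balance described above: larger values give more protection at the cost of less accurate results.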
Key Terms in this Chapter
Privacy: Privacy is the desire of individuals to control their personal information. Generally, in the SDL literature, it relates to the identity of an individual, while confidentiality relates to specific information about the individual (such as salary).
Statistical Disclosure Limitation (SDL) or Statistical Disclosure Control (SDC): A set of methods that attempt to protect privacy and confidentiality of data, while preserving the overall statistical characteristics of original datasets (such as mean and covariance matrix) in the protected dataset.
Data Mining Technique: The main purpose or objective of the data mining modelling process. Each technique can be implemented using different DM algorithms.
Data Mining Algorithm: A systematic, practical method to implement a data mining technique. Different algorithms can be used to implement the same data mining technique. For example, decision tree algorithms (CART, C4.5, C5, etc.) and logistic regression are among the algorithms of the classification data mining technique.
Confidentiality: The status accorded to specific attributes (such as salary) in datasets, whose original values should not be revealed. Generally, some type of protection such as masking must be provided before these confidential attributes are disseminated.
Data-Centric Approach (DCA): The concept that data protection techniques must be independent of (standard) DM algorithms. That is, the masked data must be analyzable using multiple DM algorithms while providing results comparable to the results from analyzing the original data.