Data mining has evolved from the need to make sense of the enormous amounts of data generated by organizations. But data mining comes at a cost, including possible threats to the confidentiality and privacy of individuals. This chapter presents a background on privacy-preserving data mining (PPDM) and the related field of statistical disclosure limitation (SDL). We then focus on privacy-preserving estimation (PPE) and the need for a data-centric approach (DCA) to PPDM. The chapter concludes by presenting some possible future trends.
The maturity of information, telecommunications, storage, and database technologies has facilitated the collection, transmission, and storage of huge amounts of raw data, on a scale unimagined until a few years ago. For raw data to be useful, they must be processed and transformed into information and knowledge with added value, such as helping to accomplish tasks more effectively and efficiently. Data mining techniques and algorithms aid decision making by analyzing stored data to find useful patterns and to build decision-support models. These extracted patterns and models help reduce uncertainty in decision-making environments.
Frequently, datasets contain sensitive information about previously surveyed human subjects. This raises many questions about the privacy and confidentiality of individuals (Grupe, Kuechler, & Sweeney, 2002). Sometimes these concerns lead people to refuse to share personal information or, worse, to provide false data.
Many laws emphasize the importance of privacy and define the limits of legal uses of collected data. In the healthcare domain, for example, the U.S. Department of Health and Human Services (DHHS) added new standards and regulations to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) to protect “the privacy of certain individually identifiable health data” (HIPAA, 2003). Grupe et al. (2002, Exhibit 1, p. 65) listed a dozen privacy-related legislative acts issued between 1970 and 2000 in the United States.
On the other hand, these acts and concerns limit, legally and ethically, the release of datasets for legitimate research or for obtaining competitive advantage in the business domain. Statistical offices face a dilemma of legal conflict, or what might be called a “war of acts”: while they must protect the privacy of individuals in their datasets, they are also legally required to disseminate those datasets. The conflicting objectives of the Privacy Act of 1974 and the Freedom of Information Act are just one example of this dilemma (Fienberg, 1994). This conflict has driven the evolution of the field of statistical disclosure limitation (SDL), also known as statistical disclosure control (SDC).
SDL methods attempt to find a balance between data utility (valid analytical results) and data security (privacy and confidentiality of individuals). In general, these methods try to either (a) limit the access to the values of sensitive attributes (mainly at the individual level), or (b) mask the values of confidential attributes in datasets while maintaining the general statistical characteristics of the datasets (such as mean, standard deviation, and covariance matrix). Data perturbation methods for microdata are one class of masking methods (Willenborg & Waal, 2001).
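To make the masking idea above concrete, the following is a minimal sketch of one simple perturbation scheme, additive noise masking, applied to a hypothetical confidential attribute. The attribute name, parameter values, and data are illustrative assumptions, not taken from the chapter or from any specific SDL method it cites: each confidential value is masked with independent zero-mean Gaussian noise, so the sample mean is preserved in expectation while the variance is inflated by the known noise variance, which a data user could correct for if the noise scale is published.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical microdata: a confidential numeric attribute (e.g., income)
# for 10,000 surveyed individuals. Values are synthetic for illustration.
income = rng.normal(loc=50_000, scale=12_000, size=10_000)

# Additive noise perturbation: mask each value with independent,
# zero-mean Gaussian noise of known scale sigma. Individual values are
# obscured, but the mean is preserved in expectation, and the masked
# variance is approximately the original variance plus sigma**2.
sigma = 5_000
masked = income + rng.normal(loc=0.0, scale=sigma, size=income.shape)

print(f"original mean: {income.mean():.0f}")
print(f"masked mean:   {masked.mean():.0f}")
print(f"original std:  {income.std():.0f}")
print(f"masked std:    {masked.std():.0f}")
print(f"predicted std: {np.hypot(income.std(), sigma):.0f}")
```

Simple additive noise is only one end of the utility/security trade-off: it preserves means but distorts variances and covariances unless the noise parameters are disclosed, which is why more elaborate perturbation and matrix-masking schemes exist in the SDL literature.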