Privacy-preserving data mining (PPDM) refers to the area of data mining that seeks to safeguard sensitive information from unsolicited or unsanctioned disclosure. Most traditional data mining techniques analyze and model the data set statistically, in aggregated form, whereas privacy preservation is primarily concerned with protecting individual data records from disclosure. This separation of concerns suggests that PPDM is technically feasible. Historically, issues related to PPDM were first studied by national statistical agencies interested in collecting private social and economic data, such as census and tax records, and making it available for analysis by public servants, companies, and researchers. Building accurate socioeconomic models is vital for business planning and public policy. Yet there is no way of knowing in advance what models may be needed, nor is it feasible for a statistical agency to perform all data processing for everyone, playing the role of a trusted third party. Instead, the agency provides the data in a sanitized form that permits statistical processing while protecting the privacy of individual records, solving a problem known as privacy-preserving data publishing. For a survey of work in statistical databases, see Adam and Wortmann (1989) and Willenborg and de Waal (2001).
Survey of Approaches
The naïve approach to PPDM is “security by obscurity,” where algorithms have no proven privacy guarantees. By its nature, a privacy guarantee must hold for all data sets and all attacks within a certain class; such a claim cannot be established by examples or informal arguments (Chawla, Dwork, McSherry, Smith, & Wee, 2005). We will not discuss this approach further. Recently, however, a number of principled approaches have been developed to enable PPDM, some of which are listed below according to their method of defining and enforcing privacy.
Key Terms in this Chapter
Statistical Database: A database with social or economic data used for aggregate statistical analysis rather than for the retrieval of individual records.
Secure Multiparty Computation: A method for two or more parties to perform a joint computation without revealing their inputs.
Randomization: The act of randomly perturbing data, for example by adding noise, before disclosure.
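Randomization can be sketched as follows; this is a minimal illustration of adding independent uniform noise to numeric records, and the noise range, seed, and function names are illustrative assumptions rather than a scheme prescribed by this chapter:

```python
import random

def randomize(values, r=5.0, seed=None):
    # Perturb each value with independent uniform noise drawn from [-r, r].
    # Aggregate properties of the data can still be estimated statistically
    # from the noisy values, while each individual value is masked.
    # The range r and the seed are illustrative parameters.
    rng = random.Random(seed)
    return [v + rng.uniform(-r, r) for v in values]

ages = [23, 35, 41, 58]
noisy_ages = randomize(ages, r=5.0, seed=0)  # each noisy age lies within 5 of the original
```

In practice the noise distribution is published along with the perturbed data, so that analysts can correct aggregate estimates for the added noise.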
Pseudorandom Generator: An algorithm that takes a short random seed and outputs a long sequence of bits that appears independently random under a certain class of tests.
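As a sketch of how a short seed can be stretched into a long pseudorandom string, here is a hash-based counter construction; the chapter does not prescribe any particular construction, and this one's pseudorandomness rests on the assumption that the hash behaves like a random function:

```python
import hashlib

def prg(seed: bytes, nbytes: int) -> bytes:
    # Stretch a short seed into nbytes of output by hashing
    # seed || counter for successive counter values and concatenating
    # the digests. Deterministic: the same seed yields the same output.
    out = b""
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:nbytes]

stream = prg(b"short seed", 1024)  # 1024 pseudorandom-looking bytes
```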
De-Identification: Altering a data set to limit identity linkage.
Data Mining: The process of automatically searching large volumes of data for patterns.
Suppression: Withholding information due to disclosure constraints.
Privacy: The right of individuals or groups to control the flow of sensitive information about themselves.
Summarization: Transforming a data set into a short summary that permits statistical analysis while discarding the individual records.
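Summarization can be sketched as releasing only aggregate statistics computed from the records; the field name "salary" and the particular statistics chosen here are illustrative assumptions, not part of the chapter:

```python
from statistics import mean

def summarize(records):
    # Release only aggregates (count and mean); the individual records
    # are discarded and cannot be recovered from the summary alone.
    # Note that statistics tied to single records (e.g., the exact
    # minimum or maximum) would weaken the privacy of this summary.
    salaries = [r["salary"] for r in records]
    return {"count": len(salaries), "mean_salary": mean(salaries)}

summary = summarize([{"salary": 50000}, {"salary": 70000}, {"salary": 60000}])
```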