Privacy Preserving Principal Component Analysis Clustering for Distributed Heterogeneous Gene Expression Datasets


Xin Li (Georgetown University Medical Center, USA)
DOI: 10.4018/jcmam.2011100102


In this paper, we present approaches for performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate in identifying gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy preserving PCA-based gene clustering, which includes two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). The C-PCA approach requires local sites to transmit a sample of the original data to the TCS and can be applied to any heterogeneous datasets. The R-PCA approach requires that all local sites have the same or a similar number of columns, but it releases no original data. Experiments on five independent genomic datasets show that both the C-PCA and R-PCA approaches maintain very good accuracy compared with the centralized scenario.


In recent years, new bioinformatics technologies, such as gene expression microarrays, genome-wide association studies, proteomics, and metabolomics, have been widely used to identify a huge number of human genomic/genetic biomarkers simultaneously. They generate very large amounts of data and have dramatically increased knowledge of human genomic/genetic information, significantly improving biomedical research. However, these advances come with a drawback: the increasingly rich human genomic/genetic data contains sensitive private information, such as genetic markers and diseases, which may in turn lead to the discovery of an individual's race, family, or even identity. Privacy is therefore an important issue when dealing with bioinformatics data, and the problem is exacerbated when multiple data providers try to collaborate with each other.

Gene Expression and DNA Microarray

A DNA microarray (Wikipedia, 2010), which is the technology that realizes gene expression measurement in practice (, 2010), is a multiplex technology used in molecular biology. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles (10⁻¹² moles) of a specific DNA sequence, known as a probe (or reporter). A probe can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA sample (called the target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detecting fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target. Since an array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel, and arrays have therefore dramatically accelerated many types of investigation. The microarray data processing pipeline (Hackl, Sanchez Cabo, Sturn, Wolkenhauer, & Trajanoski, 2004) includes a variety of statistical steps: pre-processing (including background correction, normalization, and summarization); differential analysis, which comprises raw p-value computation and false discovery rate (FDR) correction; and gene clustering/profiling analysis. Figure 1(a) shows the microarray experiment process in the lab, and Figure 1(b) illustrates its gene clustering result.
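As an illustration of the FDR correction step in the pipeline above, the following is a minimal sketch that assumes the Benjamini-Hochberg procedure. The text does not say which FDR method is used, so this choice, the function name `benjamini_hochberg`, and the example p-values are all assumptions made for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as a common example of FDR
    correction; the source does not name a specific method.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p sorted ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its rank and all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)  # restore original order
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

Genes whose adjusted p-value falls below a chosen threshold (for example 0.05) would then be passed on to the clustering step.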

Figure 1.

(a) DNA microarray experiment lab processing flow; (b) the gene clustering result (heatmap) of a microarray experiment


Gene Clustering on Collaborative Datasets on Vertical Partitions

Because a single research group or institution has limited technical resources, researchers often need to combine multiple gene expression datasets from different research labs, groups, or institutions, and to conduct meta-analyses (Griffith et al., 2006; Lu, 2009; Ramasamy et al., 2008) or pooled studies (Szelinger et al., 2011; Wei et al., 2009), which give them a picture of gene profiling from a global perspective.

Another situation occurs when multiple labs, groups, and/or institutes perform different treatments on samples, seeking to formulate the gene profile for identical groups or genes from a global perspective. Several meta-analysis methods (Fishel et al., 2007; Yang et al., 2007) have been developed to handle such analysis, specifically when the same set of genes is studied under different treatments at multiple sites. Successful cross-experiment gene clustering analyses have been completed on a variety of cancers (lung, breast, liver, etc.) to enhance the pathway (Dawany et al., 2010; Shen et al., 2010) or to build global gene profiles (Bianchi et al., 2007; Tseng et al., 2009). This situation results in a vertical partitioning scenario, where multiple datasets have the same rows but different columns.
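The vertical partitioning scenario can be pictured with a small sketch: each site holds the same genes (rows) measured on its own samples (columns). The code below is only an illustration of that data layout and of the centralized PCA baseline; it is not the C-PCA or R-PCA method. The function names, matrix sizes, and random data are hypothetical.

```python
import numpy as np


def combine_vertical_partitions(partitions):
    """Stack site matrices column-wise; all sites must share the same rows."""
    n_rows = partitions[0].shape[0]
    if any(part.shape[0] != n_rows for part in partitions):
        raise ValueError("vertical partitions must have the same number of rows")
    return np.hstack(partitions)


def pca_scores(matrix, n_components=2):
    """Project rows (genes) onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)  # center each column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Hypothetical data: 6 genes measured at two sites with 3 and 4 samples.
rng = np.random.default_rng(0)
site_a = rng.normal(size=(6, 3))
site_b = rng.normal(size=(6, 4))

# Centralized scenario: all columns are pooled before PCA.
global_matrix = combine_vertical_partitions([site_a, site_b])
gene_scores = pca_scores(global_matrix)  # one low-dimensional point per gene
```

In the centralized scenario the pooled matrix is visible to whoever runs the analysis, which is exactly the exposure the privacy preserving approaches aim to avoid.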
