The rise of microarrays has heralded a new era in molecular biology, with their ability to measure the expression levels of thousands of genes at once. Their usefulness, however, is contingent upon the ability to obtain accurate expression data for the individual genes (Bowtell, 1999; Brown & Botstein, 1999; Cheung, Morley, Aguilar, Massimi, Kucherlapati, & Childs, 1999), and there has been significant criticism as to how meaningful the information derived from microarrays is. In studies that attempted to find genes correlated with types of cancer or with survival rate, different analysis techniques often yielded radically different sets of genes, calling into question the validity of the experiments themselves (Dupuy & Simon, 2007). It is our contention that part of the problem is the lack of a coherent method for dealing with data quality, and that many of the criticisms of microarrays could be addressed if such a method existed.
Key Terms in this Chapter
Natural P-Value: The p-value threshold a researcher should set when determining statistical significance. It depends wholly on the number of samples in a given trial. In a microarray, the natural p-value should therefore be set to 1/N, where N is the number of samples.
Locally Weighted Normalization of a Scatter Plot (LOESS/LOWESS): LOESS fits low-order polynomials to local neighborhoods of a scatter plot so that the resulting curve describes the overall variation in the plot. This curve is used to normalize for the nonlinearities found in two-state experiments.
Signal to Noise Ratio (SNR): In the context of microarrays, the noise comes from two sources, technical and biological. SNR is the primary determinant of how many replicates are required, but the estimate is complicated by the fact that different probes have different SNRs.
Position Dependent Nearest Neighbor (PDNN) Model: A normalization technique by Zhang, Miles, and Aldape (2003), which assumes that the signal intensity depends on both the probe sequence being used and the number of mRNA copies. It performs the normalization by calculating the number of mRNA copies and an expected signal intensity, optimizing parameters such as base stacking energy.
Significance Analysis of Microarrays (SAM): A selection algorithm that is very similar to the t-test. It is, however, more robust to mRNA signals of lower SNR and hence gives more reliable filtering for genes with low expression levels.
Biologically Informative: Describes a set of genes that is consistent over multiple experiments, replicates, and microarray platforms and that reflects the underlying ground truth. This is a stronger requirement than statistical significance, which can hold within a single experiment without being reproduced elsewhere.
Statistically Significant: The property that the variability of a sample can be attributed to a factor other than random noise. This depends first on the overall distribution of the samples, though most researchers assume that the random variations are Gaussian. Because of systematic factors such as dye binding affinities and the nonlinear binding behavior in microarrays, normalization is required before this Gaussian assumption can be used.
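To make the LOESS/LOWESS entry above more concrete, the following is a minimal sketch of a locally weighted linear fit used to remove an intensity-dependent trend from two-channel log ratios. It is not the chapter's implementation. The names `loess_fit`, `loess_normalize`, `a_values`, `m_values`, and `frac` are hypothetical, the tricube weighting and local degree of 1 are assumptions, and the robustness iterations found in full LOWESS implementations are omitted.

```python
import math


def tricube(u):
    """Tricube kernel: weight falls smoothly to 0 at scaled distance 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_fit(xs, ys, frac=0.5):
    """Return a locally weighted linear fit evaluated at every x in xs.

    frac is the (assumed) fraction of points that defines each local
    neighborhood; it is a tuning parameter, not a value from the chapter.
    """
    n = len(xs)
    k = max(2, min(n, int(math.ceil(frac * n))))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest neighbor.
        dists = sorted(abs(x - x0) for x in xs)
        h = max(dists[k - 1], 1e-12)
        sw = sdx = sdxx = swy = sdxy = 0.0
        for x, y in zip(xs, ys):
            w = tricube((x - x0) / h)
            dx = x - x0  # centering makes the intercept the fitted value
            sw += w
            sdx += w * dx
            sdxx += w * dx * dx
            swy += w * y
            sdxy += w * dx * y
        denom = sw * sdxx - sdx * sdx
        if abs(denom) < 1e-12:
            # Degenerate neighborhood: fall back to the weighted mean.
            fitted.append(swy / sw)
        else:
            fitted.append((swy * sdxx - sdx * sdxy) / denom)
    return fitted


def loess_normalize(a_values, m_values, frac=0.5):
    """Subtract the fitted trend so log ratios are centered across intensity.

    a_values: average log intensity per probe (assumed MA-plot convention)
    m_values: log ratio between the two states per probe
    """
    trend = loess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]
```

Calling `loess_normalize(a_values, m_values)` returns the log ratios with the fitted trend removed. A smaller `frac` follows local structure more closely, at the cost of fitting noise.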
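The relationship between SAM and the t-test noted in the SAM entry above can be sketched as follows. The SAM-style statistic adds a small positive constant to the denominator of the two-sample t-statistic, so that genes with very low variance do not receive inflated scores. This is an illustrative sketch only. The function names and the fixed value of `s0` are assumptions. In the published SAM procedure `s0` is estimated from the data across all genes, and significance is assessed by permutation, neither of which is shown here.

```python
import math
from statistics import mean, variance


def pooled_standard_error(group_a, group_b):
    """Pooled standard error of the difference between two group means."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))


def t_statistic(group_a, group_b):
    """Ordinary two-sample t-statistic with pooled variance."""
    return (mean(group_a) - mean(group_b)) / pooled_standard_error(group_a, group_b)


def sam_statistic(group_a, group_b, s0):
    """SAM-style relative difference: a t-statistic whose denominator is
    offset by s0 (hypothetical fixed value here) to damp low-variance genes."""
    return (mean(group_a) - mean(group_b)) / (
        pooled_standard_error(group_a, group_b) + s0
    )


# Hypothetical low-expression gene: tiny difference, tiny variance.
state_1 = [0.10, 0.11, 0.10]
state_2 = [0.12, 0.13, 0.12]
t_score = t_statistic(state_1, state_2)
d_score = sam_statistic(state_1, state_2, s0=0.05)
```

For the hypothetical gene above, the magnitude of `d_score` is smaller than that of `t_score`: the offset keeps a small absolute difference from ranking highly merely because its measured variance is small.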