Data Mining in Genome Wide Association Studies

Tom Burr (Los Alamos National Laboratory, USA)
Copyright: © 2009 | Pages: 7
DOI: 10.4018/978-1-60566-010-3.ch073


The genetic basis of some human diseases, in which one or a few genome regions increase the probability of acquiring the disease, is fairly well understood. For example, the risk for cystic fibrosis is linked to particular genomic regions. Identifying the genetic basis of more common diseases such as diabetes has proven more difficult, because many genome regions are apparently involved, and genetic effects are thought to depend in unknown ways on other factors, called covariates, such as diet and other environmental factors (Goldstein and Cavalleri, 2005). Genome-wide association studies (GWAS) aim to discover the genetic basis for a given disease. The main goal in a GWAS is to identify genetic variants, single nucleotide polymorphisms (SNPs) in particular, that show association with the phenotype, such as “disease present” or “disease absent,” either because they are causal or, more likely, because they are statistically correlated with an unobserved causal variant (Goldstein and Cavalleri, 2005). A GWAS can analyze “by DNA site” or “by multiple DNA sites.” In either case, data mining tools (Tachmazidou, Verzilli, and De Iorio, 2007) are proving to be quite useful for understanding the genetic causes of common diseases.
Chapter Preview


A GWAS involves genotyping many cases (typically 1000 or more) and controls (also 1000 or more) at a large number (10^4 to 10^6) of markers throughout the genome. These markers are usually SNPs. A SNP occurs at a DNA site if more than one nucleotide (A, C, T, or G) is found at that site within the population of interest, which includes the cases (who have the disease being studied) and the controls (who do not). For example, suppose the sequenced DNA fragment from subject 1 is AAGCCTA and from subject 2 is AAGCTTA. These fragments differ in a single nucleotide, at the fifth position. In this case there are two alleles (variant forms of the DNA at that site), C and T. Almost all common SNPs have only two alleles, often with one allele being rare and the other common.
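As a minimal sketch of the two-subject example above, the following Python snippet compares two aligned fragments position by position and reports where they differ. The function name `differing_sites` is hypothetical, and the sketch assumes the fragments are already aligned and of equal length; a real SNP call would rest on many subjects, not two.

```python
def differing_sites(seq_a, seq_b):
    """Return (position, allele_a, allele_b) for each 1-based position
    where two aligned, equal-length fragments differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("fragments must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# The two fragments from the text (subject 1 and subject 2).
snps = differing_sites("AAGCCTA", "AAGCTTA")
print(snps)  # [(5, 'C', 'T')]
```

The single difference at position 5 corresponds to the two alleles C and T described in the text.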

Assume that measuring the DNA at millions of sites for thousands of individuals is feasible. The resulting measurements for n1 cases and n2 controls are partially listed below, with the sites given arbitrary numeric labels. Note that DNA site 3 is a candidate for an association, with T being the most prevalent state for cases and G being the most prevalent state for controls.

Sites: 123 456 789 ...

  • Case 1: AAT CTA TAT ...

  • Case 2: A*T CTC TAT ...

  • Case n1: AAT CTG TAT ...

  • Control 1: AAG CTA TTA ...

  • Control 2: AAG CTA TTA ...

  • Control n2: AAG CTA TTA ...

Site 6 is also a candidate for an association, with state A among the controls and considerable variation among the cases. The * character (in case 2) can denote missing data, an alignment character due to a deletion mutation or an insertion mutation, among other possibilities (Toivonen et al., 2000).
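A “by DNA site” scan of the listing above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses just the six rows shown (three cases, three controls), treats `*` as missing data, and ranks sites by a hand-computed Pearson chi-square statistic on the allele-count table. The choice of statistic and the names `allele_counts` and `chi_square` are assumptions made for this sketch, not something the chapter prescribes, and with so few rows the values serve only as a ranking.

```python
from collections import Counter

MISSING = "*"  # assumption: '*' is treated as missing data and skipped

# The six rows shown in the listing, with the spaces removed.
cases = ["AATCTATAT", "A*TCTCTAT", "AATCTGTAT"]
controls = ["AAGCTATTA", "AAGCTATTA", "AAGCTATTA"]


def allele_counts(rows, site):
    """Count the alleles seen at a 0-based site, ignoring missing values."""
    return Counter(row[site] for row in rows if row[site] != MISSING)


def chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for a 2 x k table of allele counts."""
    alleles = set(case_counts) | set(control_counts)
    n_case = sum(case_counts.values())
    n_ctrl = sum(control_counts.values())
    total = n_case + n_ctrl
    if len(alleles) < 2 or n_case == 0 or n_ctrl == 0:
        return 0.0  # no variation at this site, or one group is empty
    stat = 0.0
    for allele in alleles:
        col_total = case_counts[allele] + control_counts[allele]
        for observed, row_total in (
            (case_counts[allele], n_case),
            (control_counts[allele], n_ctrl),
        ):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Score every site, using 1-based labels as in the listing.
scores = {
    site + 1: chi_square(allele_counts(cases, site), allele_counts(controls, site))
    for site in range(len(cases[0]))
}

# Sites ordered from strongest to weakest apparent association.
ranked = sorted(scores, key=lambda s: -scores[s])
for site in ranked:
    print(site, round(scores[site], 2))
```

On these six rows, site 3 receives the largest score and site 6 a smaller nonzero one, matching the candidates picked out by eye. Sites 8 and 9 also separate cases from controls in the rows as listed, so the scan flags them as well.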

In this example, the eye can detect such association candidates “by DNA site.” However, suppose the collection of sites were larger and all n1 cases and n2 controls were listed, or that the analysis were “by haplotype.” In principle, the haplotype (one “half” of the genome of a paired-chromosome species such as humans) is the set of all DNA sites in the entire genome. In practice, haplotype refers to the sequenced sites, such as the SNPs in a haplotype map (HapMap, 2005), which are the focus here. Both a large “by DNA site” analysis and a haplotype analysis, which considers the joint behavior of multiple DNA sites, are beyond the eye’s capability.
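To make the “joint behavior of multiple DNA sites” concrete, the sketch below tabulates the combined states of a chosen set of sites separately for cases and controls. It assumes each listed row can be treated as a single phased haplotype and that rows with a `*` at any chosen site are dropped. Both are simplifications introduced here, and `haplotype_counts` is a hypothetical helper name.

```python
from collections import Counter

MISSING = "*"  # assumption: rows with '*' at a chosen site are dropped

# The six rows shown in the listing, with the spaces removed.
cases = ["AATCTATAT", "A*TCTCTAT", "AATCTGTAT"]
controls = ["AAGCTATTA", "AAGCTATTA", "AAGCTATTA"]


def haplotype_counts(rows, sites):
    """Count the joint states observed at the given 1-based sites."""
    counts = Counter()
    for row in rows:
        states = "".join(row[s - 1] for s in sites)
        if MISSING not in states:
            counts[states] += 1
    return counts


# Joint states at sites 3 and 6, the two candidates noted in the text.
window = (3, 6)
case_haps = haplotype_counts(cases, window)
control_haps = haplotype_counts(controls, window)
print("cases:   ", dict(case_haps))
print("controls:", dict(control_haps))
```

Comparing the two tables shows which multi-site patterns are shared and which are specific to one group. With millions of sites the number of possible site subsets grows far too quickly to enumerate this way, which is where data mining methods come in.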
