Introduction
Sequencing technology has come a long way since the time when many labs around the world had to cooperate for years, using traditional sequencing techniques, to sequence the human genome for the first time. The traditional Sanger-based sequencing methods, developed in the mid-1970s, were the workhorse technology for DNA sequencing for almost 30 years (Sanger & Coulson, 1975; Sanger et al., 1977).
Nowadays, next-generation sequencing technologies have reduced the task of sequencing a whole genome to a matter of days, or even hours, and the cost has decreased by orders of magnitude, making it an experimental procedure accessible to many labs (ten Bosch & Grody, 2008). This has opened the door for re-sequencing to become a more routine procedure, as it finds many applications in the detection of genetic variability among individuals. It can help us understand the extent of that variability, identify specific variants, alternative splicing sites and patterns, and epigenetic effects, and relate them to gene regulation and expression, as well as to diseases (1000 Genomes, 2011; Wu & Nacu, 2010; Xiang et al., 2010; Ng et al., 2010). As a result, DNA sequencing is quickly becoming a powerful tool in diagnostic medicine and, eventually, personalised treatment (ten Bosch & Grody, 2008).
The data resulting from a single sequencing experiment can be massive, and it is not uncommon to have data from multiple experiments. This trend of increasing availability of sequencing data will continue as projects even more ambitious than the 1000 Genomes Project (1000 Genomes, 2011) start to materialize. According to their respective websites, the typical output of the three main next-generation sequencing platforms – 454/Roche, ABI SOLiD, and Illumina GA – is millions of reads ranging in size from 25bp to 400bp. In most cases, these reads are too short to be directly assembled, especially in the presence of repetitive regions (Miller et al., 2010), so a reference sequence is usually required.
Mapping so many short reads onto a reference sequence is a very challenging task that cannot be adequately carried out by traditional search and alignment algorithms (Kent, 2002) like BLAST (Altschul et al., 1990) and FASTA (Pearson & Lipman, 1988). A broad array of programmes (Jiang & Wong, 2008; Li et al., 2009; Langmead et al., 2009; Li & Durbin, 2009; Frousios et al., 2010) has therefore been published to address this task, each placing emphasis on different aspects of the challenge. The different algorithms implement various combinations of innovations and trade-offs to address computing speed, system resource requirements, and the biological relevance and accuracy of the computed results.
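To make the mapping task concrete, the following is a minimal sketch of a naive read mapper. It is not the method of any of the programmes cited above. The function names (`hamming_distance`, `map_read`), the `max_mismatches` parameter, and the toy sequences are assumptions introduced purely for illustration, and only substitutions (no insertions or deletions) are considered.

```python
from typing import List


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference: str, read: str, max_mismatches: int = 1) -> List[int]:
    """Return every 0-based position of the reference at which the read
    aligns with at most `max_mismatches` substitutions (no gaps).

    Hypothetical helper for illustration: it slides the read along the
    whole reference and compares it against every window.
    """
    m = len(read)
    hits = []
    for pos in range(len(reference) - m + 1):
        window = reference[pos:pos + m]
        if hamming_distance(window, read) <= max_mismatches:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # Toy data, invented for this example.
    reference = "ACGTACGTTAGC"
    reads = ["ACGT", "TTAG", "ACGA"]
    for read in reads:
        print(read, map_read(reference, read))
```

Each read costs time proportional to the reference length multiplied by the read length under this scheme, which quickly becomes impractical for millions of reads against a whole genome. That cost is one reason the specialised programmes rely on trade-offs between speed, memory, and accuracy.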
Unlike vertebrates, whose DNA is linear, most strains and species of bacteria have circular chromosomes or plasmids. Until the late 1980s, when the technology for examining chromosomes and plasmids improved, all bacteria were thought to have a single circular chromosome (Colem & Saint-Girons, 1999). In fact, not all bacteria have a single circular chromosome; some bacteria have multiple circular chromosomes (Suwanto & Kaplan, 1989a, 1989b, 1992a, 1992b), and many bacteria have linear chromosomes and linear plasmids (Volff & Altenbuchner, 2000). Bacterial genomes range in size from about 160,000bp to 12,200,000bp, depending on the type considered (Nakabachi et al., 2006).