Bioinformatics Clouds for High-Throughput Technologies

Claudia Cava, Francesca Gallivanone, Christian Salvatore, Pasquale Anthony Della Rosa, Isabella Castiglioni
DOI: 10.4018/978-1-4666-5864-6.ch020


Bioinformatics traditionally deals with computational approaches to the analysis of big data from high-throughput technologies such as genomics, proteomics, and sequencing. Bioinformatics analysis allows the extraction of new information from big data that can help to better assess biological details at the molecular and cellular level. The wide scale and high dimensionality of Bioinformatics data have led to an increasing need for high-performance computing and storage. In this chapter, the authors demonstrate the advantages of cloud computing in Bioinformatics research for high-throughput technologies.
Chapter Preview


High-throughput technologies produce an enormous amount of data, coming from the use of gene expression microarrays (Schena et al., 1995; Lipshutz et al., 1995), proteomics (Mann et al., 1999), and DNA sequencing (Lander et al., 2001; Venter et al., 2001).

Laboratories submit and archive their data to big archival databases such as GenBank at the National Center for Biotechnology Information (NCBI) (Benson et al., 2005), the European Bioinformatics Institute EMBL database (Brooksbank et al., 2010), the DNA Data Bank of Japan (DDBJ) (Sugawara et al., 2010), the Short Read Archive (SRA) (Shumway et al., 2010), the Gene Expression Omnibus (GEO) (Barrett et al., 2009), and the microarray database ArrayExpress (Kapushesky et al., 2010). These databases maintain, organize, and distribute big data to the scientific community for Bioinformatics analysis. For instance, the public data repository GEO contains hundreds of thousands of microarray samples and supports billions of analyses. Thus, in current practice, Bioinformatics researchers download data from these databases and run analyses on in-house computing resources.

With significant advances in high-throughput technologies, and consequently the exponential growth of biological data, Bioinformatics faces difficulties in storing and analyzing these immense volumes of data. In particular, the gap between high-throughput experimental technologies and the computing capabilities needed to deal with such big data is widening.

At present, a promising solution for obtaining the necessary power and scale of computation is cloud computing, which exploits the full potential of multiple computers and delivers analysis and storage as dynamically allocated virtual resources via the Internet.

The present chapter deals with cloud-based services and presents their advantages (and, in some cases, disadvantages) for big data storage and analysis in Bioinformatics, namely data sharing, applications, and time-critical calculations:

  • Data Sharing and Security: Public datasets change frequently and dynamically, causing problems in both archiving and sharing data over the long term. Data repositories often disappear from the public domain (e.g., due to cancellation policies for limited space), leaving users able to perform only partial analyses. Cloud computing can provide permanent resources where big datasets are archived and easily accessed without necessarily being copied to other computing resources.

  • Bioinformatics Applications: Public datasets may be analyzed with standard Bioinformatics tools, such as Significance Analysis of Microarrays (SAM) (Tusher et al., 2001), TM4 MultiExperiment Viewer (Saeed et al., 2006), GenePattern (Reich et al., 2006), and Bioconductor (Gentleman et al., 2004). In many cases these tools require local installation, with the associated problems of maintenance and updates. Cloud computing avoids these problems.

  • Time-Critical Calculations and Scalability: Complex tasks that require intensive data management are critical workloads on clouds. Two frameworks, MapReduce and the Hadoop Distributed File System (HDFS) (Taylor et al., 2010), make it possible to perform time-critical calculations through parallelized analysis.
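The MapReduce pattern behind such parallelized analysis can be sketched in a few lines of Python. This is a toy illustration, not Hadoop itself: the gene labels, the chunking scheme, and the three-process pool are illustrative assumptions; Hadoop applies the same split/map/merge idea at cluster scale, with HDFS distributing the data chunks across nodes.

```python
# Minimal sketch of the MapReduce pattern (hypothetical data, not Hadoop):
# each mapper emits per-gene counts from one chunk of reads, and the
# reducer merges the partial counts into a global tally.
from collections import Counter
from multiprocessing import Pool

def mapper(reads):
    """Map step: count gene hits within one chunk of reads."""
    return Counter(read["gene"] for read in reads)

def reducer(partial_counts):
    """Reduce step: merge per-chunk counters into a global tally."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    # Toy dataset: sequencing reads already annotated with a gene label.
    reads = [{"gene": g} for g in
             ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"]]
    chunks = [reads[i::3] for i in range(3)]   # split the data
    with Pool(3) as pool:
        partials = pool.map(mapper, chunks)    # parallel map step
    totals = reducer(partials)                 # merge step
    print(totals["BRCA1"])  # 3
```

Because the map step touches each chunk independently, the same code scales by adding workers; on a real cluster, data locality (moving computation to the data, as HDFS does) replaces the shared-memory pool used here.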

In particular, cloud computing services in Bioinformatics belong to four major categories:

Key Terms in this Chapter

Microarray: A hybridization technique in which a nucleic acid sample (target) is hybridized to a very large set of oligonucleotide probes attached to a solid support. It is used to determine sequences, to detect variations in a gene sequence, or to measure the expression levels of large numbers of genes simultaneously.

Sequence Alignment: A process of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
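As a concrete illustration of local alignment, the following is a minimal sketch of the classic Smith-Waterman dynamic-programming algorithm; the scoring values and the example sequences are illustrative assumptions, not a prescription from the chapter.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # local: score never goes negative
                          H[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                          H[i - 1][j] + gap,    # gap in sequence b
                          H[i][j - 1] + gap)    # gap in sequence a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "TACGTA"))  # 8 (the shared "ACGT" run, 4 matches * 2)
```

The key difference from global alignment is the floor of 0 in each cell, which lets an alignment restart anywhere and thus picks out the best-matching local region rather than forcing the full sequences to align end to end.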

Data Sharing: The practice of making the data used in one's research available to others through a variety of mechanisms.

Genome-Wide Association (GWA): An approach that involves rapidly scanning markers across the complete sets of DNA of many people in order to find genetic variations that occur more frequently in people with a particular disease.

Basic Local Alignment Search Tool (BLAST): An algorithm to find regions of local similarity between sequences. The algorithm compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

Next Generation Sequencing (NGS): Also known as high-throughput sequencing, a family of technologies that allows DNA and RNA to be sequenced much more quickly than previous sequencing methods.

Single-Nucleotide Polymorphism (SNP): A DNA sequence variation occurring when a single nucleotide in the genome differs between members of a biological species or between paired chromosomes in an individual.

High-Throughput Technologies: The generic name for technologies that allow accurate and simultaneous examination of thousands of genes, proteins, and metabolites.

Protein Folding: The process by which a protein assumes its functional shape or conformation. To carry out their functions, proteins must fold into complex three-dimensional structures.
