Gene Set- and Pathway- Centered Knowledge Discovery Assigns Transcriptional Activation Patterns in Brain, Blood, and Colon Cancer: A Bioinformatics Perspective

Lilit Nersisyan (Institute of Molecular Biology, National Academy of Sciences and College of Science and Engineering, American University of Armenia, Yerevan, Armenia), Henry Löffler-Wirth (Interdisciplinary Centre for Bioinformatics (IZBI), Leipzig University, Leipzig, Germany), Arsen Arakelyan (Institute of Molecular Biology, National Academy of Sciences and College of Science and Engineering, American University of Armenia, Yerevan, Armenia) and Hans Binder (Interdisciplinary Centre for Bioinformatics (IZBI), Leipzig University, Leipzig, Germany)
Copyright: © 2014 | Pages: 24
DOI: 10.4018/IJKDB.2014070104

Abstract

Genome-wide ‘omics’ assays provide a comprehensive view of the molecular landscapes of healthy and diseased cells. Bioinformatics traditionally pursues a ‘gene-centered’ view by extracting lists of genes differentially expressed or methylated between healthy and diseased states. Biological knowledge mining is then performed by applying gene set techniques using libraries of functional gene sets obtained from independent studies. This analysis strategy neglects two facts: (i) that different disease states can be characterized by a series of functional modules of co-regulated genes and (ii) that the topology of the underlying regulatory networks can induce complex expression patterns that require analysis methods beyond traditional gene set techniques. The authors here provide a knowledge discovery method that overcomes these shortcomings. It combines machine learning using self-organizing maps with pathway flow analysis. It extracts and visualizes regulatory modes from molecular omics data, maps them onto selected pathways and estimates the impact of pathway-activity changes. The authors illustrate the performance of the gene set and pathway signal flow methods using expression data from oncogenic pathway activation experiments and patient data on glioma, B-cell lymphoma and colorectal cancer.

1. Introduction

Cancer research has rapidly embraced high-throughput technologies, including microarray and next-generation sequencing platforms. The result has been an explosion in the volume of biological data collected during the course of biomedical research. These data might provide insights into the biology of cancer, with exciting prospects for tailoring prevention, diagnosis, and treatment to the molecular characteristics of a patient’s disease. Extraction of this particular information from the data is generally subsumed under the term knowledge discovery (Holzinger, Dehmer, & Jurisica, 2014). More concretely, high-throughput data processing requires a series of sequential steps: (i) data collection and selection, (ii) data preprocessing and ‘cleansing’ of different kinds of artefacts, (iii) sorting and classifying the data, (iv) mining the data for biological information, including its visualization, and (v) evaluating and interpreting the results. Within this sequence of steps the contribution of biological knowledge and expertise progressively increases. This biological knowledge must be provided in a form that can be linked with the data using informatics, and statistics are needed to evaluate its relevance.

In this publication we focus on step (iv), which can be understood as knowledge discovery in a narrower sense, namely as the extraction of useful knowledge from a collection of data. In the analysis of omics data in cancer, it represents the basic step that links biological information with the transformed, i.e. sorted and classified, data. More concretely, we describe two techniques (Figure 1). The first is gene set analysis (GSA), which has been well established for many years (Subramanian et al., 2005). It analyzes lists of signature genes obtained from step (iii) (i.e. sorting and classifying the data) in terms of previous knowledge, which is taken from a database of genes with assigned biological function. The second, the ‘pathway signal flow’ (PSF) method, makes use of the interactions between the genes, which are not explicitly taken into account in GSA-based methods (Arakelyan, 2013). These interactions are processed as sequential ‘information flows’ through pathways formed by the genes involved. Together, the two techniques enable knowledge discovery in collections of signature genes (GSA) and in predefined pathways (PSF). As original data, we use gene expression data from different patient studies on cancer and from cell line experiments on oncogenic mechanisms. As the data transformation technique in step (iii), to sort and classify the data, we apply machine learning using self-organizing maps (SOM) (Kohonen, 1982), which has proven to be a very efficient method for discovering the intrinsic multidimensional structure of complex and massive data (H Binder, Hopp, Lembcke, & Wirth, 2015; H Binder & Wirth, 2014). After a short description of the methodical background, the major part of the paper is devoted to the application of these methods to different cancer-related data sets. Step (v) (interpretation and evaluation) is touched upon only by way of example, to illustrate the final step of data analysis.
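To make the idea of gene set analysis more tangible, the following minimal Python sketch tests whether a list of signature genes overlaps a functional gene set more than expected by chance. It uses a one-sided hypergeometric (overrepresentation) test, which is one common form of GSA; it is an assumption made here for illustration and not necessarily the statistic applied by the authors. The function names and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_set, n_signature, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_signature genes are drawn without replacement from a universe of
    n_universe genes, n_set of which belong to the functional gene set.
    """
    max_overlap = min(n_set, n_signature)
    total = comb(n_universe, n_signature)
    # Sum the probabilities of all overlaps at least as large as observed.
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_signature - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrich(signature, gene_set, universe):
    """Overrepresentation of gene_set among signature genes (illustrative)."""
    universe = set(universe)
    # Restrict both lists to genes that were actually measured.
    signature = set(signature) & universe
    gene_set = set(gene_set) & universe
    overlap = signature & gene_set
    p_value = hypergeom_enrichment_p(
        len(universe), len(gene_set), len(signature), len(overlap)
    )
    return len(overlap), p_value


# Toy data (hypothetical gene identifiers, not taken from the article).
universe = [f"g{i}" for i in range(20)]
gene_set = ["g0", "g1", "g2", "g3", "g4"]
signature = ["g0", "g1", "g2", "g10"]

n_overlap, p_value = enrich(signature, gene_set, universe)
print(n_overlap, p_value)
```

In a realistic setting the universe would be all genes measured on the platform, the signature list would come from step (iii), and the p-values would need correction for testing many gene sets.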

Figure 1. Knowledge mining in large-scale omics data – an overview
