Kernel Generative Topographic Mapping of Protein Sequences

M.I. Cardenas (Universitat Politècnica de Catalunya, Spain), A. Vellido (Universitat Politècnica de Catalunya, Spain), I. Olier (The University of Manchester, UK), X. Rovira (Institut de Neurociències, Universitat Autònoma de Barcelona, Spain) and J. Giraldo (Institut de Neurociències, Universitat Autònoma de Barcelona, Spain)
DOI: 10.4018/978-1-4666-1803-9.ch013


The world of pharmacology is becoming increasingly dependent on advances in the fields of genomics and proteomics. From an intelligent data analysis perspective, the -omics sciences bring about the challenge of how to deal with the large amounts of complex data they generate. In this chapter, the authors focus on the analysis of a specific type of proteins, the G protein-coupled receptors, which are the target for over 15% of current drugs. They describe a kernel method of the manifold learning family for the analysis of symbolic amino acid sequences of proteins. This method sheds light on the structure of protein subfamilies, while providing an intuitive visualization of that structure.


It has been just over 10 years since the publication of the first draft of the human genome. The detailed description of the human genome is a milestone for science in general and for medicine in particular. It has opened the doors to new approaches to the investigation of pathologies, approaches that hold the promise of truly personalized medicine. Through these doors, though, a new challenge for intelligent data analysis has also entered.

Over the last decade, medicine has become a data-intensive area of research, one in which new data-acquisition technologies and a wider variety of investigative goals coalesce to make it one of the most important challenges for intelligent data analysis (Lisboa et al., 2004). The -omics sciences have contributed the most to this data deluge, stemming from microarrays in genomics, protein chips and tissue arrays in proteomics, etc. As very explicitly reported in (Kahn, 2011): “[...] the need to process terabytes of information has become de rigueur for many labs engaged in genomic research.”

Arguably, drug research has contributed more to the progress of medicine during the past century than any other scientific factor (Drews, 2000). One of the main areas of drug research is related to the analysis of proteins. The function of proteins depends directly on their 3D structure, which is encoded in their amino acid sequence. Such 3D structure is difficult to unravel, though. Alternatively, protein sequences, which are easy to acquire, can be the direct object of analysis. The analysis of the gene-family distribution of targets by drug substance reveals that more than 50% of drugs target only four key gene families, of which almost 30% correspond to the G protein-coupled receptor (GPCR) family. This family regulates the function of most cells in living organisms and is the focus of the work reported in this chapter. The grouping of GPCRs into types and subtypes based on sequence analysis may contribute significantly to drug design and to a better understanding of the molecular processes involved in receptor signaling in both normal and pathological conditions.

The challenge of managing the complexity of these types of data invites us to go one step further than traditional statistics and resort to intelligent pattern recognition approaches. In particular, statistical pattern recognition and machine learning methods have the potential both to scale well to large databases and to deal with non-trivial types of data. Sound statistical principles are essential if we are to trust the evidence base built with any computational analysis of medical data (Lisboa, 2002). Statistical machine learning methods are already establishing themselves in the more general field of bioinformatics (Baldi, 2001).

This work is specifically motivated by the need to define a robust probabilistic method for grouping and visualizing symbolic protein sequences. As mentioned in (Schölkopf, Tsuda & Vert, 2004), there is no biologically relevant manner of representing the symbolic sequences describing proteins using real-valued vectors. This does not preclude the possibility of assessing the similarity between such sequences. Kernel methods can be used for this purpose if kernels are understood as similarity measures.
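To make the idea of a kernel as a similarity measure between symbolic sequences concrete, the following minimal sketch computes a k-mer (spectrum) kernel between amino acid strings. This is an illustrative assumption only: it is not necessarily the kernel used in this chapter, and the function names, the choice of k, and the toy sequences are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared k-mers contribute.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalized_spectrum_kernel(seq_a, seq_b, k=3):
    """Cosine-normalized kernel: 1.0 for identical k-mer profiles, 0.0 for disjoint ones."""
    denom = sqrt(spectrum_kernel(seq_a, seq_a, k) * spectrum_kernel(seq_b, seq_b, k))
    return spectrum_kernel(seq_a, seq_b, k) / denom if denom > 0 else 0.0


# Toy amino acid strings, invented for illustration (not real GPCR sequences).
toy_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSWPLVNCC"]

# Gram matrix of pairwise similarities; no real-valued encoding of the sequences is needed.
gram = [
    [normalized_spectrum_kernel(a, b, k=2) for b in toy_sequences]
    for a in toy_sequences
]
```

In this sketch the first two toy sequences, which differ by a single residue, receive a high similarity, while the third, which shares no 2-mers with them, receives a similarity of zero. A kernel-based model only needs such a matrix of pairwise similarities, not a vector representation of each sequence.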

In the following sections, we report our work on the grouping and visualization of GPCR protein sequences using a kernel variant of a nonlinear model of the manifold learning family. A suitable kernel for this type of data is described. The visualization of the sequence data and of the grouping results can be a useful tool in the quest for interpretability, and the reported results support this claim.
