Coupled Matrix Factorization with Sparse Factors to Identify Potential Biomarkers in Metabolomics

Evrim Acar (Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark), Gozde Gurdeniz (Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark), Morten A. Rasmussen (Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark), Daniela Rago (Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark), Lars O. Dragsted (Department of Human Nutrition, Faculty of Science, University of Copenhagen, Copenhagen, Denmark) and Rasmus Bro (Department of Food Science, Faculty of Science, University of Copenhagen, Copenhagen, Denmark)
Copyright: © 2012 | Pages: 22
DOI: 10.4018/jkdb.2012070102

Abstract

Metabolomics focuses on the detection of chemical substances in biological fluids such as urine and blood using a number of analytical techniques including Nuclear Magnetic Resonance (NMR) spectroscopy and Liquid Chromatography-Mass Spectrometry (LC-MS). Among the major challenges in the analysis of metabolomics data are (i) joint analysis of data from multiple platforms, and (ii) capturing easily interpretable underlying patterns, which could be further utilized for biomarker discovery. To address these challenges, the authors formulate joint analysis of data from multiple platforms as a coupled matrix factorization problem with sparsity penalties on the factor matrices. They develop an all-at-once optimization algorithm, called CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization), which is a gradient-based optimization approach solving for all factor matrices simultaneously. Using numerical experiments on simulated data, the authors demonstrate that CMF-SPOPT can capture the underlying sparse patterns in data. Furthermore, on a real data set of blood samples collected from a group of rats, the authors use the proposed approach to jointly analyze metabolomics data sets and identify potential biomarkers for apple intake. Advantages and limitations of the proposed approach are also discussed using illustrative examples on metabolomics data sets.

Introduction

With the ability to collect massive amounts of data as a result of technological advances, we are commonly faced with data sets from multiple sources. For instance, metabolomics studies focus on the detection of a wide range of chemical substances in biological fluids such as urine and plasma using a number of analytical techniques including Liquid Chromatography-Mass Spectrometry (LC-MS) and Nuclear Magnetic Resonance (NMR) spectroscopy. NMR, for example, is a highly reproducible technique and powerful in terms of quantification. LC-MS, on the other hand, allows the detection of many more chemical substances in biological fluids, but with lower reproducibility. These techniques often generate data sets that are complementary to each other (Richards et al., 2010). Data from these complementary methods, when analyzed together, may enable us to capture a larger proportion of the complete metabolome belonging to a specific biological system. However, there is currently a significant gap between data collection and knowledge extraction: although we can collect a vast amount of relational data from multiple sources, we still cannot analyze these data sets in a way that shows the overall picture of a specific problem of interest, e.g., exposure to a specific diet.

To address this challenge, data fusion methods have been developed in various fields focusing on specific problems of interest, e.g., missing link prediction in recommender systems (Ma et al., 2008), and clustering/community detection in social network analysis (Banerjee et al., 2007; Lin et al., 2009). Data fusion has also been studied in metabolomics, mostly with the goal of capturing the underlying patterns in data (Smilde et al., 2003) and using the extracted patterns for prediction of a specific condition (Doeswijk et al., 2011); see Richards et al. (2010) for a comprehensive review of data fusion in omics.

Matrix factorizations are common tools in data fusion studies across different fields. An effective way of jointly analyzing data from multiple sources is to represent the data from each source as a matrix, so that the sources together form a collection of matrices. This collection can then be jointly analyzed using collective matrix factorization methods (Long et al., 2006; Singh & Gordon, 2008).
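As a rough illustration of what jointly factorizing a collection of matrices can look like, the sketch below couples two matrices that share their row (sample) mode. The symbols X, Y, A, B, C, the function names, and the use of alternating least squares are all assumptions made for this example; they are not taken from the cited methods and are not the algorithm proposed in this paper.

```python
import numpy as np

def coupled_loss(X, Y, A, B, C):
    """Squared error of X ~ A B^T plus Y ~ A C^T, with A shared by both."""
    return np.linalg.norm(X - A @ B.T) ** 2 + np.linalg.norm(Y - A @ C.T) ** 2

def coupled_als(X, Y, rank, n_iter=50, seed=0):
    """Hypothetical alternating least-squares fit of the coupled model."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[0], rank))
    for _ in range(n_iter):
        # With A fixed, B and C are two independent least-squares problems.
        B = np.linalg.lstsq(A, X, rcond=None)[0].T
        C = np.linalg.lstsq(A, Y, rcond=None)[0].T
        # With B and C fixed, A must explain both matrices at the same time.
        Z = np.vstack([B, C])    # (J + K) x rank
        XY = np.hstack([X, Y])   # I x (J + K)
        A = np.linalg.lstsq(Z, XY.T, rcond=None)[0].T
    return A, B, C

# Toy usage: two synthetic "platforms" measured on the same 20 samples.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((20, 3))
X = A_true @ rng.standard_normal((30, 3)).T
Y = A_true @ rng.standard_normal((15, 3)).T
A, B, C = coupled_als(X, Y, rank=3)
print(coupled_loss(X, Y, A, B, C))
```

Without further constraints, the factors recovered this way are generally dense and are only determined up to an invertible transformation, which is part of why unconstrained factors are hard to interpret.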

Nevertheless, the applicability of available data fusion techniques is limited when the goal is to identify a small number of variables, e.g., a few metabolites as potential biomarkers. Matrix factorization methods, without specific constraints on the factors, would reveal dense patterns, which are difficult to interpret. Therefore, motivated by applications in metabolomics, in this paper we formulate data fusion as a coupled matrix factorization model with penalties that enforce sparsity on the factors in order to capture sparse patterns. Our contributions in this paper can be summarized as follows:

  • Formulating a coupled matrix factorization model with penalties to impose sparsity on factor matrices;

  • Developing a gradient-based optimization algorithm for solving the smooth approximation of the coupled matrix factorization problem with sparsity penalties, which we call CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization);

  • Demonstrating the effectiveness of CMF-SPOPT in terms of capturing the underlying sparse patterns in data using simulations;

  • Assessing the sensitivity of the proposed approach to different penalty parameters;

  • Identifying potential apple biomarkers based on joint analysis of metabolomics data sets collected on blood samples of a group of rats.
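The first two contributions can be sketched in code under stated assumptions. This preview does not give the exact objective, so the following are illustrative choices rather than the authors' formulation: the smooth surrogate sqrt(x^2 + eps) for the absolute value, a ridge term on the shared factor A (weight alpha) to remove the scaling ambiguity between A and the sparse factors, and SciPy's L-BFGS-B as the gradient-based solver. The function names and the parameters lam_B, lam_C, alpha, and eps are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(v, I, J, K, R):
    """Split one flat parameter vector into A (I x R), B (J x R), C (K x R)."""
    A = v[: I * R].reshape(I, R)
    B = v[I * R : (I + J) * R].reshape(J, R)
    C = v[(I + J) * R :].reshape(K, R)
    return A, B, C

def objective(v, X, Y, R, lam_B, lam_C, alpha=1.0, eps=1e-8):
    """Value and gradient of the smoothed, sparsity-penalized coupled loss."""
    I, J = X.shape
    K = Y.shape[1]
    A, B, C = unpack(v, I, J, K, R)
    RX = A @ B.T - X
    RY = A @ C.T - Y
    sB = np.sqrt(B ** 2 + eps)  # smooth stand-in for |B|, entrywise (assumed)
    sC = np.sqrt(C ** 2 + eps)  # smooth stand-in for |C|, entrywise (assumed)
    f = ((RX ** 2).sum() + (RY ** 2).sum()
         + lam_B * sB.sum() + lam_C * sC.sum()
         + alpha * (A ** 2).sum())  # ridge on A is an assumption of this sketch
    gA = 2 * (RX @ B + RY @ C) + 2 * alpha * A
    gB = 2 * RX.T @ A + lam_B * B / sB
    gC = 2 * RY.T @ A + lam_C * C / sC
    return f, np.concatenate([gA.ravel(), gB.ravel(), gC.ravel()])

def fit(X, Y, R, lam_B, lam_C, seed=0):
    """All-at-once fit: every factor matrix is updated in the same step."""
    I, J = X.shape
    K = Y.shape[1]
    v0 = np.random.default_rng(seed).standard_normal((I + J + K) * R)
    res = minimize(objective, v0, args=(X, Y, R, lam_B, lam_C),
                   jac=True, method="L-BFGS-B")
    return unpack(res.x, I, J, K, R), res.fun
```

Because lam_B and lam_C are separate arguments, the same sketch allows a different sparsity level for each data set. With a smooth surrogate, small entries shrink toward zero but typically do not become exactly zero, so some thresholding rule would be needed to read off a sparse pattern.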

This is an extended version of our previous study (Acar et al., 2012), where we imposed the same level of sparsity on coupled data sets. In this paper, we also demonstrate that the proposed approach extends to different levels of sparsity in coupled data sets and can accurately capture the underlying sparse factors using different sparsity penalties for different data sets. Furthermore, through illustrative examples on real metabolomics data sets, we demonstrate the strengths and weaknesses of CMF-SPOPT.
