With the ability to collect massive amounts of data as a result of technological advances, we are commonly faced with data sets from multiple sources. For instance, metabolomics studies focus on the detection of a wide range of chemical substances in biological fluids such as urine and plasma using a number of analytical techniques, including Liquid Chromatography-Mass Spectrometry (LC-MS) and Nuclear Magnetic Resonance (NMR) Spectroscopy. NMR, for example, is a highly reproducible technique and is powerful in terms of quantification. LC-MS, on the other hand, allows the detection of many more chemical substances in biological fluids, but with lower reproducibility. These techniques often generate data sets that are complementary to each other (Richards et al., 2010). Data from these complementary methods, when analyzed together, may enable us to capture a larger proportion of the complete metabolome belonging to a specific biological system. However, there is currently a significant gap between data collection and knowledge extraction: although we are able to collect a vast amount of relational data from multiple sources, we still cannot analyze these data sets in a way that shows the overall picture of a specific problem of interest, e.g., exposure to a specific diet.
To address this challenge, data fusion methods have been developed in various fields focusing on specific problems of interest, e.g., missing link prediction in recommender systems (Ma et al., 2008), and clustering/community detection in social network analysis (Banerjee et al., 2007; Lin et al., 2009). Data fusion has also been studied in metabolomics, mostly with the goal of capturing the underlying patterns in data (Smilde et al., 2003) and using the extracted patterns for prediction of a specific condition (Doeswijk et al., 2011); see Richards et al. (2010) for a comprehensive review of data fusion in omics.
Matrix factorizations are common tools in data fusion studies across different fields. An effective way of jointly analyzing data from multiple sources is to represent the data from each source as a matrix, so that the sources together form a collection of matrices. This collection can then be jointly analyzed using collective matrix factorization methods (Long et al., 2006; Singh & Gordon, 2008).
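As a rough illustration of joint factorization, the sketch below couples two matrices through a shared factor. It assumes, purely for illustration, that the two data sets X and Y describe the same samples along their rows (e.g., one matrix per analytical technique), so that X ≈ A Bᵀ and Y ≈ A Cᵀ with a common A. The function name `coupled_mf_als` and the use of alternating least squares are choices made for this example only; they are not the methods of the cited works, nor the gradient-based algorithm proposed in this paper.

```python
import numpy as np

def coupled_mf_als(X, Y, rank, n_iter=50, seed=0):
    """Hypothetical helper: fit X ~ A @ B.T and Y ~ A @ C.T with a shared A.

    Uses plain alternating least squares with no constraints or penalties.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[1], rank))
    C = rng.standard_normal((Y.shape[1], rank))
    M = np.hstack([X, Y])  # the shared rows let us stack the two data sets
    for _ in range(n_iter):
        Z = np.vstack([B, C])
        # Update A: least-squares solution of M ~ A @ Z.T
        A = np.linalg.lstsq(Z, M.T, rcond=None)[0].T
        # Update B and C separately, each against its own data set
        B = np.linalg.lstsq(A, X, rcond=None)[0].T
        C = np.linalg.lstsq(A, Y, rcond=None)[0].T
    return A, B, C

# Synthetic example: two noise-free matrices generated from a common factor.
rng = np.random.default_rng(42)
A_true = rng.standard_normal((20, 3))
X = A_true @ rng.standard_normal((15, 3)).T
Y = A_true @ rng.standard_normal((12, 3)).T

A, B, C = coupled_mf_als(X, Y, rank=3)
err_x = np.linalg.norm(X - A @ B.T) / np.linalg.norm(X)
err_y = np.linalg.norm(Y - A @ C.T) / np.linalg.norm(Y)
print("relative errors:", err_x, err_y)
```

Without further constraints, the factors returned by such a scheme are generally dense and are determined only up to an invertible transformation.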
Nevertheless, the applicability of available data fusion techniques is limited when the goal is to identify a small number of variables, e.g., a few metabolites as potential biomarkers. Matrix factorization methods, without specific constraints on the factors, reveal dense patterns, which are difficult to interpret. Therefore, motivated by applications in metabolomics, in this paper we formulate data fusion as a coupled matrix factorization model with penalties that enforce sparsity on the factors in order to capture sparse patterns. Our contributions in this paper can be summarized as follows:
Formulating a coupled matrix factorization model with penalties to impose sparsity on factor matrices;
Developing a gradient-based optimization algorithm for solving the smooth approximation of the coupled matrix factorization problem with sparsity penalties, which we call CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization);
Demonstrating the effectiveness of CMF-SPOPT in terms of capturing the underlying sparse patterns in data using simulations;
Assessing the sensitivity of the proposed approach to different penalty parameters;
Identifying potential apple biomarkers based on joint analysis of metabolomics data sets collected on blood samples of a group of rats.
This is an extended version of our previous study (Acar et al., 2012), where we imposed the same level of sparsity on coupled data sets. In this paper, we also demonstrate that the proposed approach extends to different levels of sparsity in coupled data sets and can accurately capture the underlying sparse factors using different sparsity penalties for different data sets. Furthermore, through illustrative examples on real metabolomics data sets, we demonstrate the strengths and weaknesses of CMF-SPOPT.
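To make the idea of a smooth, sparsity-penalized coupled objective more concrete, the following minimal sketch may help. This section does not spell out the exact objective, constraints, or optimizer of CMF-SPOPT, so everything below is an assumption made for illustration: the penalty |b| is replaced by the differentiable surrogate sqrt(b² + eps), the two data sets share the factor A, and each data set receives its own penalty parameter (`lam_b`, `lam_c`). All function and variable names are hypothetical.

```python
import numpy as np

def smooth_l1(M, eps=1e-8):
    """Differentiable surrogate for the entrywise 1-norm of M."""
    return np.sum(np.sqrt(M**2 + eps))

def smooth_l1_grad(M, eps=1e-8):
    """Gradient of smooth_l1 with respect to M."""
    return M / np.sqrt(M**2 + eps)

def objective_and_grads(X, Y, A, B, C, lam_b, lam_c, eps=1e-8):
    """Assumed objective: squared fit of X ~ A B^T and Y ~ A C^T
    plus separate smoothed sparsity penalties on B and C."""
    Rx = A @ B.T - X  # residual for the first data set
    Ry = A @ C.T - Y  # residual for the second data set
    f = (0.5 * np.sum(Rx**2) + 0.5 * np.sum(Ry**2)
         + lam_b * smooth_l1(B, eps) + lam_c * smooth_l1(C, eps))
    gA = Rx @ B + Ry @ C
    gB = Rx.T @ A + lam_b * smooth_l1_grad(B, eps)
    gC = Ry.T @ A + lam_c * smooth_l1_grad(C, eps)
    return f, gA, gB, gC

# Small random problem, used only to check the analytic gradient.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((6, 5)), rng.standard_normal((6, 4))
A = rng.standard_normal((6, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((4, 2))
lam_b, lam_c, eps = 0.5, 2.0, 1e-3

f, gA, gB, gC = objective_and_grads(X, Y, A, B, C, lam_b, lam_c, eps)

# Central finite differences over every entry of B.
h = 1e-6
num_gB = np.zeros_like(B)
for idx in np.ndindex(*B.shape):
    Bp, Bm = B.copy(), B.copy()
    Bp[idx] += h
    Bm[idx] -= h
    fp = objective_and_grads(X, Y, A, Bp, C, lam_b, lam_c, eps)[0]
    fm = objective_and_grads(X, Y, A, Bm, C, lam_b, lam_c, eps)[0]
    num_gB[idx] = (fp - fm) / (2 * h)

max_fd_error = np.max(np.abs(num_gB - gB))
print("max finite-difference error:", max_fd_error)
```

Because the surrogate is differentiable everywhere, the function value and gradients could be passed to any general-purpose first-order optimizer. Choosing `lam_b` and `lam_c` independently is one way to allow different levels of sparsity in the two data sets.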