A Latent Feature Model Approach to Biclustering

José Caldas (Aalto University, Espoo, Finland) and Samuel Kaski (Aalto University, Espoo, Finland)
Copyright: © 2016 | Pages: 18
DOI: 10.4018/IJKDB.2016070102

Abstract

Biclustering is the unsupervised learning task of mining a data matrix for useful submatrices, for instance groups of genes that are co-expressed under particular biological conditions. As these submatrices are expected to partly overlap, a significant challenge in biclustering is to develop methods that are able to detect overlapping biclusters. The authors propose a probabilistic mixture modelling framework for biclustering biological data that lends itself to various data types and allows biclusters to overlap. Their framework is akin to the latent feature and mixture-of-experts model families, with inference and parameter estimation being performed via a variational expectation-maximization algorithm. The model compares favorably with competing approaches on both a binary DNA copy number variation data set and a miRNA expression data set, indicating that it may potentially be used as a general problem-solving tool in biclustering.

Introduction

Clustering methods have been useful for understanding global trends in high-throughput molecular biology data (D’Haeseleer, 2005). Due to the increasing number of phenotypes that are probed for in individual studies, as well as the intrinsic complexity and high dimensionality of genome-wide high-throughput data, it has become increasingly meaningful to detect local, rather than global, data trends. In practice, this amounts to grouping subsets of biological conditions and associating with each group the measurements that make those biological conditions similar. For instance, in a gene expression study where the messenger RNA (mRNA) level of each gene is measured in a number of samples corresponding to certain biological conditions (e.g. tumor samples from multiple disease stages), the analyst may be interested in detecting groups of samples that are similar and associating with each group of samples the genes whose measurements are similar across those samples. This unsupervised learning task is known as biclustering (Cheng & Church, 2000; Madeira & Oliveira, 2004).

Formally, biclustering is an unsupervised learning task that takes as input a data matrix D and learns a set of submatrices of D, which are designated as biclusters. Each submatrix/bicluster should have certain desirable properties; the specific desiderata depend on the problem formulation and the data type of the input matrix D. For instance, in a sparse binary matrix, the analyst’s intention may be to detect biclusters that correspond to dense submatrices; alternatively, in a continuous data set with values spanning a given range, the modeller’s intention may be to detect biclusters that correspond to submatrices in which the values are typically close to each other according to a given metric (e.g. Euclidean). Throughout the present paper, we use the terms object and condition to designate the rows and columns of a data matrix, respectively. For instance, in a gene expression matrix such as the one described above, the objects are genes and the conditions are the biological samples.
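
To make the two example desiderata concrete, the following minimal Python sketch represents a bicluster as a pair of row (object) and column (condition) index lists and scores the induced submatrix. The function names `submatrix`, `density`, and `value_range`, the toy matrices, and the choice of the value range as a closeness score are illustrative assumptions, not part of the proposed model.

```python
def submatrix(D, rows, cols):
    """Return the submatrix of D (a list of lists) induced by the
    given object (row) and condition (column) indices."""
    return [[D[i][j] for j in cols] for i in rows]


def density(D, rows, cols):
    """Fraction of ones in a binary submatrix (hypothetical score for
    the 'dense submatrix' desideratum)."""
    values = [v for row in submatrix(D, rows, cols) for v in row]
    return sum(values) / len(values)


def value_range(D, rows, cols):
    """Largest absolute difference between any two entries of the
    submatrix (hypothetical closeness score for continuous data)."""
    values = [v for row in submatrix(D, rows, cols) for v in row]
    return max(values) - min(values)


# Toy sparse binary matrix: 4 objects (rows) by 4 conditions (columns).
D_binary = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# Toy continuous matrix with one tightly grouped block in the top-left corner.
D_cont = [
    [2.0, 2.5, 9.0],
    [2.0, 3.0, 0.5],
    [7.0, 1.0, 4.0],
]

# Candidate bicluster: objects {0, 1} under conditions {0, 1}.
rows, cols = [0, 1], [0, 1]

dense_score = density(D_binary, rows, cols)            # density inside the bicluster
global_score = density(D_binary, range(4), range(4))   # density of the whole matrix
tight_score = value_range(D_cont, rows, cols)          # spread inside the bicluster
```

Under these assumptions, a candidate bicluster is interesting in the binary case when its density is well above that of the whole matrix, and in the continuous case when its value range is small relative to that of the whole matrix.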

Biclustering methods may be classified according to the type of bicluster structures they can detect (Madeira & Oliveira, 2004). Crucially, some but not all methods allow biclusters to overlap, that is, they allow an object-condition pair to belong to more than one bicluster. Allowing membership in multiple biclusters is justified, for instance, by the multiple functional roles that a gene may undertake or the various biological processes that are simultaneously active in a biological condition. However, it brings additional technical challenges regarding how to properly specify a model that handles bicluster overlap. A typical approach in the context of expression data is to specify a linear model wherein each bicluster corresponds to a given set of parameters; in this model family, bicluster parameters combine additively, i.e., the parameters used for modelling a given data point are obtained by summing the parameters across all biclusters that include the corresponding object-condition pair. A well-known member of this model family is the plaid model (Lazzeroni & Owen, 2002). More general frameworks combine parameter additivity with link functions (e.g., the sigmoid function) in order to model discrete data types (Meeds et al., 2007). Such parameter interaction paradigms have two main drawbacks. First, the specific interaction assumptions are restrictive and often may not hold. Second, the necessity to introduce parameter interaction assumptions, e.g., additive combination of bicluster parameters, in addition to specifying how each bicluster models the data assigned to it, may lead the practitioner to postulate artificial assumptions solely for the purpose of maintaining model soundness. We propose an alternative mixture-modelling approach, leading to more straightforward solutions, in which each object-condition pair may belong to several biclusters, as long as those biclusters provide equally good models for the corresponding data points.
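
The additive paradigm described above can be illustrated with a short Python sketch. It is a deliberate simplification: each bicluster carries a single scalar parameter, whereas the plaid model also includes object- and condition-specific effects, and all names (`additive_parameter`, `sigmoid`, `bernoulli_mean`, the toy `biclusters` list) are hypothetical. It illustrates the existing additive approach only, not the mixture model proposed here.

```python
import math


def additive_parameter(i, j, biclusters, background=0.0):
    """Sum the scalar parameters of all biclusters that contain the
    object-condition pair (i, j), on top of a background level."""
    total = background
    for rows, cols, theta in biclusters:
        if i in rows and j in cols:
            total += theta
    return total


def sigmoid(x):
    """Logistic link mapping a real-valued parameter to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_mean(i, j, biclusters, background=0.0):
    """Probability of observing a 1 at (i, j) when the summed
    parameters are passed through a sigmoid link (binary data)."""
    return sigmoid(additive_parameter(i, j, biclusters, background))


# Two overlapping toy biclusters: (object set, condition set, parameter).
biclusters = [
    ({0, 1}, {0, 1}, 2.0),
    ({1, 2}, {1, 2}, 3.0),
]

only_first = additive_parameter(0, 0, biclusters)   # covered by the first bicluster only
overlap = additive_parameter(1, 1, biclusters)      # covered by both: parameters add up
outside = additive_parameter(2, 0, biclusters)      # covered by neither: background only
```

In the overlap region the effective parameter is forced to be the sum of the two bicluster parameters, whether or not that sum describes the data well; this is the kind of interaction assumption the paragraph above identifies as restrictive.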
