TopBiological Data And Their Characteristics
Before embarking on a long tour across different machine learning methods, it is useful to look at some popular gene expression data sets and their characteristics.
First of all, a typical gene expression data set contains a matrix of real numbers. Let and be the number of its rows and columns, respectively. Then is represented as
where
represents the value in the
th row and
th column.
As there are thousands of gene expressions and only a few dozens of samples, (the number of genes) is of order 1,000-10,000 while (the number of biological samples) is somewhere between 10 and 100. Such a condition makes the application of many traditional statistical methods impossible as those techniques were developed under the assumption that . You may ask: what’s a problem?
The problem is in an underdetermined system where there are only a few equations versus many more unknown variables (Kohane, Kho, & Butte, 2003). Hence, the solution of such a system is not unique. In other words, multiple solutions exist. By translating this into the biological language of the applied problem treated in this book, this means that multiple subsets of genes may be equally relevant to cancer classification (Ein-Dor, Kela, Getz, Givol, & Domany, 2005), (Díaz-Uriarte & Alvarez de Andrés, 2006). However, in order to reduce a chance for noisy and/or irrelevant genes to be included into one of such subsets, one needs to eliminate irrelevant genes before the actual classification.
You may also wonder why it is impossible to increase . The answer is that this is difficult as the measurement of gene expression requires a functionally relevant tissue taken under the right conditions, which is sadly rare due to impossibility to meet all requirements at once in practice (read more about these problems in (Kohane, Kho, & Butte, 2003)). So, we are left with the necessity to live and to deal with high dimensional data.
Below five popular gene expression data sets are briefly described in order to give a realistic picture of what gene expression data are.