Gene Expression Data Sets

Gene Expression Data Sets

DOI: 10.4018/978-1-60960-557-5.ch002
OnDemand PDF Download:
No Current Special Offers

Chapter Preview


Biological Data And Their Characteristics

Before embarking on a long tour across different machine learning methods, it is useful to look at some popular gene expression data sets and their characteristics.

First of all, a typical gene expression data set contains a matrix 978-1-60960-557-5.ch002.m01 of real numbers. Let 978-1-60960-557-5.ch002.m02 and 978-1-60960-557-5.ch002.m03 be the number of its rows and columns, respectively. Then 978-1-60960-557-5.ch002.m04 is represented as

where 978-1-60960-557-5.ch002.m06 represents the value in the 978-1-60960-557-5.ch002.m07th row and 978-1-60960-557-5.ch002.m08th column.

As there are thousands of gene expressions and only a few dozens of samples, 978-1-60960-557-5.ch002.m09 (the number of genes) is of order 1,000-10,000 while 978-1-60960-557-5.ch002.m10 (the number of biological samples) is somewhere between 10 and 100. Such a condition makes the application of many traditional statistical methods impossible as those techniques were developed under the assumption that 978-1-60960-557-5.ch002.m11. You may ask: what’s a problem?

The problem is in an underdetermined system where there are only a few equations versus many more unknown variables (Kohane, Kho, & Butte, 2003). Hence, the solution of such a system is not unique. In other words, multiple solutions exist. By translating this into the biological language of the applied problem treated in this book, this means that multiple subsets of genes may be equally relevant to cancer classification (Ein-Dor, Kela, Getz, Givol, & Domany, 2005), (Díaz-Uriarte & Alvarez de Andrés, 2006). However, in order to reduce a chance for noisy and/or irrelevant genes to be included into one of such subsets, one needs to eliminate irrelevant genes before the actual classification.

You may also wonder why it is impossible to increase 978-1-60960-557-5.ch002.m12. The answer is that this is difficult as the measurement of gene expression requires a functionally relevant tissue taken under the right conditions, which is sadly rare due to impossibility to meet all requirements at once in practice (read more about these problems in (Kohane, Kho, & Butte, 2003)). So, we are left with the necessity to live and to deal with high dimensional data.

Below five popular gene expression data sets are briefly described in order to give a realistic picture of what gene expression data are.

Complete Chapter List

Search this Book: