Kernel Methods And Feature Selection
You have already met kernels in the chapter on support vector machines. However, because of my claim that each chapter of this book is self-sufficient and self-contained, I would like to briefly reproduce the definition of kernel methods once again. Here, Wikipedia comes to our help (http://en.wikipedia.org/wiki/Kernel_methods).
Kernel methods are a class of pattern analysis algorithms (regression, clustering, correlation, classification) that approach the problem by projecting the data into a high-dimensional feature space, where relations hidden in the data under study can be revealed better than by pattern analysis in the original space of features. A good thing is that one does not need to know the explicit form of this projection, since it can be carried out via the inner products between the images of all pairs of data points in the high-dimensional feature space. The inner products are embedded into the definition of a kernel function (hence the name 'kernel methods'). The complete set of all inner products forms a kernel matrix. Another important thing about kernel methods is that computing the kernel function is cheaper than explicitly computing the projection.
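To make the implicit projection concrete, here is a minimal sketch (mine, not from the original text) in Python with NumPy. It uses the degree-2 polynomial kernel $k(x, y) = (x \cdot y)^2$, whose explicit feature map for two-dimensional inputs is known in closed form, and shows that the kernel function yields the same inner product without ever forming the projected images:

```python
# A minimal sketch of the kernel trick (illustrative, not the book's code):
# for k(x, y) = (x . y)^2 on 2-D inputs, the explicit feature map is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). Computing k directly never forms phi.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, y):
    """Kernel function: the inner product in feature space, computed cheaply."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(y)))  # inner product of the explicit images: 16.0
print(poly_kernel(x, y))       # same value, with no explicit projection: 16.0
```

For higher polynomial degrees or the Gaussian kernel, the feature space grows very large or even infinite-dimensional, so the gap between the two computations only widens.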
Readers who for some reason missed the chapter on support vector machines may wonder why mapping to a space whose dimension can be much higher than that of the original space is beneficial in our microarray setting. To address these worries and healthy doubts, I can say (following the pioneers of kernel methods) that the kernel matrix size is $N \times N$, where $N$ is the number of instances in the training set. Since for gene expression data it is common that the number of instances is much smaller than the number of genes whose expression levels were measured, we obtain a compact structure suitable for pattern analysis.
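As an illustration of this compactness, the following sketch (synthetic data and a linear kernel are my assumptions, not the book's) shows that the kernel matrix size depends only on the number of instances, not on the number of genes:

```python
# A hedged illustration (assumed setup): with a typical microarray shape of
# n = 50 samples by p = 10000 genes, the linear kernel matrix K = X X^T
# is only 50 x 50, regardless of p.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 50, 10_000                 # n << p, as in gene expression data
X = rng.standard_normal((n_samples, n_genes))   # synthetic expression matrix

K = X @ X.T                                     # kernel matrix of all pairwise inner products
print(X.shape, K.shape)                         # (50, 10000) (50, 50)
```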
In this chapter, by pattern analysis we mean looking for dependence between the features and the class labels in the kernel-induced space. The key assumption is that good (relevant for classification) features will maximize such dependence. The dependence itself is formulated through a mutual-information-like quantity (measuring the mutual dependence of two variables), which is introduced in the next section.