
Sébastien Gadat (Université Paul Sabatier, France)

Copyright: © 2011
Pages: 23

DOI: 10.4018/978-1-61520-991-0.ch006

Chapter Preview

High dimensional data

Most face analysis tasks encountered nowadays involve high-dimensional input variables. To detect faces in an image, we usually consider the set of all possible gray-level pixels of the image as informative features. In this case, the features are considered "low-level" features, as opposed to more sophisticated "high-level" features built from linear or nonlinear combinations of the gray-level pixels, such as linear filters, edges (mostly obtained from some thresholding step), Fourier and wavelet coefficients, principal components, etc.

In the case of high-level features, the number of variables may thus become very large and can even exceed the number of images. Imagine, for instance, one thousand sample images described by 60 × 85 pixels: each signal is then defined by 5,100 low-level features, and the number of samples rapidly becomes too small to describe the data efficiently. Consequently, face detection algorithms must generally solve high-dimensional problems, and a large number of variables may harm accuracy and robustness, as we will see in the next paragraph.
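As a rough sketch of this "more features than samples" situation (using NumPy and hypothetical random images in place of a real face database), flattening the pixel grid immediately produces more variables than observations:

```python
import numpy as np

# Hypothetical learning set: 1,000 images of 60 x 85 gray-level pixels.
rng = np.random.default_rng(0)
n_samples, height, width = 1000, 60, 85
images = rng.integers(0, 256, size=(n_samples, height, width))

# Flattening each image yields one "low-level" feature per pixel.
X = images.reshape(n_samples, -1)
n, d = X.shape
print(n, d)     # 1000 samples, 5100 features
print(d > n)    # True: the number of variables exceeds the number of images
```

Any statistical estimate fitted on such a matrix has far fewer observations than parameters to constrain, which is precisely the regime where feature selection becomes necessary.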

In this chapter, we focus on the problem of selecting features associated with the detection of "faces" (versus "non-faces") when images are described by high-dimensional signals.

A statistical remark and the curse of dimensionality

Let us first discuss in further detail why an abundance of variables can significantly harm classification in the context of face detection. From a statistical point of view, it is important to remove irrelevant variables, which act as artificial noise in the data, especially in the case of images, and limit the accuracy of detection tasks.

Moreover, in high-dimensional spaces, we face the curse of dimensionality, and it is generally impossible to draw reliable conclusions from databases built with a small number of examples relative to a large number of features. This phenomenon was first pointed out by Bellman (1961). Let us briefly describe this statistical property, which can be summarized as the exponential growth of hyper-volume as a function of dimensionality. Consider, for instance, the learning task of face detection. The signal is denoted I, while the presence of a face in the signal is Y (Y = 1 when a face is present and Y = 0 otherwise). A good statistical prediction of Y given I corresponds to a good knowledge of the conditional law Y|I. If d denotes the dimension of the space E_{d} where I lives, the problem is equivalent to finding a correct interpolation of this law given a Learning Set of size N whose samples are drawn in the space E_{d}. As d increases, the number of samples N necessary to build a design of fixed precision increases exponentially. For instance, N = 100 uniformly-spaced sample points suffice to sample the unit interval (d = 1) with no more than 0.01 distance between points, whereas an equivalent sampling of a 10-dimensional (d = 10) unit hypercube with a lattice spacing of 0.01 between adjacent points would require N = 10^{20} sample points. Thus, it becomes exponentially harder to approximate the conditional law Y|I as d increases.
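The exponential growth of the required sample size can be checked directly. The short sketch below counts the lattice points needed to keep a spacing of roughly 0.01 in the unit hypercube as the dimension grows:

```python
# Lattice points needed for a ~0.01 spacing in the unit hypercube [0, 1]^d.
points_per_axis = 100  # 100 uniformly-spaced points give ~0.01 spacing on [0, 1]

for d in (1, 2, 5, 10):
    n_points = points_per_axis ** d  # full lattice: one axis grid per dimension
    print(f"d = {d:2d}: N = {n_points:e}")

# d = 1 needs 100 points; d = 10 already needs 100**10 = 10**20 points.
```

The same 0.01 resolution that costs 100 samples in one dimension costs 10^20 samples in ten dimensions, which is the sense in which a fixed-size learning set becomes hopelessly sparse as d grows.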

In addition, let us recall the classical Bias/Variance trade-off (see Geman, Bienenstock, and Doursat, 1992). If one wants to predict Y with a function f applied to the signal I, using observations in a Learning Set D, the square loss decomposition for the prediction f(I; D) is given as

E_D[ (f(I; D) − E[Y|I])^2 ] = ( E_D[f(I; D)] − E[Y|I] )^2 + E_D[ (f(I; D) − E_D[f(I; D)])^2 ],

where the first term on the right-hand side is the squared bias and the second is the variance.

In the former decomposition, the bias is typically a non-increasing function of the dimension d, while the variance term may drastically increase with d. It is thus necessary to follow a Bias/Variance trade-off to find both a good prediction f and a good complexity d. One of the goals of feature selection is to mitigate this phenomenon by restricting the set of features to the "good" ones.
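The trade-off can be illustrated with a small Monte-Carlo sketch (a hypothetical one-dimensional toy problem, not the face detection setting): we repeatedly draw learning sets, fit a low-complexity and a high-complexity model, and estimate the squared bias and the variance of each prediction at a fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)  # plays the role of E[Y|I]
x0 = 0.25                                  # point where predictions are compared
n_sets, n_samples = 500, 30

# Low complexity (degree 1) versus high complexity (degree 9) polynomial fits.
preds = {deg: [] for deg in (1, 9)}
for _ in range(n_sets):
    x = rng.uniform(0, 1, n_samples)
    y = true_f(x) + rng.normal(0, 0.3, n_samples)  # one noisy Learning Set D
    for deg in preds:
        coeffs = np.polyfit(x, y, deg)
        preds[deg].append(np.polyval(coeffs, x0))

for deg, p in preds.items():
    p = np.asarray(p)
    bias2 = (p.mean() - true_f(x0)) ** 2  # squared bias over learning sets
    var = p.var()                          # variance over learning sets
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

Running this, the rigid degree-1 model shows a large squared bias and a small variance, while the flexible degree-9 model shows the opposite, mirroring the decomposition above.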

Finally, we can note some efficiency issues, since the speed of many classification algorithms improves considerably when the complexity of the data is reduced. For instance, the cost of the q-nearest neighbour algorithm grows proportionally with the number of variables. In some cases, applying classification algorithms such as Support Vector Machines, described in Vapnik (1998), or q-nearest neighbours on the full feature space is not feasible or realistic because of the time needed to apply the decision rule.
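The linear dependence of the query cost on the number of variables is visible in a naive brute-force q-nearest-neighbour implementation (a hypothetical sketch on random data, not the chapter's detector): every query computes n × d coordinate differences, so doubling d roughly doubles the work.

```python
import time
import numpy as np

def knn_query(X, x, q=5):
    """Brute-force q-nearest-neighbour query: O(n * d) distance computations."""
    dists = np.sum((X - x) ** 2, axis=1)  # one squared distance per training point
    return np.argsort(dists)[:q]          # indices of the q closest points

rng = np.random.default_rng(0)
n_train = 2000
for d in (100, 1000, 5100):               # 5100 = the 60 x 85 pixel example
    X = rng.standard_normal((n_train, d))
    x = rng.standard_normal(d)
    t0 = time.perf_counter()
    knn_query(X, x)
    print(f"d = {d:5d}: query time = {time.perf_counter() - t0:.5f} s")
```

On the full 5,100-dimensional pixel representation, each query is dozens of times slower than on a reduced 100-dimensional representation, which is one practical motivation for selecting features before classification.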
