Advances in Relevant Descriptor Selection

Željko Debeljak (University of Zagreb, Croatia) and Marica Medic-Šaric (University of Zagreb, Croatia)
DOI: 10.4018/978-1-60960-860-6.ch008

Abstract

During the last few decades the number of available molecular descriptors has grown exponentially. A reduced set containing only the relevant descriptors enables a better understanding of the interaction between a molecule and a biological entity and, in turn, more reliable molecular modeling and chemical database mining. As a consequence, many new off-line and on-line descriptor selection methods have emerged. This chapter gives an overview of the most important feature selection methods, their advantages and disadvantages, and their applications in SAR and QSAR.
Chapter Preview

Background

During the last few decades the number of available molecular descriptors has grown substantially. In most situations this number is practically infinite. Without detailed prior knowledge of the phenomenon at hand, it is difficult to carry out a theoretically justified, unsupervised off-line descriptor selection. Modern QSAR studies therefore frequently include thousands of descriptors generated for a few dozen molecules. From a statistical point of view, the estimation of QSAR model parameters is highly unreliable or impossible in such settings. Even the application of machine learning approaches to model development leads to serious problems. The so-called “curse of dimensionality” (Guyon, 2003), caused by a large number of descriptors and their possible interactions, makes model development intractable. Even if the obtained models pass the external validation criteria, their interpretation is blurred by numerous descriptors of unknown relevance. This makes rational molecular design or database mining a very complex task.
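
The risk of pairing thousands of descriptors with a few dozen molecules can be illustrated with a small numerical experiment. The sketch below is not taken from the chapter: it uses purely synthetic random numbers, and the function names, the sample sizes (20 molecules, 2000 descriptors) and the random seed are assumptions chosen only for illustration. It shows that, among many irrelevant descriptors, at least one is likely to correlate strongly with the activity by chance alone.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_chance_correlation(n_molecules, n_descriptors, seed=0):
    """Largest |r| between a random 'activity' and purely random 'descriptors'.

    Every descriptor is irrelevant by construction, so any correlation
    found here arises by chance (hypothetical setup, synthetic data only).
    """
    rng = random.Random(seed)
    activity = [rng.gauss(0.0, 1.0) for _ in range(n_molecules)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_molecules)]
        best = max(best, abs(pearson(descriptor, activity)))
    return best


# Same 20 synthetic molecules, first with 5 and then with 2000 random descriptors.
best_few = max_chance_correlation(n_molecules=20, n_descriptors=5)
best_many = max_chance_correlation(n_molecules=20, n_descriptors=2000)

print(f"max |r| with    5 random descriptors: {best_few:.2f}")
print(f"max |r| with 2000 random descriptors: {best_many:.2f}")
```

Because both runs share the same seed, the 5 descriptors of the first run are also the first 5 of the second, so the larger pool can only match or exceed the smaller one. A descriptor picked from the large pool on the strength of such a correlation would look relevant while carrying no information, which is one reason descriptor selection has to be validated carefully.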
