A Hybrid Approach Based on Self-Organizing Neural Networks and the K-Nearest Neighbors Method to Study Molecular Similarity

Abdelmalek Amine, Zakaria Elberrichi, Michel Simonet, Ali Rahmouni
Copyright: © 2013 | Pages: 22
DOI: 10.4018/978-1-4666-2455-9.ch113


The “Molecular Similarity Principle” states that structurally similar molecules tend to have similar physicochemical and biological properties. The question then is how to define “structural similarity” algorithmically and confirm its usefulness. Similarity searching falls within this framework: it is a practical approach for identifying candidate molecules (potential drugs) in databases or virtual chemical libraries by comparing compounds pairwise. Many statistical models and learning tools have been developed to correlate the structure of molecules with their chemical, physical, or biological properties. The role of data mining in chemistry is to extract “hidden” information from a set of chemical data. Each molecule is represented by a high-dimensional vector (built from molecular descriptors), and a learning algorithm is then applied to these vectors. In this paper, the authors study molecular similarity using a hybrid approach based on Self-Organizing Neural Networks and the K-Nearest Neighbors (KNN) method.
Chapter Preview


Similarity functions are used in many fields, in particular Data Analysis, Pattern Recognition, Symbolic Machine Learning, and Cognitive Science.

In general, a similarity function is defined over a universe U that can be modelled as a quadruple (Ld, Ls, T, FS):

  • Ld is the language of representation used to describe the data.

  • Ls is the language of representation of the similarities.

  • T is the body of knowledge available about the studied universe.

  • FS is the binary similarity function, such that FS: Ld × Ld → Ls.

When the purpose of the similarity function is to quantify the resemblance between data, the language Ls corresponds to the set of values in the interval [0, 1] or in R+, and we then speak of a similarity measure (Bisson, 2000).
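As an illustration of a similarity measure with values in [0, 1], the sketch below uses the Tanimoto (Jaccard) coefficient, a measure commonly applied to molecular fingerprints. It is not taken from the chapter preview: the choice of coefficient, the representation of a molecule as a set of feature indices (playing the role of Ld), and the names `tanimoto_similarity`, `mol_a`, and `mol_b` are all assumptions made for the example.

```python
from typing import Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of feature indices.

    Here Ld is "sets of integer feature indices" and Ls is the interval [0, 1].
    Both choices are assumptions made for this illustration.
    """
    if not a and not b:
        # Convention chosen here: two empty descriptions count as identical.
        return 1.0
    shared = len(a & b)
    # |A ∩ B| / |A ∪ B|, with |A ∪ B| = |A| + |B| - |A ∩ B|
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints: indices of the features present in each molecule.
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 8}

print(tanimoto_similarity(mol_a, mol_b))  # 2 shared features out of 5 distinct ones
```

In terms of the quadruple above, `tanimoto_similarity` plays the role of FS; no background knowledge T is used in this simple case.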

Most work on similarity measures builds on the mathematical concept of distance (the inverse notion of similarity), which has been studied extensively in Data Analysis (Mahé & Vert, 2007; Bisson, 2000).

It is defined as follows. Let Ω be the set of individuals of the studied domain, and let D be a function from Ω × Ω to R+. For all a, b, c ∈ Ω, D may satisfy the following properties:

  • 1. D(a, a) = 0 (property of minimality)

  • 2. D(a, b) = D(b, a) (property of symmetry)

When the function D satisfies properties 1 and 2, it is called a dissimilarity index (or, more simply, a dissimilarity).

Three further properties are also of interest:

  • 3. D(a, b) = 0 ⇒ a = b (property of identity)

  • 4. D(a, c) ≤ D(a, b) + D(b, c) (triangular inequality)

  • 5. D(a, c) ≤ Max[D(a, b), D(b, c)] (ultrametric inequality)
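The five properties above can be checked numerically on a finite sample of individuals. The following is a minimal sketch, not part of the original chapter: the helper `check_properties`, the tolerance `tol`, and the two example distances (`abs_diff` and `discrete`) are hypothetical names introduced for illustration. A check on a finite sample can only refute a property, not prove it for all of Ω.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def check_properties(
    items: Sequence[float],
    d: Callable[[float, float], float],
    tol: float = 1e-12,
) -> Dict[str, bool]:
    """Test properties 1 to 5 of D on every pair/triple drawn from `items`."""
    pairs = list(product(items, repeat=2))
    triples = list(product(items, repeat=3))
    return {
        # 1. D(a, a) = 0
        "minimality": all(abs(d(a, a)) <= tol for a in items),
        # 2. D(a, b) = D(b, a)
        "symmetry": all(abs(d(a, b) - d(b, a)) <= tol for a, b in pairs),
        # 3. D(a, b) = 0 implies a = b
        "identity": all(a == b or d(a, b) > tol for a, b in pairs),
        # 4. D(a, c) <= D(a, b) + D(b, c)
        "triangle": all(d(a, c) <= d(a, b) + d(b, c) + tol for a, b, c in triples),
        # 5. D(a, c) <= max(D(a, b), D(b, c))
        "ultrametric": all(
            d(a, c) <= max(d(a, b), d(b, c)) + tol for a, b, c in triples
        ),
    }


def abs_diff(a: float, b: float) -> float:
    """Absolute difference on the real line."""
    return abs(a - b)


def discrete(a: float, b: float) -> float:
    """Discrete distance: 0 for equal items, 1 otherwise."""
    return 0.0 if a == b else 1.0


points = [0.0, 1.0, 3.0]
print(check_properties(points, abs_diff))
print(check_properties(points, discrete))
```

On these sample points, the absolute difference satisfies properties 1 to 4 but violates property 5 (for a = 0, b = 1, c = 3, D(a, c) = 3 exceeds Max[1, 2] = 2), whereas the discrete distance satisfies all five.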
