A Simulation Study of the Use of Similarity Fusion for Virtual Screening

Martin Whittle (University of Sheffield, UK), Valerie J. Gillet (University of Sheffield, UK) and Peter Willett (University of Sheffield, UK)
DOI: 10.4018/978-1-61520-911-8.ch004


This chapter analyses the use of similarity fusion in similarity searching of chemical databases. The ranked retrieval of molecules from a database can be modelled using both analytical and simulation approaches describing the similarities between an active reference structure and both the active and the non-active molecules in the database. The simulation model described here has the advantage that it can handle unmatched molecules, i.e., those that occur in one of the ranked similarity lists that are to be fused but that do not occur in the other. Our analyses provide insights into why the results of similarity fusion are often inconsistent when used for virtual screening.


The discovery of novel bioactive molecules in the agrochemical and pharmaceutical industries is both costly and time-consuming, and there is hence much interest in techniques that can increase the cost-effectiveness of the discovery process. One such technique is virtual screening: the use of computational methods to rank a database of chemical molecules in order of decreasing probability of bioactivity. Attention can then be focused on those molecules at the top of the ranking since these are most likely to exhibit the activity of interest and are hence prime candidates for acquisition (or synthesis) and detailed biological screening (Alvarez & Shoichet, 2005; Bajorath, 2002; Eckert & Bajorath, 2007; Lengauer, Lemmen, Rarey, & Zimmermann, 2004; Oprea & Matter, 2004).

Many different approaches to virtual screening have been described in the literature, including: similarity searching (as discussed further below); pharmacophore mapping (where an attempt is made to identify the substructural features common to a set of known bioactive molecules); machine learning methods such as neural networks, support vector machines or decision trees (which can be used to classify an unknown molecule as a drug or a non-drug); and docking (which involves determining the degree of complementarity of a potential ligand to the binding site of the biological target). In this chapter, we focus on similarity searching, which is probably the simplest of the available techniques and which involves ranking a database of molecules in order of decreasing similarity to a known bioactive reference structure (Eckert & Bajorath, 2007; Willett, 2006a). Given a ranked database, a set of molecules is retrieved by setting a threshold and then retrieving those molecules that come above this threshold in the ranking, e.g., the top-ranked 1000 molecules. The resulting set of retrieved molecules can then be used to determine the effectiveness of the similarity search, using one of the evaluation criteria that have been described in the literature (Edgar, Holliday, & Willett, 2000; Jain & Nicholls, 2008; Truchon & Bayly, 2007).
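The ranking and threshold steps described above can be illustrated with a minimal sketch. It assumes that each molecule is represented as a set of fingerprint bit positions and that similarity is measured with the Tanimoto coefficient; neither choice is specified in the text, and the molecule identifiers, the toy fingerprints and the function names `tanimoto` and `similarity_search` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two sets of fingerprint bits (an assumed measure)."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(
    reference: FrozenSet[int],
    database: Dict[str, FrozenSet[int]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Rank the database by decreasing similarity to the reference; keep the top_k."""
    scored = [(mol_id, tanimoto(reference, fp)) for mol_id, fp in database.items()]
    # Sort by decreasing score; ties are broken by identifier for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy data, purely illustrative.
database = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 5}),
    "mol_c": frozenset({7, 8}),
    "mol_d": frozenset({1, 2, 3}),
}
reference = frozenset({1, 2, 3})
hits = similarity_search(reference, database, top_k=2)
```

Here `top_k` plays the role of the threshold; in a realistic search it would be a value such as 1000, and the retrieved set would then be scored with one of the evaluation criteria cited above.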

The rationale for similarity searching is the Similar Property Principle (Johnson & Maggiora, 1990), which states that molecules that are structurally similar are also likely to have similar properties. The ranking in a similarity search is effected using a quantitative measure of structural similarity, and many different types of measure have been reported in the literature (Bender & Glen, 2004; Sheridan & Kearsley, 2002; Willett, 2009). There have been several comparative studies that seek to assess the relative merits of different measures when used under the same conditions (Brown & Martin, 1996; Glen & Adams, 2006; Maldonado, Doucet, Petitjean, & Fan, 2006; Willett, 2006a; Willett, 2009). However, while it has been possible to identify measures of wide applicability, it has not been possible to identify any single approach that will always result in optimal performance (Sheridan, 2007). This has led to the idea of using not one but multiple similarity measures, combining the resulting rankings using a technique known as similarity fusion (Willett, 2006b).

The basic fusion approach involves computing a similarity score (approximating, directly or indirectly, the probability of activity) for each molecule in a database using several different similarity measures. These sets of scores are then combined in some way to give a new, fused score that will provide a better ranking of the database than will any single similarity measure. Given a known reference structure that is to be searched against a database using m different similarity measures, a pseudo-code description of similarity fusion can hence be summarized as follows:
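The following is an illustrative sketch of the general idea, not the chapter's own pseudo-code. It assumes that each of the m similarity measures yields a ranked list of molecule identifiers (best first), that the lists are fused by summing rank positions, and that a molecule missing from a list (an unmatched molecule) is given a rank one place below the end of that list. The fusion rule, the penalty for unmatched molecules and the function name `fuse_by_rank_sum` are all assumptions made for the example.

```python
from typing import Dict, List


def fuse_by_rank_sum(ranked_lists: List[List[str]]) -> List[str]:
    """Fuse m ranked lists (best first) by summing each molecule's rank positions."""
    fused: Dict[str, int] = {}
    all_ids = {mol for ranking in ranked_lists for mol in ranking}
    for ranking in ranked_lists:
        position = {mol: rank for rank, mol in enumerate(ranking, start=1)}
        # Assumed treatment of unmatched molecules: one place below the list end.
        penalty = len(ranking) + 1
        for mol in all_ids:
            fused[mol] = fused.get(mol, 0) + position.get(mol, penalty)
    # A lower summed rank is better; ties are broken by identifier.
    return sorted(fused, key=lambda mol: (fused[mol], mol))


# Two toy ranked lists from two hypothetical similarity measures.
list_1 = ["m1", "m2", "m3"]
list_2 = ["m2", "m4", "m1"]
fused_ranking = fuse_by_rank_sum([list_1, list_2])
```

In this toy case "m3" and "m4" each appear in only one list, so their fused positions depend entirely on the assumed penalty, which gives a sense of why the handling of unmatched molecules matters.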
