Preparation of a molecular database from a set of 2 million compounds for virtual screening applications : gathering, structural analysis and filtering.
Jean-Christophe Mozziconacci, Eric Arnoult, Nicolas Baurin, Christophe Marot and Luc Morin-Allory
Institut de Chimie Organique et Analytique
UMR CNRS 6005 Université d'Orléans
45067 ORLEANS BP6759 Cedex 2 - France
Keywords : Chemical libraries, uniqueness, diversity, drug-likeness, filters, surface descriptors, MACCS fingerprints, "rule of 5"
Abstract :
Rational selection of compounds is required to obtain a molecular database suitable for virtual screening applications. To that end, 2 million compounds from 15 commercially or freely available chemical libraries were first gathered. Structural analysis was carried out to assess the uniqueness and diversity of these libraries using surface descriptors and MACCS fingerprints. The drug-likeness of these compounds was then investigated using common chemical features such as the "rule of five", flexibility, atom types and functional groups. Based on this information, successive filters were designed to extract a drug-like subset of compounds. Finally, a set of about 1 million unique and drug-like compounds was obtained. The resulting set is a useful resource for rapid virtual screening. It can also be reduced with a diversity search to a smaller set for slower screening methods.
Introduction :
Recent developments in combinatorial chemistry and high-throughput screening (HTS) have significantly increased the number of compounds available for biological assay. Commercially and freely available libraries are now offered by several suppliers and make up a large and valuable body of data. High-throughput virtual screening methods have also been developed to handle these large sets of compounds and to improve the "hit-rate" of discovery programs.
However, the results of these methods depend strongly on the content of the screening libraries. Redundant and inappropriate (reactive or toxic) structures have to be removed. Moreover, a highly diverse library is required to increase the likelihood that novel active compounds are discovered during the screening process. Analysis and filtering of these databases are therefore required to obtain a suitable subset of compounds for virtual screening projects. Such analyses have already been carried out [1,2] and several methods have been developed to improve the drug-likeness of screening libraries. Models [3,4], pharmacophore features [5,6], rules and knowledge-based methods [7,8] have been designed to identify drug-like molecules. Each of these methods has its strengths and limitations, and so far none of them can entirely distinguish known drugs from other compounds. Moreover, drug-likeness is not a straightforward concept.
In this study, 15 libraries were gathered from diverse origins. Some of them have never been documented in the literature. We intended to make the best possible use of this readily available, worthwhile and ever-growing (tens of thousands of compounds a year) source of structures.
Then, analyses were performed on these databases. The uniqueness of the libraries was compared. Their diversity was investigated and visualized graphically, using MACCS fingerprints and recently designed 2D surface descriptors. The drug-likeness of these databases was also assessed using some common chemical features.
Next, rules applied to these features were used as filters to extract a subset of compounds that is better suited to virtual screening than the initial database. The method is simple and quick, and makes it easy to adapt the content of the resulting library in a pragmatic way.
Materials & methods :
Databases gathered
15 libraries were gathered (figure 1). Most of them are freely available on the web to academic users. Some of these databases are HTS-oriented and subsets may be focused on specific targets such as GPCRs, kinases, proteases or ion channels. The resulting library consists of around 2 million compounds from quite diverse origins, including natural source extracts, semi-natural compounds, synthetic compounds and even reagents. Most of these compounds are readily available from the suppliers. In this respect, the diversity of the compounds gathered should lead to a valuable library for virtual screening.
Company | Library name | www address |
AcbBlocks | ACB | |
Asinex | Asinex | |
Key Organics | Bionet | |
ChemBridge | ChemBridge | |
ChemDiv | ChemDiv | |
ChemStar | Chemstar | |
InterBioScreen | IBS | |
Maybridge | Maybridge | |
Molecular Diversity Preservation International | MDPI | |
Micro Source Discovery Systems | MsDiscovery | |
Nanoscale Combinatorial Synthesis | NanoSyn | |
NCI/NIH Developmental Therapeutics Program | NCI | |
Timtec | Timtec | |
Tripos | Tripos | |
Institut de Chimie Organique et Analytique | ICOA | Corporate database |
Figure 1 : Origin of the libraries gathered.
The SDF files were downloaded or received from suppliers and were stored in a MOE [9] molecular database. The structures were reduced to a single connected component (counter-ions removed). The hydrogens were added and the ionization state of basic and acidic groups was adjusted. A 3D structure was generated for each molecule in order to obtain a handy visualization scheme.
Redundancy
The internal duplication rate in each database and the redundancy between databases were evaluated. Non-stereospecific SMILES codes were used because of the lack of any stereochemistry information in some libraries.
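The two quantities described above can be sketched in a few lines. This is only an illustration, not the authors' MOE workflow: it assumes each record has already been converted to a canonical non-stereospecific SMILES string (the toy strings below stand in for that step), and the function names are hypothetical. Whether the redundancy was computed on unique structures or on all records is not stated in the text; unique structures are assumed here.

```python
def internal_duplicate_rate(keys):
    """Fraction of records in one library that repeat another record."""
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


def redundancy(library, others):
    """Fraction of the unique structures of `library` that also occur
    in at least one of the `others` (assumption: computed on unique keys)."""
    unique = set(library)
    if not unique:
        return 0.0
    seen_elsewhere = set().union(*others) if others else set()
    return len(unique & seen_elsewhere) / len(unique)


# Toy data: each string stands for a canonical non-stereospecific SMILES.
libs = {
    "A": ["CCO", "CCO", "c1ccccc1", "CCN"],
    "B": ["CCO", "CC(=O)O"],
    "C": ["CCCC"],
}
dup_a = internal_duplicate_rate(libs["A"])              # one repeated record out of four
red_a = redundancy(libs["A"], [libs["B"], libs["C"]])   # one of three unique structures is shared
```

Because the comparison relies on exact string equality, the result depends entirely on the canonicalization step being consistent across libraries.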
In the following parts of this study, only one stereoisomer of each compound was kept, chosen arbitrarily, and all the others were removed. Nevertheless, a copy of the whole database was stored. It contains information about the suppliers and the packaging of the compounds, which is required to purchase candidate molecules for biological testing after virtual screening.
Diversity
The molecular diversity of each database was quantified using 2D molecular descriptors based on atomic contributions to van der Waals surface area, log P, molecular refractivity, and partial charge. These 52 P_VSA descriptors [10] were calculated with MOE. They have the advantage of being conformation-independent and therefore very quick to calculate. They are intended to capture hydrophobic and hydrophilic effects, polarizability and electrostatic interactions. Moreover, they form a meaningful low-dimensional chemical space.
The descriptors were scaled and centred using an in-house script written in Python. Similarity indices between the compounds of each database were then calculated using the Tanimoto coefficient with an in-house C++ program. The diversity of a database was expressed as the fraction of compounds whose similarity index to the other compounds of the database is, on average, < 0.6, 0.7 or 0.8.
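As an illustration of this diversity measure, the sketch below autoscales a descriptor table, computes pairwise similarities and returns the fraction of compounds whose mean similarity to the rest falls below a threshold. The in-house Python and C++ programs are not available, so several points are assumptions: the continuous form of the Tanimoto coefficient, T(a,b) = a·b / (|a|² + |b|² − a·b), is used because the descriptors are real-valued, and all function names are hypothetical.

```python
import math


def autoscale(rows):
    """Centre each descriptor column on 0 and scale it to unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def tanimoto(a, b):
    """Continuous Tanimoto coefficient (assumed form, see lead-in)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0


def diversity_fraction(rows, threshold):
    """Fraction of compounds whose mean similarity to all others is < threshold."""
    n = len(rows)
    count = 0
    for i in range(n):
        mean_sim = sum(tanimoto(rows[i], rows[j])
                       for j in range(n) if j != i) / (n - 1)
        if mean_sim < threshold:
            count += 1
    return count / n


# Toy descriptor vectors (already scaled): two identical compounds and one outlier.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frac_06 = diversity_fraction(toy, 0.6)
frac_03 = diversity_fraction(toy, 0.3)
```

The double loop is quadratic in the number of compounds, which is the reason this step dominates the run time for libraries of several hundred thousand structures.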
The set of 166 MACCS structural keys [11] was also calculated for each compound and then reduced with principal component analysis (PCA) using MOE. The compounds were plotted along the first 2 latent variables with the gnuplot software. In this way, the diversity and the location of each database in chemical space were compared graphically.
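The projection step can be illustrated with a small, dependency-free PCA. This is a stand-in for the MOE calculation, not a reproduction of it: it assumes the fingerprints are given as rows of 0/1 values, uses plain power iteration with deflation to obtain the first two axes, and all function names are hypothetical. It also returns the fraction of variance carried by each axis.

```python
import math
import random


def centre(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(mat, iters=1000, seed=0):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in mat]
    value = 0.0
    for _ in range(iters):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in mat]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:          # zero matrix: nothing left to extract
            return 0.0, vec
        vec = [x / norm for x in nxt]
        value = norm             # |M v| for unit v tends to the top eigenvalue
    return value, vec


def pca_2d(rows):
    """Scores on the first two principal axes and their explained variance."""
    centred = centre(rows)
    cov = covariance(centred)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    axes, values = [], []
    for _ in range(2):
        val, vec = leading_eigenpair(cov)
        axes.append(vec)
        values.append(val)
        # Deflation: remove the component just found.
        cov = [[cov[i][j] - val * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
              for row in centred]
    explained = [v / total if total else 0.0 for v in values]
    return scores, explained


# Toy "fingerprints": two keys that always occur together, one that never does.
keys = [[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]]
scores, explained = pca_2d(keys)
```

Each row of `scores` gives the two plotting coordinates of one compound. For real fingerprint sets a linear algebra library would be a more practical choice than power iteration.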
Drug-likeness
Firstly, the fraction of "non organic" structures (i.e. compounds which contain atoms other than carbon, hydrogen, oxygen, nitrogen, sulphur, phosphorus and halogens) was determined for each library. The compounds which contain C+, C-, O+ or N- atoms were also flagged as wrong structures; they result from import errors. Automatic correction of these structures is in progress.
Secondly, the fraction of compounds consistent with the widely used "rule of 5", introduced by Lipinski, was evaluated. This rule provides a guide for determining whether a compound is likely to be orally bioavailable. For such compounds, at least 3 of the following 4 rules hold :
hydrogen bond donor (OH and NH) ≤ 5
hydrogen bond acceptor (O and N) ≤ 10
molecular weight ≤ 500
logP ≤ 5
In this study, the SlogP descriptor of MOE (calculated according to Wildman and Crippen's method) was used to evaluate the lipophilicity. For a set of test compounds, the MlogP used by Lipinski and the SlogP were compared. As values were of the same order, SlogP can be used as a lipophilicity descriptor in the "rule of 5".
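The four criteria listed above translate directly into a violation count. The sketch below assumes the donor and acceptor counts, molecular weight and logP have already been computed by some descriptor package; the function names and the property values of the example compound are hypothetical. The `strict` switch corresponds to requiring all 4 criteria instead of at least 3.

```python
def rule_of_five_violations(hbd, hba, mw, logp):
    """Count how many of the four criteria are violated."""
    return sum([hbd > 5, hba > 10, mw > 500, logp > 5])


def passes_rule_of_five(hbd, hba, mw, logp, strict=False):
    """Lenient form: at most one violation. Strict form: no violation."""
    max_violations = 0 if strict else 1
    return rule_of_five_violations(hbd, hba, mw, logp) <= max_violations


# Hypothetical compound: only the molecular weight is above its limit.
candidate = dict(hbd=2, hba=6, mw=540.0, logp=4.2)
lenient_ok = passes_rule_of_five(**candidate)
strict_ok = passes_rule_of_five(**candidate, strict=True)
```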
Next, the drug-likeness evaluation of the libraries was improved with other properties which have been chosen according to the literature :
reactive function in the molecule (these compounds are likely to be toxic, unstable or may interfere with assays)
the number of rotatable bonds and single chains
the number and the maximum size of the rings
the number of halogens and the molecules which contain perfluorinated chains
at least one nitrogen or oxygen atom in the molecule
Filtering
Based on the previous analyses, the 2 million compound database was cleaned to extract a subset of valuable compounds for virtual screening. The redundant and the least drug-like molecules were removed. The limit chosen for each property is discussed below. Instead of removing every compound that fails one of these filters, "flags" are used to mark molecules which may be problematic. The more flags a compound has, the less drug-like it is considered to be.
All these calculations and analyses (descriptor combination, SMILES searches and profile building) were carried out using an SVL script in MOE. A graphical interface allows the user to choose which filtering steps are performed.
Materials
The calculations were carried out on a Windows PC (1.7 GHz, 1 GB RAM). Most of the calculations were quick (a few hours), except the similarity calculations with the C++ program, which took about 15 days.
Results & discussion :
Gathering
The 15 libraries (figure 1) gathered led to a 2 million compound library. Uniqueness, redundancy, diversity and drug-likeness analyses were carried out.
Redundancy
Each bar of the histogram in figure 2 corresponds to a library and displays different fractions of compounds. The libraries are sorted according to their initial size.
The first two fractions correspond to:
the duplicates within each library (blue)
the fraction of non-organic compounds and wrong structures (black).
The 3 remaining fractions are:
the compounds which are found only in this chemical library (red)
the compounds shared with the one library, among the 14 others, with which the overlap is largest (green)
the compounds shared with any of the remaining libraries (yellow)
The mean redundancy of each library is also reported below.

Figure 2 : Redundancy in each and between the 15 libraries
The fraction of duplicates in each database is quite low (less than 8 %), except in the MsDiscovery and ChemBridge libraries. In these 2 databases, most of the duplicates correspond to molecules which are available in several packagings.
Similarly, the fraction of non-organic compounds and wrong structures is usually about 2 %, but it is larger in the NCI and Tripos libraries (more than 5 %). In the Tripos library, most of these compounds are "wrong" structures, due to problems during importation. In the NCI library, 2/3 of these compounds contain non-organic atoms.
Bionet is the least redundant database: only 7.3 % of its compounds are found in at least one of the other 14 libraries. The ICOA library is only slightly redundant, but it is also the smallest database. The NCI database is not very redundant either: 9.8 % of its compounds are found in another library. It is important to note that NCI is 7.5 times larger than Bionet. As a result, the NCI database contributes a larger number of non-redundant compounds. By contrast, the Nanosyn library is the most redundant one, with 89.3 % of its compounds found in another database. The size and the chemical origin of the compounds do not have a great impact on the redundancy. Libraries with less than 20 % redundant compounds are either small (ICOA and ACB, less than 5,000 structures) or large (NCI and Tripos, more than 90,000 structures). Compounds of the ICOA library come from "classic" organic chemistry, whereas compounds of the Bionet library come mainly from combinatorial chemistry.
The results suggest that uniqueness differs greatly from one database to another. The size of a library is not an indication of its uniqueness. Nevertheless, each library brings some compounds which cannot be found in any other library, so no library is useless.
In the next part, the duplicates, redundant and non-organic compounds were removed from the libraries in order to keep only the relevant information for virtual screening. The resulting database contains about 1.2 million compounds.
Diversity
Using metric
In figure 3, the libraries are sorted according to their diversity (S < 0.7). The 3 bars represent the fractions of molecules whose average similarity to the other compounds in the library is < 0.6, < 0.7 and < 0.8. The higher these values, the more diverse the database.

Figure 3 : The diversity of the libraries assessed with P_VSA descriptors.
The NCI, MDPI, MsDiscovery and, to a minor extent, Maybridge databases are the most diverse libraries. For instance, 31 % of the molecules in the NCI library are < 0.6 similar to the other molecules of this database and 95 % are < 0.8 similar. The remaining databases are less diverse, and the ICOA database is the least diverse of all: less than 40 % of its molecules are < 0.8 similar to the other compounds. These products come from a single laboratory, which probably explains this result.
Finally, a wide range of diversity is noticed among the analyzed libraries.
PCA
The diversity of the databases was illustrated using PCA plots on MACCS fingerprints. The molecules of each library were displayed in the chemical space defined by the first 2 PCA axes calculated with the whole database of 1.2 million unique compounds. These first 2 principal components explain 18 % of the variance.

Figure 4 : PCA plots of 3 libraries (Bionet, Tripos and NCI respectively, in blue), compared to the whole database (in red)
On the PCA plots, some of the libraries (Asinex, ChemBridge, IBS, MDPI, MsDiscovery, NCI, Timtec) are spread over the whole chemical space, which illustrates the diversity of these databases. The other libraries (Tripos, ChemStar, Bionet and, to a minor extent, Maybridge and Nanosyn) are more or less confined to one part of the chemical space (see figure 4), which reflects their more focused content. Roughly the same results were obtained in the chemical space defined only by the drug-like compounds. These plots thus illustrate the differences in diversity between databases.
Filtering - Drug-likeness
Rule of 5
The drug-likeness is expressed as the percentage of a library consistent with either the rules of Lipinski (compounds which pass at least 3 criteria) or a more demanding filter (with the 4 criteria). Non unique and "non organic" compounds have already been removed (figure 5).

Figure 5 : The drug-likeness according to the "rule of 5".
Using these rules the percentage of "drug-like" molecules varies from 100% for ACD to 93.2% for Timtec (about 95% for the global library). Using the more drastic interpretation, the percentage varies from 93.6% for ACD to 73.5 % for IBS (around 77% for the global library).
Overall, the results suggest that the drug-likeness of the databases is roughly constant according to the "rule of 5". Larger differences between the databases are observed with the more drastic criteria.
However, it has been shown that a large fraction of known drugs is not consistent with the "rule of 5". For most of these compounds, the molecular weight or logP is too high. Therefore, these limits were relaxed. Given the screening objectives, only the compounds which are definitely unsuitable have to be removed.



Figure 6 : Profile of each property of the "rule of 5" and rotatable bond, in the 1.2 million database of unique compounds.
The new thresholds, listed below as the conditions under which a compound is flagged, were chosen according to the profiles in figure 6 in order to decrease the number of compounds discarded unnecessarily.
| logP | >7 |
| molecular weight | < 100 or > 800 |
| HB acceptors | > 10 |
| HB donors | > 5 |
Nevertheless, these features are obviously too crude to really assess the drug-likeness.
An overview of the subset, which has been selected with the rule of 5, showed that numerous compounds still do not have desirable properties.
Therefore, further properties were investigated.
Further properties
8 additional filters were used to identify non drug-like compounds in the databases.
| rotatable bonds | > 15 |
| reactive functions | > 0 |
| number of halogens | > 7 |
| single chains | > -(CH2)6CH3 |
| perfluorinated chains | > -CF2CF2CF3 |
| rings | > 6 |
| big ring | > 7-membered rings |
| Oxygen and nitrogen | = 0 |
The filter limits were chosen according to the analyses of drug databases (CMC, WDI or MDDR) reported in the literature [12,13]. To avoid rejecting too many compounds, some limits were relaxed in accordance with the profile of the corresponding property in the studied libraries (e.g. rotatable bonds in figure 6). Indeed, these filters are still a crude way to select drug-like compounds, and the goal of this study is to remove, in a pragmatic way, only the compounds that are worthless for virtual screening applications.
In figure 7, the libraries are sorted according to the fraction of compounds with no flag, which are supposed to be the most drug-like ones. This fraction differs greatly from one database to another : it ranges from 60 to almost 90 %. Most of the libraries have only a small fraction of compounds with more than 3 flags (< 1 %). NCI, MDPI and MsDiscovery are the libraries with the largest fraction of compounds with more than 2 flags. This indicates that these databases need to be filtered.
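The flagging scheme with the 12 limits given in the two threshold tables can be sketched as follows. This is not the authors' SVL script: the property keys are hypothetical, every property is assumed to have been computed beforehand, and the chain and ring limits are interpreted here as "longer than 7 carbons", "longer than 3 perfluorinated carbons" and "ring larger than 7 members", which is an assumption about the notation used in the tables.

```python
# Each filter returns True when the compound should receive a flag.
FILTERS = {
    "logP": lambda p: p["logp"] > 7,
    "molecular weight": lambda p: not (100 < p["mw"] < 800),
    "HB acceptors": lambda p: p["hba"] > 10,
    "HB donors": lambda p: p["hbd"] > 5,
    "rotatable bonds": lambda p: p["rot_bonds"] > 15,
    "reactive functions": lambda p: p["reactive_groups"] > 0,
    "halogens": lambda p: p["halogens"] > 7,
    "single chains": lambda p: p["longest_chain"] > 7,            # assumed reading
    "perfluorinated chains": lambda p: p["longest_perfluoro_chain"] > 3,  # assumed reading
    "rings": lambda p: p["rings"] > 6,
    "big ring": lambda p: p["largest_ring"] > 7,
    "no O or N": lambda p: p["n_oxygen_nitrogen"] == 0,
}


def flags(props):
    """Names of all filters that flag this compound."""
    return [name for name, is_flagged in FILTERS.items() if is_flagged(props)]


def select(compounds, max_flags=0):
    """Keep the compounds carrying at most `max_flags` flags."""
    return [c for c in compounds if len(flags(c)) <= max_flags]


# Two hypothetical compounds described by precomputed properties.
clean = dict(logp=2.5, mw=320.0, hba=4, hbd=1, rot_bonds=5, reactive_groups=0,
             halogens=1, longest_chain=2, longest_perfluoro_chain=0,
             rings=3, largest_ring=6, n_oxygen_nitrogen=4)
greasy = dict(clean, logp=8.1, reactive_groups=1)

kept_strict = select([clean, greasy])                 # no flag tolerated
kept_lenient = select([clean, greasy], max_flags=2)   # up to two flags tolerated
```

Keeping the flags as a list rather than rejecting on the first failure makes it possible to change `max_flags`, or to ignore a debatable filter, without recomputing any property.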

Figure 7 : The drug-likeness comparison of the 15 libraries using the 12 filters.
The drug-likeness of each library evaluated with these 12 filters differs greatly from that evaluated with the "rule of 5". A large fraction of the Maybridge, NCI and ACB compounds passed the rule of 5 but failed the 12 filters. Conversely, the ICOA and, to a minor extent, ChemBridge databases (both from "classical" organic chemistry) are rather drug-like according to both methods. Molecules in the small databases did not appear more drug-like than those in the large ones.
Finally, about the same number of compounds is selected by the two approaches. Nevertheless, the 12 filters have the advantage of being less crude than the "rule of 5" since they focus on a larger number of features.
The PCA plot of the compounds selected with the "rule of 5" or the 12 filters did not show any specific location, which would correspond to drug-like compounds.
Filtering
In figure 8, the 12 filters are ordered according to the number of compounds removed from the whole library.

Figure 8 : The number of compounds removed with each of the 12 filters in the whole library.
The largest share of compounds is removed because they contain at least one reactive function (15.5 %). However, this filter is one of the most debatable, since drugs with reactive functions exist. Therefore, it may be worthwhile to retain compounds that carry only this flag.
Rotatable bonds and logP each remove only about 3 % of the compounds using the limits described previously.
As expected, the filters concerning halogens remove only a small fraction of the compounds. However, only a few compounds (< 0.5 %) are selected by the HB donor, HB acceptor or MW filters, which was more surprising. More than 0.6 % of the compounds have a structure without any oxygen or nitrogen atom.
Resulting database
About 76% of the 1.2 million unique compounds did not have a flag (Figure 7). So, around 300,000 compounds were removed by these 12 filters. Compounds with < 2 flags represent 94 % of the whole library.
Actually, our ability to define global rules for drug-likeness assessment is limited. Some flags are more or less important depending on the targets, the therapeutic areas, or the screening objectives. Thus, it may be useful to adapt the filter limits and the maximum number of flags allowed.
Finally, from the 2 million compounds initially gathered, about 1 million unique compounds with suitable properties are kept after filtering. This subset is still large and represents a valuable library for virtual screening applications.
Conclusions & perspective :
In this study, we intended to build a suitable library for virtual screening applications. For that purpose, 2 million compounds were collected from 15 databases. We showed that the redundancy between these libraries differs greatly, but each database contains a fraction of unique compounds. Similarly, the analysis of their diversity using surface descriptors showed large differences from one database to another. These results were illustrated using PCA on MACCS fingerprints. The drug-likeness of these compounds was first assessed using the "rule of five". Then additional features were used to investigate more deeply the fraction of drug-like compounds in each database. Using filters based on these features, a subset of around 1 million unique and drug-like compounds was selected from the original databases. It represents a more convenient library for virtual screening applications, which should allow efficient drug lead discovery.
We now intend to screen the resulting library with the 2D-QSAR techniques and docking methods used in the laboratory [14,15]. Additionally, a diverse subset of compounds can be extracted in order to reduce this database. Appropriate filters could also be designed to bias the library towards a specific receptor. More complex filters (reactive functions, "frequent hitters" [16]) could also be designed to improve the current compound selection. In addition, a conformational search is currently in progress in order to use this database for pharmacophore and 3D-QSAR applications.
References :
[1] Voigt, J. H.; Bienfait, B.; Wang, S.; Nicklaus, M. C. Comparison of the NCI Open Database with Seven Large Chemical Structural Databases. J. Chem. Inf. Comput. Sci. 2001, 41, 702-712.
[2] Bradley, M. P. An overview of the diversity represented in commercially-available databases. J. Comput.-Aided Mol. Des. 2002, 16, 301-309.
[3] Walters, W. P.; Murcko, M. A. Can We Learn To Distinguish between "Drug-like" and "Nondrug-like" Molecules? J. Med. Chem. 1998, 41, 3314-3324.
[4] Sadowski, J.; Kubinyi, H. A Scoring Scheme for Discriminating between Drugs and Nondrugs. J. Med. Chem. 1998, 41, 3325-3329.
[5] Muegge, I.; Head, S. L.; Brittelli, D. Simple selection criteria for drug-like chemical matter. J. Med. Chem. 2000, 44, 1841-1846.
[6] Muegge, I. Pharmacophore features of potential drugs. Chem. Eur. J. 2002, 8, 1976-1981.
[7] Oprea, T. I. Property distribution of drug-related chemical databases. J. Comp. Aid. Mol. Des. 2000, 14, 251-264.
[8] Xu, J.; Stevenson, J. Drug-like Index: A new approach to measure drug-like compounds and their diversity. J. Chem. Inf. Comput. Sci. 2000, 40, 1177-1187.
[9] MOE : The Molecular Operating Environment from Chemical Computing Group Inc., 1255 University Street, Suite 1600, Montreal, Quebec, Canada H3B 3X3 http://www.chemcomp.com .
[10] Labute, P. A widely applicable set of descriptors. J. Mol. Graph. Model. 2001, 18, 464-477.
[11] Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280.
[12] Walters, W. P.; Murcko, M. A. Prediction of 'drug-likeness'. Adv. Drug Deliv. Rev. 2002, 54, 255-271.
[13] Charifson, P. S.; Walters, W. P. Filtering databases and chemical libraries. J. Comput.-Aided Mol. Des. 2002, 16, 311-323.
[14] http://www.univ-orleans.fr/SCIENCES/ICOA/eposter/eccc8/index.html
[15] Baurin, N.; Mozziconacci, J.-C.; Arnoult, E.; Chavatte, P.; Marot, C.; Morin-Allory, L. Property labeled van der Waals Surface Areas models of COX-2 inhibition. Training, validation and insights into 2D-QSAR consensus prediction for high-throughput virtual screening. Submitted to J. Chem. Inf. Comput. Sci.
[16] Roche, O.; Schneider, P.; Zuegge, J.; Guba, W.; Kansy, M.; Alanine, A.; Bleicher, K.; Danel, F.; Gutknecht, E-M.; Rogers-Evans, M.; Neidhart, W.; Stalder, H.; Dillon, M.; Sjögren, E.; Fotouhi, N.; Gillespie, P.; Goodnow, R.; Harris, W.; Jones, P.; Taniguchi, M.; Tsujii, S.; von der Saal, W.; Zimmermann, G.; Schneider, G. Development of a virtual screening method for identification of "frequent hitters" in compound libraries. J. Med. Chem. 2002, 45, 137-142.