ANALYSIS OF A SET OF 2.6 MILLION UNIQUE COMPOUNDS GATHERED FROM THE LIBRARIES OF 32 CHEMICAL PROVIDERS


Aurélien Monge, Alban Arrault, Christophe Marot and Luc Morin-Allory

ICOA UMR CNRS 6005
Université d’Orléans BP 6759, 45067 ORLEANS Cedex 2, France.




ABSTRACT

A total of 3.8 million compounds from the structural databases of 32 providers have been gathered and stored in our database. Once the duplicates are removed using the InChI [1], 2.6 million compounds remain. The 32 databases and the whole database were studied in terms of uniqueness, diversity, frameworks, and drug-like and lead-like properties. This study shows that there are more than 87 000 frameworks in our database. There are 1.9 million drug-like molecules, of which more than 900 000 are lead-like. Druglikeness and leadlikeness are estimated with in-house scores that use penalty functions to measure how well a compound fits each property, rather than hard cut-off values. The compounds are stored in a MySQL database and the code to manage this database is written in Java. As a result, we have a free and easily updatable system for managing chemical databases and generating screening sets.

Keywords: Chemoinformatics, chemical databases, screening, drug-like, lead-like, fingerprints, diversity.

INTRODUCTION

In a virtual or real screening project, choosing the molecules to test is a very important step. Depending on the targeted number of molecules, one or several appropriate providers, whose databases are the most representative of all the available compounds, must be chosen. Moreover, the lead-like and drug-like compounds must be identified.
Our laboratory has already combined databases [2]. We have improved the management of the resulting database by using database management software. We chose free tools, namely MySQL and Java. Our system now offers an easy way to update the whole database using update files from providers.
The database now contains 2.6 million unique compounds gathered from 32 providers. We used our system to analyse the providers’ databases. The uniqueness, diversity, frameworks, druglikeness, leadlikeness and the distribution of druglikeness-related properties were studied.

MATERIALS AND METHODS

The MySQL database server runs on a Linux PC. All the code is written in Java except one method written in C, called through the Java Native Interface (JNI), which allows batch operations to be launched properly on Windows. The JOELib [3] Java API was used for all operations on chemical structures and for the calculation of descriptors.
The structural data (SDF, SMILES, InChI...) are stored separately from the descriptors in order to speed up SQL queries on descriptors.
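
The separation of structural data from descriptors can be illustrated with a small sketch. It uses Python with an in-memory SQLite database as a stand-in for the MySQL server; the table names, column names and descriptor values are hypothetical and are not the schema actually used in this study.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server described in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bulky structural data: one row per unique compound.
    CREATE TABLE structure (
        compound_id INTEGER PRIMARY KEY,
        inchi       TEXT NOT NULL,
        smiles      TEXT,
        sdf         TEXT
    );
    -- Compact numeric descriptors, which are the target of most filters.
    CREATE TABLE descriptor (
        compound_id INTEGER PRIMARY KEY REFERENCES structure(compound_id),
        mw          REAL,
        logp        REAL,
        hba         INTEGER,
        hbd         INTEGER
    );
""")

# Two illustrative compounds (methanol and benzene) with approximate values.
conn.execute("INSERT INTO structure VALUES (1, 'InChI=1S/CH4O/c1-2/h2H,1H3', 'CO', NULL)")
conn.execute("INSERT INTO structure VALUES (2, 'InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H', 'c1ccccc1', NULL)")
conn.execute("INSERT INTO descriptor VALUES (1, 32.04, -0.77, 1, 1)")
conn.execute("INSERT INTO descriptor VALUES (2, 78.11, 2.13, 0, 0)")

# The filter only reads the narrow descriptor table; structures are joined
# in only for the rows that pass.
rows = conn.execute("""
    SELECT s.smiles
    FROM descriptor AS d JOIN structure AS s USING (compound_id)
    WHERE d.mw <= 460 AND d.logp BETWEEN -4 AND 4.2 AND d.hba >= 1
    ORDER BY d.mw
""").fetchall()
selected = [r[0] for r in rows]
```

In this layout a descriptor query never has to read the large SDF text fields, which is the motivation given above for the split.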

UNIQUENESS

Each imported compound has, if necessary, its counter-ion removed and is reprotonated. The resulting structure is added if it is not already present in the database. To characterise the chemical structures we chose the IUPAC International Chemical Identifier (InChI) [1]. InChI is free and was designed to become a standard unique code for molecules. It is a single line of text in which the sp2 and sp3 stereochemistry, isotopes and tautomerism are represented.
One of the reasons for the presence of duplicates in databases is undefined or badly defined stereochemistry. Note also that counter-ions are not taken into account when checking for duplicates.
We use the 1.12 Beta version of InChI, which has only basic support for aromatic bonds because the MDL mol format description specifies this bond type for queries only. As JOELib encodes some structures using the aromatic bond type, no InChI code can be computed for these structures and they are considered unique. The structures without unique codes represent less than 2 % of the database, so the percentage of duplicates due to this problem must be very low. Support for the aromatic bond type was improved in later versions.

The database is also designed to store the list of all the providers of a given compound. When a molecule is inserted and this compound is already in the database, the structure is not added again. However, if the provider of the inserted molecule is not in the providers list of this compound, the provider is added to the list. Moreover, if two molecules in the same file are detected as duplicates, this information is kept, which makes it possible to count the internal duplicates in providers’ databases. However, if two molecules from two different files of the same provider are detected as duplicates, they are not considered internal duplicates. In this case no information is kept about the second molecule added. This choice was made so that the database can be updated efficiently. Indeed, providers can release new compounds in an updated file that also contains all the former molecules, and these molecules obviously should not be counted as internal duplicates.
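
The insertion rules described in this paragraph can be sketched as follows. This is a minimal Python illustration, not the Java code used in this study; the class name, method name and placeholder identifiers are hypothetical, and compounds are represented only by their InChI strings.

```python
from collections import defaultdict


class CompoundRegistry:
    """Toy registry: unique structures, their providers, internal duplicates."""

    def __init__(self):
        self.providers = {}                          # InChI -> set of provider names
        self.internal_duplicates = defaultdict(int)  # provider -> count

    def import_file(self, provider, inchis):
        """Import one provider file, given as an iterable of InChI strings."""
        seen_in_file = set()
        for inchi in inchis:
            if inchi in seen_in_file:
                # Repeated inside the same file: counted as an internal duplicate.
                self.internal_duplicates[provider] += 1
                continue
            seen_in_file.add(inchi)
            # New structure: a new entry is created. Known structure: only the
            # providers list is extended (adding to a set ignores repeats, so a
            # later file from the same provider leaves no trace).
            self.providers.setdefault(inchi, set()).add(provider)


reg = CompoundRegistry()
reg.import_file("ProviderA", ["inchi-1", "inchi-2", "inchi-1"])  # one internal duplicate
reg.import_file("ProviderA", ["inchi-1", "inchi-2", "inchi-3"])  # update file: none counted
reg.import_file("ProviderB", ["inchi-1"])                        # second provider of inchi-1
```

After these three imports the registry holds three unique structures, one internal duplicate for ProviderA, and two providers for the first structure.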

To check whether a structure is present in the database, the MD5 hash code of its InChI is compared to the indexed MD5 hash codes of the InChI of all the compounds in the database. If two structures have the same MD5 hash code, their InChI codes are compared to check that they are actual duplicates, even though the probability that two structures share an MD5 hash code but not an InChI code is very low (there are 16^32 possible MD5 hash codes). Using MD5 hash codes allows duplicates to be searched for in a smaller table with fixed-size rows, which is more efficient.
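
The lookup can be sketched in a few lines of Python with the standard hashlib module. The dictionary below plays the role of the indexed hash column; the class and method names are hypothetical and the sketch does not reproduce the actual MySQL tables.

```python
import hashlib


class InchiIndex:
    """Toy duplicate check: MD5 digest first, full InChI comparison second."""

    def __init__(self):
        self._by_digest = {}  # 16-byte MD5 digest -> list of InChI strings

    @staticmethod
    def digest(inchi):
        # MD5 always yields 16 bytes (32 hexadecimal digits), whatever the
        # length of the InChI, hence the fixed-size rows.
        return hashlib.md5(inchi.encode("utf-8")).digest()

    def contains(self, inchi):
        candidates = self._by_digest.get(self.digest(inchi), [])
        # Comparing the full InChI guards against a hash collision.
        return inchi in candidates

    def add(self, inchi):
        """Return True if the InChI was new and has been stored."""
        if self.contains(inchi):
            return False
        self._by_digest.setdefault(self.digest(inchi), []).append(inchi)
        return True


index = InchiIndex()
methanol = "InChI=1S/CH4O/c1-2/h2H,1H3"
first = index.add(methanol)   # new structure, stored
second = index.add(methanol)  # same InChI again, rejected as a duplicate
```

A 32-digit hexadecimal code has 16^32 = 2^128 possible values, which is why a collision between two different InChI codes is so unlikely.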

DATABASES

The databases have been gathered on the web or obtained by collaboration. We have obtained the databases of 32 providers (table 1).

Provider | Web site | Imported compounds * | Origin
--- | --- | --- | ---
ACB Blocks | http://www.acbblocks.com | 61237 | Organic synthesis
Akos | http://www.akosgmbh.com | 161316 | Organic synthesis
AnalytiCon Discovery | http://www.ac-discovery.com | 5438 | Pure and semi-synthetic natural products
Arkive | http://ark.chem.ufl.edu/pages/arkive.htm | 28504 | Organic synthesis
Array BioPharma | http://www.arraybiopharma.com | 517 | Organic synthesis focused primarily in cancer and inflammatory disease
Asinex | http://www.asinex.com | 348203 | Organic synthesis
Aurora | http://www.aurora-feinchemie.com | 25295 | Organic synthesis
BioFocus | http://www.biofocus.com | 23836 | Organic synthesis focused primarily in kinase, GPCR and ion channel
Chembridge | http://www.chembridge.com | 387859 | Organic synthesis
ChemDiv | http://www.chemdiv.com | 451205 | Organic synthesis and combinatorial chemistry
Chemical Block | http://www.chemical-block.com | 1993 | Organic synthesis
ChemStar | http://www.chemstar.ru | 60066 | Organic synthesis
Chem. Nat. | http://chimiotheque-nationale.ujf-grenoble.fr/GDS/ | 14946 | Organic synthesis and natural products
Enamine | http://www.enamine.relc.com | 385175 | Organic synthesis
ICOA | http://www.univ-orleans.fr/icoa/ | 2811 | Organic synthesis
IFLab | http://www.iflab.kiev.ua | 100392 | Organic synthesis
InterBioScreen | http://www.ibscreen.com | 345060 | Organic synthesis
Key Organics Ltd | http://www.keyorganics.ltd.uk | 42414 | Organic synthesis
Maybridge | http://www.maybridge.com | 71041 | Organic synthesis
MDPI | http://www.mdpi.org | 8853 | Organic synthesis
MSDiscovery | http://www.msdiscovery.com | 1982 | Known drugs, experimental bioactives, and pure natural products
Nanosyn | http://www.nanosyn.com | 65184 | Organic synthesis
NCI | http://dtp.nci.nih.gov | 244321 | Organic synthesis and natural products focused in anticancer and anti-AIDS
Pharmeks | http://www.pharmeks.com | 83992 | Organic synthesis
Prestwick | http://www.prestwickchemical.com | 876 | Marketed drugs and others
Sigma-Aldrich | http://www.sigma-aldrich.com | 162171 | including plant extracts and microbial cultures
Specs | http://www.specs.net | 178492 | Organic synthesis and natural products
TimTec | http://www.timtec.net | 122238 | Organic synthesis and natural products
TOSLab | http://www.toslab.com | 21004 | Organic synthesis and semi-natural compounds
Tripos | http://leadquest.tripos.com | 82370 | Organic synthesis
VitasMLab | http://www.vitasmlab.com/ | 193993 | Organic synthesis
Worldmolecules | http://www.worldmolecules.com/ | 33259 | Organic synthesis

Table 1: List of providers with the number of molecules present in the database (* internal duplicates are not evaluated).

The database files of these 32 providers contain a total of 3.8 million molecules. These compounds stem from classical organic synthesis, combinatorial chemistry or the extraction of natural compounds. The ICOA database is our corporate database and is included in the French “Chimiothèque Nationale” (Chem. Nat.), which gathers the databases of 17 French public laboratories [4].

DIVERSITY

We used the dissimilarity step of the Stochastic Cluster Analysis (SCA) [5] to determine the number of clusters, as was done in a study of the NCI database [6]. As the clusters are created by diversity, the number of clusters gives us information about the diversity of the database. The descriptors used and stored in the database are the SSKey-3DS [7] fingerprints. The SSKey-3DS consists of 32 bits coding for the presence or absence of 32 fragments, and 22 bits which encode numerical values (H-bond acceptors, aromatic bonds and fraction of rotatable bonds). We have programmed this algorithm in Java. The number of clusters of the whole database is computed within one hour on a standard PC.
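
The idea of counting clusters by dissimilarity can be illustrated with the following Python sketch. It is a simplified greedy selection, which may differ in detail from the published SCA procedure and from our Java implementation: compounds are visited in random order, and a compound opens a new cluster when its Tanimoto similarity to every cluster seed found so far is below a threshold. The threshold of 0.7, the function names and the toy fingerprints (plain integers used as bit strings) are assumptions made for the example.

```python
import random


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as Python integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0
    return bin(a & b).count("1") / union


def count_clusters(fingerprints, threshold=0.7, seed=0):
    """Count the compounds that are dissimilar to all previously kept seeds."""
    order = list(fingerprints)
    random.Random(seed).shuffle(order)  # stochastic visiting order
    seeds = []
    for fp in order:
        if all(tanimoto(fp, s) < threshold for s in seeds):
            seeds.append(fp)
    return len(seeds)


# Two pairs of near-identical fingerprints and one isolated fingerprint.
toy_fingerprints = [0b1111, 0b1110, 0b11110000, 0b11100000, 0b100000000]
n_clusters = count_clusters(toy_fingerprints)
```

Each seed is compared with every new compound, so the cost grows with the number of compounds times the number of clusters, which is why a short 54-bit fingerprint is convenient for a database of this size.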

FRAMEWORKS

In a study of the shapes of drug-like compounds, Bemis and Murcko used the notion of graph frameworks, which correspond to ring systems connected to each other by linkers [8]. To obtain the graph framework of the hydrogen-depleted structure of a compound, all the atoms of the molecule are replaced by “non-typed” atoms and all the bonds are replaced by “non-typed” bonds. Then, all atoms connected to only one other atom are removed. This step is repeated until no more atoms are deleted (figure 1).

Figure 1: visualisation of the framework algorithm: 1) hydrogens are removed, 2) atoms with only one bond are removed, and this step is repeated until only atoms with two or more bonds remain, 3) all atom types are set to C, and all bond types except aromatic are set to single.

In our implementation of this algorithm, non-typed atoms are replaced by C atoms and non-typed bonds are represented by single bonds but, unlike in the Bemis method, aromatic bonds are distinguished from non-aromatic bonds. The advantage of this representation is that it can be stored as a structure and displayed with a molecular viewer. Furthermore, we can compute the InChI of the frameworks and store only unique frameworks.
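
The pruning step of the framework algorithm can be sketched on a plain graph. The Python function below is an illustration only: it takes the bonds of a hydrogen-depleted structure as pairs of atom indices, ignores atom and bond types (so it does not keep the aromatic flag used in our implementation), and its name is hypothetical.

```python
def graph_framework(bonds):
    """Return the bonds left after repeatedly removing terminal atoms.

    `bonds` is an iterable of (atom_i, atom_j) index pairs describing the
    hydrogen-depleted structure.
    """
    neighbours = {}
    for i, j in bonds:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)

    while True:
        # Atoms bonded to at most one other atom are terminal.
        terminal = [a for a, nb in neighbours.items() if len(nb) <= 1]
        if not terminal:
            break  # nothing was removed in this pass: the framework is reached
        for atom in terminal:
            for other in neighbours.pop(atom):
                neighbours[other].discard(atom)

    return sorted({tuple(sorted((a, b))) for a, nb in neighbours.items() for b in nb})


# A six-membered ring (atoms 0-5) carrying a two-atom side chain (atoms 6, 7).
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
side_chain = [(0, 6), (6, 7)]
framework = graph_framework(ring + side_chain)
```

The side chain is removed in two passes (atom 7, then atom 6) and only the ring remains, while a linker atom between two rings would be kept because it always has two neighbours. A structure without any ring is pruned completely and gives an empty framework.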

DRUG-LIKE AND LEAD-LIKE PROPERTIES

Lipinski’s rules [9] are the most widely used to identify drug-like compounds [10]. Other techniques based on artificial neural networks have also been used [11], [12], [13].
First of all, compounds with atoms C, O, N, S, P, F, Cl, Br, I, Na, K, Mg, Ca, or Li are flagged. Next, we used filters described in another study from our laboratory [2]:
The definition of the reactive functions used is the modified version by Oprea [14] of the list published by Rishton [15].

A recent review by Hann and Oprea gives rules to select lead-like molecules [16]. In our database a compound is lead-like if it is drug-like, with H-bond acceptors (HBA) ≤ 9, molecular weight (MW) ≤ 460, -4 ≤ logP ≤ 4.2, rotatable bonds ≤ 10, and smallest set of smallest rings (SSSR) ≤ 4. The following definitions are used:

On the basis of these rules, we have evaluated the drug-like and lead-like properties using two methods. The first method counts the number of criteria that are not met; the second computes a progressive score based on these criteria.
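
The first method, counting the criteria that are not met, can be sketched as follows for the lead-like cut-offs quoted above. The sketch assumes that 9, 460, 10 and 4 are upper limits and that logP must lie between -4 and 4.2; it covers only these additional lead-like criteria, not the drug-like filters, and the dictionary keys and function name are hypothetical.

```python
# (lower limit, upper limit) for each property; None means "no limit".
LEAD_LIKE_LIMITS = {
    "hba": (None, 9),     # H-bond acceptors
    "mw": (None, 460),    # molecular weight
    "logp": (-4, 4.2),
    "rb": (None, 10),     # rotatable bonds
    "sssr": (None, 4),    # smallest set of smallest rings
}


def count_failures(props, limits=LEAD_LIKE_LIMITS):
    """Count the properties whose value falls outside the allowed range."""
    failures = 0
    for name, (low, high) in limits.items():
        value = props[name]
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            failures += 1
    return failures


# Hypothetical compound: logP and the rotatable bond count exceed their limits.
compound = {"hba": 5, "mw": 350.0, "logp": 4.5, "rb": 12, "sssr": 3}
n_failures = count_failures(compound)
```

With this method a logP of 4.5 counts exactly as much as a logP of 8, which is the limitation that the progressive score is meant to remove.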

HBD: the intermediate penalty zone stretches from 3.5 to 6.5 (former cut-off: 5).
HBA: the intermediate penalty zone extends from 7 to 13 for the druglikeness (former cut-off: 10) and from 6.3 to 11.7 for the leadlikeness (former cut-off: 9).
Number of rotatable bonds (RB): the intermediate penalty zone extends from 10.5 to 19.5 for the druglikeness (former cut-off: 15) and from 7 to 13 for the leadlikeness (former cut-off: 10).
Number of SSSRs: the intermediate penalty zone stretches from 4.2 to 7.8 for the druglikeness (former cut-off: 6) and from 2.8 to 5.2 for the leadlikeness (former cut-off: 4).
Maximum ring size: the intermediate penalty zone extends from 6 (6-membered rings have no penalty) to 9.1 (former cut-off: 7).
Number of halogens (X): the intermediate penalty zone stretches from 4.9 to 9.1 (former cut-off: 7).
Molecular weight (MW): the lower intermediate penalty zone extends from 100 to 150 (based on the marketed drug weight distribution [20]). The upper intermediate penalty zone stretches from 350 to 800 for the druglikeness (500 – 30 %; the former limit of 800 is kept because it was already very permissive) and from 322 to 588 for the leadlikeness (former cut-off: 460 [15]).
LogP: the lower intermediate penalty zone extends from -5 to -1.5, the upper intermediate penalty zone stretches from 4.5 to 7.5 for the druglikeness (based on the marketed drug logP distribution [20]) and from 2.9 to 5.5 for the leadlikeness (former cut-off: 4.2 [15]).

Figure 2: details of the penalty functions.

Linear equations for each criterion are used to compute a drug-like score. For each property the minimum penalty is 0 (no penalty) and the maximum penalty is 1 (figure 2). We tolerate one unsatisfied property, and consequently molecules with a drug-like score ≤ 1 are considered drug-like. Some properties add 2 to the score and thus make the compound non drug-like as soon as one of them is not satisfied, even if all the other properties are satisfied. These properties are:
Therefore a molecule with a score ≥ 2 is definitely non drug-like. This method has two advantages over summing the number of unsatisfied drug-like criteria. Firstly, a compound with a property value slightly above a given cut-off is not eliminated but is given a score penalty. Secondly, this method provides a progressive value of druglikeness, which allows the compounds to be sorted.

The lead-like score is designed exactly like the drug-like score but using the lead-like properties when they differ from the drug-like ones.
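
The progressive score can be sketched with one linear ramp per criterion, using the drug-like penalty zones listed in figure 2. The Python code below is an illustration under stated assumptions, not our Java implementation: the penalty is taken to rise linearly from 0 to 1 across each upper zone and to fall from 1 to 0 across each lower zone, the properties that add 2 to the score are left out, and the names are hypothetical.

```python
def ramp(value, start, end):
    """Penalty of 0 up to `start`, 1 from `end` onwards, linear in between."""
    if value <= start:
        return 0.0
    if value >= end:
        return 1.0
    return (value - start) / (end - start)


# Upper intermediate penalty zones for the drug-like score (figure 2).
DRUG_UPPER_ZONES = {
    "hbd": (3.5, 6.5),
    "hba": (7, 13),
    "rb": (10.5, 19.5),
    "sssr": (4.2, 7.8),
    "max_ring": (6, 9.1),
    "halogens": (4.9, 9.1),
    "mw": (350, 800),
    "logp": (4.5, 7.5),
}
# Lower intermediate penalty zones: full penalty below the first bound,
# no penalty above the second (assumed direction).
DRUG_LOWER_ZONES = {"mw": (100, 150), "logp": (-5, -1.5)}


def drug_like_score(props):
    """Sum of the per-property penalties; lower is better."""
    score = 0.0
    for name, (start, end) in DRUG_UPPER_ZONES.items():
        score += ramp(props[name], start, end)
    for name, (start, end) in DRUG_LOWER_ZONES.items():
        score += 1.0 - ramp(props[name], start, end)
    return score


# Hypothetical compound with 5 H-bond donors, halfway through the HBD zone.
compound = {"hbd": 5, "hba": 7, "rb": 6, "sssr": 3, "max_ring": 6,
            "halogens": 0, "mw": 350, "logp": 3.0}
score = drug_like_score(compound)
```

Only the HBD criterion contributes here, with (5 - 3.5) / (6.5 - 3.5) = 0.5, so the compound keeps a score below 1. Replacing the zones by the lead-like ones from figure 2 would give the lead-like score in the same way.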

Selecting compounds with a nitro group is not recommended, because the nitro group can cause false positives [18]. Although the nitro group is not in our default frequent hitters list, all nitro compounds are flagged and can easily be removed if desired.

We want to highlight that there are no absolute lead-like and drug-like rules. They depend mainly on the project and on the type of test used. Although we have chosen parameters for each of these rules, our system allows them to be changed easily to extract a new dataset of compounds. In addition to the classical parameters, we can eliminate molecules with unwanted substructures from the dataset.


RESULTS AND DISCUSSION

DISTRIBUTION

The four providers with the greatest number of available compounds (figure 3) are ChemDiv, InterBioScreen, ChemBridge and Enamine. However the originality of the structures of each provider must also be assessed in order to compare them.

Figure 3: Distribution of the compounds of the whole database by providers.

First of all, we must note that some databases can be, in part or in full, compilations of other databases. The ICOA and Chem. Nat. databases have been treated differently from the other databases in this study, because we know that the ICOA database (our corporate database) is included in the Chem. Nat. database. Consequently, for each of these two databases the molecules in common are not considered duplicates (figure 4).

Figure 4: for each provider, percentage of compounds not present in other providers’ databases.

BioFocus (100 %), AnalytiCon Discovery (99.96 %), ACB Blocks (97.9 %), Tripos (97 %) and ICOA (92.2 %) have the highest percentages of original compounds. Except for the very big databases, there is no direct relationship between database size and the percentage of original compounds. Indeed, AnalytiCon Discovery and ICOA are relatively small, with 5438 and 2811 compounds respectively, but ACB Blocks and Tripos are larger, with 61237 and 82370 compounds. The four biggest databases have between 36 and 85 % of original compounds.
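
The percentage of original compounds can be computed with simple set operations. The Python sketch below uses hypothetical provider names and compound identifiers, not data from this study; the `ignore` argument mimics the special treatment of the ICOA and Chem. Nat. databases, whose common molecules are not counted as duplicates.

```python
def percent_original(provider, catalogue, ignore=()):
    """Percentage of a provider's compounds found in no other database.

    `catalogue` maps each provider name to the set of its unique compound
    identifiers; providers listed in `ignore` are left out of the comparison.
    """
    own = catalogue[provider]
    others = set()
    for name, compounds in catalogue.items():
        if name != provider and name not in ignore:
            others |= compounds
    return 100.0 * len(own - others) / len(own)


toy_catalogue = {
    "P1": {"a", "b", "c", "d"},
    "P2": {"c", "d", "e"},
    "Lab": {"x", "y"},
    "National": {"x", "y", "z", "a"},  # compilation that includes "Lab"
}
p1_original = percent_original("P1", toy_catalogue)
lab_original = percent_original("Lab", toy_catalogue, ignore=("National",))
```

Without the exception, every compound of the included laboratory database would be reported as a duplicate of the compilation that contains it.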

Figure 5: percentage of internal duplicates for each provider.

The Sigma-Aldrich (9.3 %), NCI (6.0 %), MDPI (5.3 %) and Arkive (3.6 %) databases have the highest percentages of internal duplicates (figure 5). No duplicates were found in the databases of ACB Blocks, AnalytiCon Discovery, BioFocus and Tripos. The size of the library is clearly not linked to the number of internal duplicates. The best example is the ChemDiv library, which is the biggest library and has only 0.02 % of internal duplicates.

DIVERSITY

The chemical space covered by a database is essential information. We used the dissimilarity step of the SCA algorithm with SSKey-3DS fingerprints to compute the number of clusters for the whole database and for each provider (figure 6). The NCI database is clearly the most representative of the chemical space and covers 59 % of the chemical space of the whole database. However, this database cannot be considered a commercial database. After the NCI, Enamine (37 %), ChemDiv (36 %), InterBioScreen (35 %), Sigma-Aldrich (35 %) and ChemBridge (34 %) are the databases most representative of the global diversity. The least representative database is Array BioPharma (0.5 %), which is expected because it is the smallest database (517 compounds).

Figure 6: diversity against the whole database for each provider.

We expect the biggest databases to also be the most diverse. We studied the relationship between the number of compounds in a database and its diversity (number of clusters). Figure 7 shows a rapid linear increase for the databases with fewer than 100 000 compounds; for the databases with more than 150 000 compounds, the diversity increases more slowly. The NCI, with 10 623 clusters for 250 000 compounds, is an outlier.

Figure 7: increase of the diversity with the size of the databases.

FRAMEWORKS

Figure 8: percentage of the frameworks of the whole database represented by each provider.

87 000 frameworks were found in the whole database. Figure 8 shows, for each provider, the percentage of the frameworks of the whole database that it contains. Unlike in the diversity study, the NCI is not the most representative of the whole database but comes in fifth position (19 %). Enamine is first (33 %), followed by ChemDiv (26 %) and InterBioScreen (23 %).

Among the commercial databases, the three with the largest numbers of frameworks are also the three most diverse in the previous section, in the same order. The least representative databases are Array BioPharma (0.02 %), Chemical Block (0.29 %) and Prestwick (0.39 %). Figure 9 shows that the number of frameworks is highly correlated with the size of the databases, with an R² = 0.89; this correlation explains the previous results.

Figure 9: linear regression to link the number of frameworks to the number of compounds (R² = 0.89).

DRUG-LIKE

We have studied the drug-like properties of the databases with two approaches: a classic one which, for each compound, counts the number of violations of the limits of the rules, and a second one which uses the score presented in the previous section.

For each provider, the numbers of molecules with 0, 1, 2 and more than 2 drug-like failures are shown in figure 10.

Figure 10: percentage of drug-like failures for each provider’s database.

All the libraries have a high proportion of molecules with 0 or 1 drug-like failures. The library with the lowest percentage of molecules with no drug-like failures is Tripos, with 55 %. ACB Blocks (91 %), Aurora (89 %), InterBioScreen (87 %) and Chemical Block (87 %) are the libraries with the highest percentages of compounds without drug-like failures. Among the libraries, only Array BioPharma has 0 % of molecules with 2 or more drug-like failures.
The other method to estimate drug-like properties is the drug-like score (figure 11).

Figure 11: percentage of drug-like scores for each provider’s database.

If we consider as “drug-like” the compounds with a drug-like score ≤ 1, the providers with the largest percentages of drug-like compounds are ACB Blocks (88 %), Chemical Block (88 %) and Aurora (87 %). The distribution of the drug-like score in the whole database is shown in figure 12.

Figure 12: drug-like score distribution.

The relative importance of the drug-like filters is shown in figure 13. Many compounds (14 %) are removed because of reactive functions.

Figure 13: influence of drug-like filters.

LEAD-LIKE

The lead-like failures are represented in figure 14.

Figure 14: percentage of lead-like failures for each provider’s database.

If we consider the molecules without lead-like failures as lead-like, Chemical Block (79 %) and Array BioPharma (73 %) are the most “lead-like” libraries. AnalytiCon Discovery (19 %) and Tripos (22 %) are the databases with the fewest lead-like compounds. These results are consistent with the lead-like score presented in figure 15.

Figure 15: percentage of lead-like scores for each provider’s database.

If we consider the compounds with lead-like scores ≤ 1 as lead-like, the conclusions are the same as when selecting compounds with no lead-like failures. Chemical Block (80 %) and Array BioPharma (76 %) have the largest percentages of lead-like compounds, and AnalytiCon Discovery (10 %) and Tripos (20 %) have the fewest lead-like compounds.
We can see in figure 15 that the distribution of the lead-like score is linearly progressive over the whole database. Consequently, this function can be very useful to sort compounds by leadlikeness.

Figure 15: lead-like score distribution.

Figure 16 shows that the logP filter is the most selective of the lead-like filters and removes 48 % of the compounds.

Figure 16: influence of lead-like filters.

DIVERSITY IN THE LEAD-LIKE SPACE

We have already compared the chemical space covered by each database by counting the number of clusters created by diversity in a database. However, part of the diversity of a database can come from compounds that are not drug-like. We present here a second study of the coverage of the diversity space by the databases, this time limited to the lead-like space (figure 16).

Figure 16: lead-like space of the whole database covered by each provider.

The NCI (6423 clusters) comes first in this ranking, followed by InterBioScreen (4047 clusters), Chembridge (4042 clusters), ChemDiv (3888 clusters), Enamine (3880 clusters) and Sigma-Aldrich (3720 clusters). In figure 6, the most diverse were NCI, Enamine, ChemDiv, InterBioScreen, Sigma-Aldrich and Chembridge. The ranking of the databases by diversity therefore depends on the chemical space studied. The last database in the ranking is AnalytiCon Discovery with 46 clusters, which is simply due to the nature of this database. We used the NatDiverse database of AnalytiCon Discovery, in which one natural product scaffold can be used to synthesise 500-1500 compounds.

CONCLUSION

We have developed a system to easily manage screening sets. The libraries of 32 providers have been inserted in our database, and the system allows new compounds to be added from providers’ updates. All the compounds in the database are flagged as drug-like or lead-like, and personalised rules can be defined to extract screening sets.
We are currently working on Screening Assistant, a graphical user interface to the code used for this study [19]. The next step of this work is to compute 3D structures for all the compounds in the database. Furthermore, QSPR models of Caco-2 permeation and water solubility are planned. The last step will be the implementation of automatic screening using QSAR and docking prediction models.


REFERENCES

1. The IUPAC International Chemical Identifier Project. http://www.iupac.org/projects/2000/2000-025-1-800.html.
2. Mozziconacci, J. C.; Arnoult, E.; Baurin, N.; Marot, C.; Morin-Allory, L. Preparation of a molecular database from a set of 2 million compounds for virtual screening applications: gathering, structural analysis and filtering. 9th Electronic Computational Chemistry Conference (ECCC9), 03-2003 - Internet and World Wide Web.
3. Wegner, J. K. JOELib. http://joelib.sourceforge.net/.
4. Groupement De Service Chimiothèque Nationale. http://chimiotheque.ujf-grenoble.fr/.
5. Reynolds, C. H.; Druker, R.; Pfahle, L. B. Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds. J. Chem. Inf. Comput. Sci. 1998, 38, 305-312.
6. Voigt, J. H.; Bienfait, B.; Wang, S.; Nicklaus, M. C. Comparison of the NCI Open Database with Seven Large Chemical Structural Databases J. Chem. Inf. Comput. Sci. 2001, 41, 702-712.
7. Xue, L.; Godden, J.W.; Bajorath, J. Database Searching for Compounds with Similar Biological Activity Using Short Binary Bit String Representations of Molecules. J. Chem. Inf. Comput. Sci. 1999, 39, 881-886.
8. Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. 1. Molecular Frameworks. J.Med.Chem 1996, 39, 2887-2893.
9. Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23, 3-25.
10. Lipinski, C.A. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today 2004, 1, 337-341.
11. Sadowski, J.; Kubinyi, H. A scoring scheme for discriminating between drugs and nondrugs. J.Med.Chem 1998, 41, 3325-3329.
12. Ajay, A; Walters, W.P.; Murcko, M.A. Can we learn to distinguish between "drug-like" and "nondrug-like" molecules? J.Med.Chem 1998, 41, 3314-3324.
13. Murcia-Soler, M.; Pérez-Giménez, F.; García-March, F.J.; Salabert-Salvador, M.T.; Diaz-Villanueva, W.; Castro-Bleda M.J. Drugs and Nondrugs: An Effective Discrimination with Topological Methods and Artificial Neural Networks. J. Chem. Inf. Comput. Sci. 2003, 43, 1688-1702.
14. Oprea, T.I. Property distribution of drug-related chemical databases. J. Comput. Aided Mol. Des. 2000, 14, 251-264.
15. Rishton, G.M. Reactive compounds and in vitro false positives in HTS. DDT 1997, 2, 382-384.
16. Hann, M. M.; Oprea, T. I. Pursuing the leadlikeness concept in pharmaceutical research. Curr Opin Chem Biol 2004, 8, 255-263.
17. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.
18. Charifson, P.S.; Walters W.P. Filtering databases and chemical libraries J. Comput. Aided Mol. Des. 2002, 16, 311-323.
19. http://www.univ-orleans.fr/icoa/screeningassistant/.
20. Wenlock, M.C.; Austin, R.P.; Barton, P.; Davis, A.M.; Leeson P.D. A Comparison of Physiochemical Property Profiles of Development and Marketed Oral Drugs J. Med. Chem. 2003, 46, 1250-1256.