ANALYSIS OF A SET OF 2.6 MILLION UNIQUE COMPOUNDS GATHERED FROM THE
LIBRARIES OF 32 CHEMICAL PROVIDERS
Aurélien Monge, Alban Arrault, Christophe Marot and Luc Morin-Allory

ABSTRACT
3.8 million compounds from the structural databases of 32 providers
have been gathered and stored in our database. Once the duplicates are
removed using the InChI [1], 2.6 million compounds remain. The 32
databases and the whole database were studied in terms of uniqueness,
diversity, frameworks, and drug-like and lead-like properties. This
study shows that there are more than 87 000 frameworks in our database.
There are 1.9 million drug-like molecules, of which more than 900 000
are lead-like. Druglikeness and leadlikeness are estimated with
in-house scores that use functions to estimate how well a compound fits
each property, rather than hard cut-off values. The compounds are
stored in a MySQL database and the code that manages this database is
written in Java. As a result, we have a free and easily updatable
system for chemical database management and screening set generation.
Keywords: Chemoinformatics, chemical databases, screening, drug-like,
lead-like, fingerprints, diversity.
In a virtual or real screening project, choosing the molecules to test
is a very important step. Depending on the targeted number of
molecules, one or several appropriate providers must be chosen, namely
those whose databases are the most representative of all the available
compounds. Moreover, the lead-like and drug-like compounds must be
identified.
Our laboratory has already combined databases [2]. We have improved the
management of the resulting database by using database management
software. We chose free tools, namely MySQL and Java. Our system now
offers an easy way to update the whole database using update files from
providers.
The database now contains 2.6 million unique compounds gathered from 32
providers. We used our system to analyse the providers’ databases. The
uniqueness, diversity, frameworks, druglikeness, leadlikeness and the
distribution of druglikeness-related properties were studied.
The MySQL database server runs on a Linux PC. All the code is written
in Java, except for one method written in C and called through the Java
Native Interface (JNI), which allows batch operations to be launched
properly on Windows. The JOELib [3] Java API was used for all the
operations on chemical structures and for the calculation of
descriptors.
The structural data (SDF, SMILES, InChI...) were stored separately from
the descriptors in order to speed up SQL queries on descriptors.
UNIQUENESS
Each imported compound has its counter-ion removed, if necessary, and
is reprotonated. The resulting structure is added if it is not already
present in the database.
To characterise the chemical structures we chose to use the IUPAC
International Chemical Identifier (InChI) [1]. InChI is free and was
designed to become a standard unique code for molecules. It is a single
line of text which encodes the sp2 and sp3 stereochemistry, isotopes
and tautomerism.
One of the reasons for the presence of duplicates in databases is
undefined or badly defined stereochemistry. Note also that counter-ions
are not taken into account when checking for duplicates.
We use the 1.12 Beta version of InChI, which has only basic support for
aromatic bonds because the MDL mol format description specifies this
bond type for queries only. As JOELib encodes some structures using the
aromatic bond type, no InChI code can be computed for these structures
and they are considered as unique. The structures without unique codes
represent less than 2 % of the database, so the percentage of
duplicates due to this problem should be very low. Support for the
aromatic bond type was improved in later versions.
The database is also designed to store the list of all the providers of
a given compound. When a molecule is inserted and the compound is
already in the database, the structure is not added again. However, if
the provider of the inserted molecule is not in the providers list of
this compound, the provider is added to this list.
Moreover, if two molecules in the same file are detected as duplicates,
this information is kept, which allows internal duplicates in
providers’ databases to be counted. However, if two molecules from two
different files of the same provider are detected as duplicates, they
are not considered as internal duplicates. In this case no information
is kept about the second molecule added. This choice was made so that
the database can be updated efficiently. Indeed, providers can release
new compounds in an updated file that also contains all the former
molecules, and these molecules obviously should not be counted as
internal duplicates.
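As an illustration only, the insertion rules described above can be
sketched with an in-memory registry. This is not the code used for this
study: the class name, the method names and the use of hash maps instead
of MySQL tables are assumptions made for the example, and the InChI
string in the usage note is a placeholder.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the compound table; all names are hypothetical. */
class CompoundRegistry {
    // InChI -> providers that list the compound
    private final Map<String, Set<String>> providersByInchi = new HashMap<>();
    // InChI codes already seen in the file currently being imported
    private final Set<String> seenInCurrentFile = new HashSet<>();
    // provider -> number of duplicates found inside a single file
    private final Map<String, Integer> internalDuplicates = new HashMap<>();

    /** Must be called before each provider file is imported. */
    void startFile() {
        seenInCurrentFile.clear();
    }

    /** Returns true if the structure is new and was added. */
    boolean insert(String inchi, String provider) {
        if (!seenInCurrentFile.add(inchi)) {
            // Same structure twice in the same file: internal duplicate.
            internalDuplicates.merge(provider, 1, Integer::sum);
            return false;
        }
        Set<String> providers = providersByInchi.get(inchi);
        if (providers == null) {
            providers = new HashSet<>();
            providers.add(provider);
            providersByInchi.put(inchi, providers);
            return true;
        }
        // Known structure: only the providers list may grow. Nothing is
        // recorded when the provider is already listed (updated file).
        providers.add(provider);
        return false;
    }

    Set<String> providersOf(String inchi) {
        return providersByInchi.getOrDefault(inchi, new HashSet<>());
    }

    int internalDuplicatesOf(String provider) {
        return internalDuplicates.getOrDefault(provider, 0);
    }
}
```

Importing the same structure twice from one file of provider A counts
one internal duplicate; importing it again from a later file of A
counts nothing; importing it from provider B only extends the providers
list.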
To check whether a structure is present in the database, the MD5
hashcode of the InChI of the molecule is compared to the indexed MD5
hashcodes of the InChI of all the compounds in the database. If two
structures have the same MD5 hashcode, their InChI codes are compared
to check that they are actual duplicates, even though the probability
that two structures have the same MD5 hashcode but different InChI
codes is very low (there are 16^32 possible MD5 hashcodes). Using the
MD5 hashcode makes it possible to search for duplicates in a smaller
table with fixed-size rows, which is more efficient.
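A minimal sketch of this two-step check is given below, assuming the
hashcode is the hexadecimal MD5 digest of the InChI string and
replacing the indexed MySQL column by a hash map. The class and method
names are hypothetical; only java.security.MessageDigest is a real API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InchiIndex {
    // MD5 hex digest -> InChI codes sharing that digest (normally one)
    private final Map<String, List<String>> byHash = new HashMap<>();

    /** 32 hexadecimal characters, i.e. one of 16^32 possible values. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    /** Returns true if the InChI was new, false if already indexed. */
    boolean addIfAbsent(String inchi) {
        List<String> bucket =
                byHash.computeIfAbsent(md5Hex(inchi), k -> new ArrayList<>());
        // Same hash: compare the full InChI codes to rule out a collision.
        if (bucket.contains(inchi)) {
            return false;
        }
        bucket.add(inchi);
        return true;
    }
}
```

The fixed length of the digest is what gives the fixed-size rows
mentioned above, whereas InChI strings vary in length.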
DATABASES
The databases have been gathered from the web or obtained through
collaborations. We have obtained the databases of 32 providers (table
1).
Table 1: List of providers with the number of molecules present in the
database. (* internal duplicates are not evaluated)
The database files of these 32 providers contain a total of 3.8 million
molecules. These compounds come from classical organic synthesis,
combinatorial chemistry or the extraction of natural compounds. The
ICOA database is our corporate database, and is included in the French
“Chimiothèque Nationale” (Chem. Nat.), which gathers the databases of
17 French public laboratories [4].
DIVERSITY
We have used the dissimilarity step of the Stochastic Cluster Analysis
(SCA) [5] to identify the number of clusters, as was done in a study of
the NCI database [6]. As the clusters are created by diversity, the
number of clusters gives us information about the diversity of the
database. The descriptors used and stored in the database are the
SSKey-3DS [7] fingerprints. The SSKey-3DS consists of 32 bits coding
for the presence or absence of 32 fragments, and 22 bits which encode
the numerical values of three descriptors: H bond acceptors, aromatic
bonds, and fraction of rotatable bonds. We have programmed this
algorithm in Java. The number of clusters of the whole database is
computed within one hour on a standard PC.
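The idea of counting clusters created by diversity can be illustrated
with a generic dissimilarity selection on bit-string fingerprints. The
sketch below is not the SCA implementation used here: the Tanimoto
coefficient, the threshold value and the processing of compounds in
input order (SCA draws them randomly) are assumptions made for the
example. A 54-bit fingerprint such as the SSKey-3DS fits in a Java
long.

```java
import java.util.ArrayList;
import java.util.List;

class DissimilaritySketch {
    /** Tanimoto coefficient of two fingerprints packed in a long. */
    static double tanimoto(long a, long b) {
        int common = Long.bitCount(a & b);
        int union = Long.bitCount(a | b);
        return union == 0 ? 1.0 : (double) common / union;
    }

    /**
     * Keeps a fingerprint as a new cluster seed only if its similarity
     * to every seed kept so far is below the threshold. The number of
     * seeds is the number of clusters.
     */
    static List<Long> selectSeeds(long[] fingerprints, double threshold) {
        List<Long> seeds = new ArrayList<>();
        for (long fp : fingerprints) {
            boolean farFromAll = true;
            for (long seed : seeds) {
                if (tanimoto(fp, seed) >= threshold) {
                    farFromAll = false;
                    break;
                }
            }
            if (farFromAll) {
                seeds.add(fp);
            }
        }
        return seeds;
    }
}
```

With the toy fingerprints 1111, 1110 and 110000 (binary) and a
threshold of 0.7, the second one is absorbed by the first (Tanimoto
0.75) and two seeds remain.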
FRAMEWORKS
In a study of the shapes of drug-like compounds, Bemis and Murcko used
the notion of graph frameworks, which correspond to ring systems
connected to each other by linkers [8].
To obtain the graph framework of the hydrogen-depleted structure of a
compound, all the atoms of the molecule are replaced by “non-typed”
atoms and all the bonds are replaced by “non-typed” bonds. Then, all
atoms connected to only one other atom are removed. This step is
repeated until no more atoms are deleted (figure 1).
Figure 1: visualisation of the framework algorithm: 1) hydrogens are
removed, 2) atoms with only one bond are removed, and this step is
repeated until only atoms with two or more bonds remain, 3) all atom
types are set to C, and all bond types except aromatic are set to
single.
In our implementation of this algorithm, non-typed atoms are replaced
by C atoms and non-typed bonds are represented by single bonds but,
unlike in the Bemis method, aromatic bonds are differentiated from
non-aromatic bonds. The advantage of this representation is that it can
be stored as a structure and displayed with a molecular viewer.
Furthermore, we can compute the InChI of the frameworks and store only
unique frameworks.
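The pruning step can be sketched on a plain graph, without any
chemistry toolkit. The code below is an illustration under stated
assumptions, not the JOELib-based implementation: atoms are integer
indices of a hydrogen-depleted structure, bonds are index pairs, and
atoms left without any neighbour are removed as well, so that an
acyclic molecule yields an empty framework.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FrameworkSketch {
    /**
     * Returns the atoms left once terminal atoms have been pruned
     * repeatedly. bonds[i] = {a, b} is one bond of the graph.
     */
    static Set<Integer> frameworkAtoms(int atomCount, int[][] bonds) {
        Map<Integer, Set<Integer>> neighbours = new HashMap<>();
        for (int atom = 0; atom < atomCount; atom++) {
            neighbours.put(atom, new HashSet<>());
        }
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }
        boolean removedSomething = true;
        while (removedSomething) {
            // Collect first, remove afterwards: one pass is one step.
            List<Integer> terminal = new ArrayList<>();
            for (Map.Entry<Integer, Set<Integer>> e : neighbours.entrySet()) {
                if (e.getValue().size() <= 1) {
                    terminal.add(e.getKey());
                }
            }
            for (int atom : terminal) {
                for (int other : neighbours.remove(atom)) {
                    Set<Integer> otherNeighbours = neighbours.get(other);
                    if (otherNeighbours != null) {
                        otherNeighbours.remove(atom);
                    }
                }
            }
            removedSomething = !terminal.isEmpty();
        }
        return neighbours.keySet();
    }
}
```

For toluene (a six-membered ring plus one substituent atom) the six
ring atoms remain; for a four-atom chain nothing remains. Setting atom
and bond types, and keeping aromatic bonds distinct, would be done on
the surviving atoms afterwards.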
DRUG-LIKE AND LEAD-LIKE PROPERTIES
The Lipinski rules [9] are the most widely used to identify drug-like
compounds [10]. Other techniques based on artificial neural networks
have also been used [11], [12], [13].
First of all, compounds with atoms C, O, N, S, P, F, Cl, Br, I, Na, K,
Mg, Ca, or Li are flagged. Next, we have used the filters described in
another study from our laboratory [2]:
- 100 ≤ molecular weight ≤ 800 g.mol-1
- logP ≤ 7
- H donors ≤ 5
- rotatable bonds ≤ 15
- no reactive functions (eliminate false positives)
- halogen atoms ≤ 7
- alkyl chains ≤ -(CH2)6CH3
- no perfluorinated chains: -CF2CF2CF3
- rings ≤ 6
- no big size ring with more than 7 members
- at least one N or O atom
The definition of the reactive functions used is the version modified
by Oprea [14] of the list published by Rishton [15].
A recent review by Hann and Oprea gives rules for selecting lead-like
molecules [16]. In our database a compound is lead-like if it is
drug-like, with H bond acceptors (HBA) ≤ 9, molecular weight (MW) ≤
460, -4 ≤ logP ≤ 4.2, rotatable bonds ≤ 10, and smallest set of
smallest rings (SSSR) ≤ 4. The following definitions are used:
- HBA: SMARTS [17] code:
[$([#7,#8,#15,#16]);!$([o,s,nX3,#7v5,#15v5,#16v4,#16v6])]
- HBD: SMARTS code: [!$([#6,H0,-,-2,-3])]
- Rotatable bonds: the definition of JOELib [3] is used: « Number of
rotatable bonds, where the atoms are heavy atoms with bond orders one
and a hybridization which is not one (no sp). Additionally the bond is
a non-ring-bond. »
On the basis of these rules, we have evaluated the drug-like and
lead-like properties using two methods. The first method counts the
number of criteria that are not met; the second method computes a
progressive score based on these criteria.
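The first method amounts to counting violated cut-offs. A minimal
sketch for the lead-like thresholds given above (HBA ≤ 9, MW ≤ 460, -4
≤ logP ≤ 4.2, rotatable bonds ≤ 10, SSSR ≤ 4) follows. The class and
method names are hypothetical, the descriptor values are assumed to be
already computed, and the drug-like filters that a lead-like compound
must also satisfy are left out.

```java
class LeadLikeCount {
    /** Number of lead-like cut-offs that the compound violates. */
    static int leadLikeFailures(double molecularWeight, double logP,
                                int hBondAcceptors, int rotatableBonds,
                                int sssr) {
        int failures = 0;
        if (hBondAcceptors > 9) {
            failures++;
        }
        if (molecularWeight > 460.0) {
            failures++;
        }
        if (logP < -4.0 || logP > 4.2) {
            failures++;
        }
        if (rotatableBonds > 10) {
            failures++;
        }
        if (sssr > 4) {
            failures++;
        }
        return failures;
    }
}
```

A compound exactly on every limit has no failure, whereas a compound
with MW 500 and logP 5.0 has two, however close to the limits it is.
This all-or-nothing behaviour is what the progressive score is meant to
soften.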

- HBD: the intermediate penalty zone stretches from 3.5 to 6.5 (former
cut-off: 5).
- HBA: the intermediate penalty zone extends from 7 to 13 for the
druglikeness (former cut-off: 10) and from 6.3 to 11.7 for the
leadlikeness (former cut-off: 9).
- Number of rotatable bonds: the intermediate penalty zone extends from
10.5 to 19.5 for the druglikeness (former cut-off: 15) and from 7 to 13
for the leadlikeness (former cut-off: 10).
- Number of SSSRs: the intermediate penalty zone stretches from 4.2 to
7.8 for the druglikeness (former cut-off: 6) and from 2.8 to 5.2 for
the leadlikeness (former cut-off: 4).
- Maximum ring size: the intermediate penalty zone extends from 6
(6-member rings have no penalty) to 9.1 (former cut-off: 7).
- Number of halogens: the intermediate penalty zone stretches from 4.9
to 9.1 (former cut-off: 7).
- Molecular weight: the lower intermediate penalty zone extends from
100 to 150 (based on the marketed drug weight distribution [20]). The
upper intermediate penalty zone stretches from 350 to 800 for the
druglikeness (500 - 30 %; the former limit of 800 is kept because it
was already very permissive) and from 322 to 588 for the leadlikeness
(former cut-off: 460 [16]).
- LogP: the lower intermediate penalty zone extends from -5 to -1.5;
the upper intermediate penalty zone stretches from 4.5 to 7.5 for the
druglikeness (based on the marketed drug logP distribution [20]) and
from 2.9 to 5.5 for the leadlikeness (former cut-off: 4.2 [16]).
Figure 2: details of the penalty functions.
Linear equations for each criterion are used to compute a drug-like
score. For each property the minimum penalty is 0 (no penalty) and the
maximum penalty is 1 (figure 2). We tolerate one unsatisfied property
and, in consequence, molecules with a drug-like score ≤ 1 are
considered as drug-like. Some properties add 2 to the score, and
therefore make the compound non drug-like as soon as one of them is not
satisfied, even if all the other properties are satisfied. These
properties are:
- the presence of a reactive function
- the presence of a single chain > -(CH2)6CH3
- the presence of a perfluorinated chain
- no O or N atom
Therefore a molecule with a score ≥ 2 is definitely non drug-like. This
method has two advantages over counting the number of unsatisfied
drug-like criteria. Firstly, a compound with a property value slightly
above a given cut-off is not eliminated but is given a score penalty.
Secondly, this method gives a progressive value of the druglikeness,
which allows the compounds to be sorted.
The lead-like score is designed exactly like the drug-like score, but
uses the lead-like properties where they differ from the drug-like
ones.
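A possible reading of this scoring scheme is sketched below, using the
drug-like penalty zones listed with figure 2. It is an illustration,
not the in-house code: the names are hypothetical, only four of the
criteria are included to keep the example short, and the assumption is
made that a lower zone (such as 100 to 150 for molecular weight) gives
the full penalty at its lower end and no penalty at its upper end.

```java
class PenaltySketch {
    /** Penalty rising linearly from 0 at zoneStart to 1 at zoneEnd. */
    static double upperPenalty(double value, double zoneStart,
                               double zoneEnd) {
        double p = (value - zoneStart) / (zoneEnd - zoneStart);
        return Math.max(0.0, Math.min(1.0, p));
    }

    /** Penalty falling linearly from 1 at zoneStart to 0 at zoneEnd. */
    static double lowerPenalty(double value, double zoneStart,
                               double zoneEnd) {
        return 1.0 - upperPenalty(value, zoneStart, zoneEnd);
    }

    /** Partial drug-like score built from four of the criteria only. */
    static double partialDrugLikeScore(double hbd, double rotatableBonds,
                                       double molecularWeight,
                                       boolean hasReactiveFunction) {
        double score = upperPenalty(hbd, 3.5, 6.5)
                + upperPenalty(rotatableBonds, 10.5, 19.5)
                + lowerPenalty(molecularWeight, 100.0, 150.0)
                + upperPenalty(molecularWeight, 350.0, 800.0);
        if (hasReactiveFunction) {
            score += 2.0; // eliminatory criterion
        }
        return score;
    }

    /** One fully unsatisfied property is tolerated. */
    static boolean isDrugLike(double score) {
        return score <= 1.0;
    }
}
```

With these zones, 5 H bond donors give a penalty of 0.5 instead of
sitting exactly on the former cut-off, and any reactive function pushes
the score to at least 2.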
The selection of compounds with a nitro group is not recommended,
because the nitro group can cause false positives [18]. Although the
nitro group is not in our default frequent hitters list, all nitro
compounds are flagged and can easily be removed if desired.
We want to highlight that there are no absolute lead-like and drug-like
rules. They depend mainly on the project and on the type of test used.
Although we have chosen parameters for each of these rules, our system
allows them to be changed easily in order to extract a new dataset of
compounds. In addition to the classical parameters, we can eliminate
molecules with unwanted substructures from the dataset.
DISTRIBUTION
The four providers with the greatest number of available compounds
(figure 3) are ChemDiv, InterBioScreen, ChemBridge and Enamine.
However, the originality of the structures of each provider must also
be assessed in order to compare them.
Figure 3: distribution of the compounds of the whole database by
provider.
First of all, we must note that some databases can be, in part or in
full, compilations of other databases. The ICOA and Chem. Nat.
databases have been treated differently from the other databases in
this study, because we know that the ICOA database (our corporate
database) is included in the Chem. Nat. database. Consequently, for
each of these two databases the molecules they have in common are not
considered as duplicates (figure 4).

Figure 4: for each provider, percentage of compounds not present in
other providers’ databases.
BioFocus (100 %), AnalytiCon Discovery (99.96 %), ACB Blocks (97.9 %),
Tripos (97 %) and ICOA (92.2 %) have the highest percentages of
original compounds. Except for the very big databases, there is no
direct relationship between the size of a database and its percentage
of original compounds. Indeed, AnalytiCon Discovery and ICOA are
relatively small, with 5438 and 2811 compounds respectively, whereas
ACB Blocks and Tripos are larger, with 61237 and 82370 compounds. The
four biggest databases have between 36 and 85 % of original compounds.
Figure 5: percentage of internal duplicates for each provider.
The Sigma-Aldrich (9.3 %), NCI (6.0 %), MDPI (5.3 %) and Arkive (3.6 %)
databases have the highest percentages of internal duplicates (figure
5). No duplicates were found in the databases of ACB Blocks, AnalytiCon
Discovery, BioFocus and Tripos. The size of a library is definitely not
linked to its number of internal duplicates. The best example is the
ChemDiv library, which is the biggest library and has only 0.02 % of
internal duplicates.
DIVERSITY
The chemical space covered by a database is essential information. We
used the dissimilarity step of the SCA algorithm with SSKey-3DS
fingerprints to compute the number of clusters for the whole database
and for each provider (figure 6). The NCI database is clearly the most
representative of the chemical space and covers 59 % of the chemical
space of the whole database. However, this database cannot be
considered as a commercial database. After the NCI, Enamine (37 %),
ChemDiv (36 %), InterBioScreen (35 %), Sigma-Aldrich (35 %) and
ChemBridge (34 %) are the databases which are the most representative
of the global diversity. The least representative database is Array
BioPharma (0.5 %), which is to be expected because it is the smallest
database (517 compounds).
Figure 6: diversity relative to the whole database for each provider.
We expect the biggest databases to be also the most diverse. We have
studied the relationship between the number of compounds in a database
and the diversity (number of clusters) of this database. Figure 7 shows
a rapid linear increase for the databases with fewer than 100 000
compounds; for the databases with more than 150 000 compounds the
increase in diversity is slower. The NCI, with 10 623 clusters for
250 000 compounds, is an outlier.

Figure 7: increase of the diversity with the size of the databases.
Figure 8: percentage of the frameworks of the whole database
represented by each provider.
87 000 frameworks were found in the whole database. Figure 8 shows the
percentage of the frameworks of the whole database found for each
provider. Unlike in the diversity study, the NCI is not the most
representative of the whole database but comes in fifth position
(19 %). Enamine is first (33 %), followed by ChemDiv (26 %) and
InterBioScreen (23 %). Among the commercial databases, the three with
the largest numbers of frameworks are also the three most diverse in
the previous section, in the same order. The least representative
databases are Array BioPharma (0.02 %), Chemical Block (0.29 %) and
Prestwick (0.39 %). Figure 9 shows that the number of frameworks is
highly correlated with the size of the databases, with R² = 0.89; this
correlation explains the previous results.

Figure 9: linear regression to link the number
of frameworks to the number of compounds (R² = 0.89).
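For reference, the R² of a simple linear regression of y on x equals
the squared correlation coefficient, Sxy² / (Sxx · Syy). A minimal
sketch of this computation is given below; the class name is
hypothetical, and the values used to exercise it are toy numbers, not
the data of figure 9.

```java
class RegressionSketch {
    /** R² of the least-squares line fitted to the points (x[i], y[i]). */
    static double rSquared(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxx = 0.0;
        double sxy = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            sxy += dx * dy;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }
}
```

Here x would hold the number of compounds of each database and y its
number of frameworks. Perfectly aligned points give 1; the toy points
(1, 1), (2, 3), (3, 2) give 0.25.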
DRUG-LIKE
We have studied the "drug-like" properties of the databases with two
approaches: a classic one which, for each compound, counts the number
of violations of the limits of the rules, and a second one which uses
the score presented in the previous section.
For each provider, the numbers of molecules with 0, 1, 2 and more than
2 drug-like failures are shown in figure 10.
Figure 10: percentage of drug-like failures for each provider’s
database.
All the libraries have a high proportion of molecules with 0 or 1
drug-like failures. The library with the lowest percentage of molecules
without any drug-like failure is Tripos, with 55 %. ACB Blocks (91 %),
Aurora (89 %), InterBioScreen (87 %) and Chemical Block (87 %) are the
libraries with the highest percentages of compounds without drug-like
failures. Among the libraries, only Array BioPharma has 0 % of
molecules with 2 or more drug-like failures.
The other method used to estimate drug-like properties is the drug-like
score (figure 11).

Figure 11: percentage of drug-like scores for each provider’s database.
If we consider as “drug-like” the compounds with a drug-like score ≤ 1,
the providers with the largest percentages of drug-like compounds are
ACB Blocks (88 %), Chemical Block (88 %) and Aurora (87 %). The
distribution of the drug-like score in the whole database is shown in
figure 12.
Figure 12: drug-like score distribution.
The relative importance of the drug-like filters is shown in figure 13.
Many compounds (14 %) are removed because of reactive functions.

Figure 13: influence of drug-like filters.
LEAD-LIKE
The lead-like failures are shown in figure 14.
Figure 14: percentage of lead-like failures for each provider’s
database.
If we consider the molecules without lead-like failures as lead-like,
Chemical Block (79 %) and Array BioPharma (73 %) are the most
“lead-like” libraries. AnalytiCon Discovery (19 %) and Tripos (22 %)
are the databases with the fewest lead-like compounds. These results
are consistent with the lead-like score presented in figure 15.
Figure 15: percentage of lead-like scores for each provider’s database.
If we consider the compounds with lead-like scores ≤ 1 as lead-like,
the conclusions are the same as when selecting compounds without
lead-like failures. Chemical Block (80 %) and Array BioPharma (76 %)
have the largest percentages of lead-like compounds, and AnalytiCon
Discovery (10 %) and Tripos (20 %) have the fewest lead-like compounds.
Figure 16 shows that the distribution of the lead-like score is
linearly progressive over the whole database. Consequently, this
function can be very useful to sort compounds by leadlikeness.
Figure 16: lead-like score distribution.
Figure 17 shows that the logP filter is the most selective of the
lead-like filters and removes 48 % of the compounds.
Figure 17: influence of lead-like filters.
DIVERSITY IN THE LEAD-LIKE SPACE
We have already compared the chemical space covered by each database by
counting the number of clusters created by diversity in a database.
However, part of the diversity of a database can come from compounds
which are not drug-like. We present here a second study of the coverage
of the diversity space by the databases, this time limited to the
lead-like space (figure 18).
Figure 18: lead-like space of the whole database covered by each
provider.
The NCI (6423 clusters) comes first in this ranking, followed by
InterBioScreen (4047 clusters), ChemBridge (4042 clusters), ChemDiv
(3888 clusters), Enamine (3880 clusters) and Sigma-Aldrich (3720
clusters). In figure 6, the most diverse were NCI, Enamine, ChemDiv,
InterBioScreen, Sigma-Aldrich and ChemBridge. So we can see that the
ranking of the databases by diversity depends on the chemical space
studied. The last database in the ranking is AnalytiCon Discovery with
46 clusters, which is simply due to the nature of this database. We
used the NatDiverse database of AnalytiCon Discovery, in which one
natural product scaffold can be used to synthesise 500-1500 compounds.
We have developed a system to easily manage screening sets. The
libraries of 32 providers have been inserted in our database, and the
system allows new compounds to be added from providers’ updates. All
the compounds in the database are flagged as drug-like or lead-like,
and custom rules can be defined to extract screening sets.
We are currently working on Screening Assistant [19], a GUI for the
code used in this study. The next step of this work is to compute 3D
structures for all the compounds in the database. Furthermore, we plan
to add QSPR models of Caco-2 permeation and water solubility. The last
step will be the implementation of automatic screening using QSAR and
docking prediction models.
1. The IUPAC International Chemical Identifier Project. http://www.iupac.org/projects/2000/2000-025-1-800.html.
2. Mozziconacci, J. C.; Arnoult, E.; Baurin, N.; Marot, C.; Morin-Allory, L. Preparation of a molecular database from a set of 2 million compounds for virtual screening applications: gathering, structural analysis and filtering. 9th Electronic Computational Chemistry Conference (ECCC9), 03-2003 - Internet and World Wide Web.
3. Wegner, J. K. JOELib. http://joelib.sourceforge.net/.
4. Groupement De Service Chimiothèque Nationale. http://chimiotheque.ujf-grenoble.fr/.
5. Reynolds, C. H.; Druker, R.; Pfahler, L. B. Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds. J. Chem. Inf. Comput. Sci. 1998, 38, 305-312.
6. Voigt, J. H.; Bienfait, B.; Wang, S.; Nicklaus, M. C. Comparison of the NCI Open Database with Seven Large Chemical Structural Databases. J. Chem. Inf. Comput. Sci. 2001, 41, 702-712.
7. Xue, L.; Godden, J. W.; Bajorath, J. Database Searching for Compounds with Similar Biological Activity Using Short Binary Bit String Representations of Molecules. J. Chem. Inf. Comput. Sci. 1999, 39, 881-886.
8. Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887-2893.
9. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23, 3-25.
10. Lipinski, C. A. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today 2004, 1, 337-341.
11. Sadowski, J.; Kubinyi, H. A scoring scheme for discriminating between drugs and nondrugs. J. Med. Chem. 1998, 41, 3325-3329.
12. Ajay, A; Walters, W. P.; Murcko, M. A. Can we learn to distinguish between "drug-like" and "nondrug-like" molecules? J. Med. Chem. 1998, 41, 3314-3324.
13. Murcia-Soler, M.; Pérez-Giménez, F.; García-March, F. J.; Salabert-Salvador, M. T.; Diaz-Villanueva, W.; Castro-Bleda, M. J. Drugs and Nondrugs: An Effective Discrimination with Topological Methods and Artificial Neural Networks. J. Chem. Inf. Comput. Sci. 2003, 43, 1688-1702.
14. Oprea, T. I. Property distribution of drug-related chemical databases. J. Comput. Aided Mol. Des. 2000, 14, 251-264.
15. Rishton, G. M. Reactive compounds and in vitro false positives in HTS. Drug Discov. Today 1997, 2, 382-384.
16. Hann, M. M.; Oprea, T. I. Pursuing the leadlikeness concept in pharmaceutical research. Curr. Opin. Chem. Biol. 2004, 8, 255-263.
17. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.
18. Charifson, P. S.; Walters, W. P. Filtering databases and chemical libraries. J. Comput. Aided Mol. Des. 2002, 16, 311-323.
19. http://www.univ-orleans.fr/icoa/screeningassistant/.
20. Wenlock, M. C.; Austin, R. P.; Barton, P.; Davis, A. M.; Leeson, P. D. A Comparison of Physiochemical Property Profiles of Development and Marketed Oral Drugs. J. Med. Chem. 2003, 46, 1250-1256.