Chemistry and Biology Databases

A number of very large virtual libraries of make-on-demand compounds are now available. Providers include Enamine (70 billion molecules), WuXi (200 million molecules) and Liverpool ChiroChem (1 billion 3D-rich molecules).

Probably the largest database of commercial compounds is eMolecules, the “Google for Molecules”. Draw chemical structures using drawing packages such as JME (embedded applet), ISIS/Draw, ChemDraw or ChemSketch, and then instantly search over 8.0 million unique chemical structures from more than 140 leading chemical suppliers. Search results include reference links to properties and spectra from sources such as DrugBank, National Cancer Institute, NIST WebBook, PubChem, EPA and more.

ZINC is a free database of commercially available compounds, ideal for virtual screening. One really nice feature is the set of property-defined subsets, such as lead-like (Teague, Davis, Leeson, Oprea, Angew Chem Int Ed Engl. 1999 Dec 16;38(24):3743-3748.) and drug-like (Lipinski, J Pharmacol Toxicol Methods. 2000 Jul-Aug;44(1):235-49.). These can all be downloaded and searched locally.

MMsINC is a free web-oriented database of commercially available compounds for virtual screening and chemoinformatic applications. MMsINC contains over 4 million non-redundant chemical compounds in 3D formats. MMsINC is provided by the Molecular Modeling Section in the Department of Pharmaceutical Sciences at the University of Padova (Italy) in collaboration with the Software Support Services & Development Laboratory (S3D) at the Center for Advanced Studies, Research and Development (CRS4) in Sardinia.

The largest chemical database is PubChem, which is organized as three linked databases within the NCBI’s Entrez information retrieval system: PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. The database also contains a variety of calculated physicochemical properties for each molecule. Many compounds have links to primary literature or patents, and increasingly other databases are providing links to PubChem.
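Structure similarity searches of this kind are commonly scored with the Tanimoto coefficient over binary fingerprints. The following is a minimal sketch of that calculation, not PubChem's own implementation: fingerprints are represented as sets of on-bit positions, and the function name `tanimoto` and the bit values are made up for illustration.

```python
from __future__ import annotations


def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|, with the union size computed from the set sizes
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints (on-bit positions), not real PubChem data
query = {1, 4, 7, 9, 15}
candidate = {1, 4, 8, 9, 15, 20}

score = tanimoto(query, candidate)
print(f"Tanimoto similarity: {score:.3f}")
```

A score of 1.0 means the two fingerprints have identical on-bits; a similarity search typically keeps candidates above a chosen threshold.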

ChemSpider is a free-access service providing a structure-centric community for chemists. Providing access to millions of chemical structures and integration with a multitude of other online services, ChemSpider is the richest single source of structure-based chemistry information and an invaluable source of spectral information. It also hosts a property prediction service.

The Worldwide Protein Data Bank (wwPDB) was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. It consists of organizations that act as deposition, data processing and distribution centers for PDB data. The wwPDB partners are RCSB PDB, PDBe, PDBj, BMRB and EMDB.

Guide to Pharmacology, created in a collaboration between the British Pharmacological Society (BPS) and the International Union of Basic and Clinical Pharmacology (IUPHAR) and now developed jointly with funding from the Wellcome Trust, is intended to become a “one-stop shop” portal to pharmacological information.

BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. BindingDB contains 2.9M data points for 1.3M compounds and 9.3K targets. Of those, 1,397K data points for 655K compounds and 4.5K targets.

Chem-TCM is a digital database of individual molecules that are constituents of plants used in traditional Chinese herbal medicine. The database consists of four major parts: chemical identification, botanical information, predicted activity against common Western therapeutic targets, and estimated molecular activity according to traditional Chinese herbal medicine categories.

ChEMBL is a database of bioactive drug-like small molecules. It contains 2D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data).

The PDBbind database is designed to provide a collection of experimentally measured binding affinity data (Kd, Ki, and IC50) exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB). All of the binding affinity data compiled in this database are cited from original references.
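Affinity values such as Kd, Ki and IC50 are often put on a common logarithmic scale before they are compared or used for modelling. The following is a minimal sketch of that conversion, assuming the values are supplied in nanomolar; the function name `to_p_value` and the example numbers are illustrative and are not taken from PDBbind.

```python
import math


def to_p_value(value_nm: float) -> float:
    """Convert an affinity in nM to -log10 of the molar value (e.g. Kd -> pKd)."""
    if value_nm <= 0:
        raise ValueError("affinity must be a positive concentration")
    return -math.log10(value_nm * 1e-9)  # nM -> M, then negative log10


# Hypothetical measurements in nM, keyed by measurement type
examples = {"Kd": 10.0, "Ki": 250.0, "IC50": 1000.0}

p_values = {kind: to_p_value(nm) for kind, nm in examples.items()}
for kind, nm in examples.items():
    print(f"{kind} = {nm:g} nM -> p{kind} = {p_values[kind]:.2f}")
```

Note that Kd, Ki and IC50 are different quantities, so putting them on the same scale does not by itself make them interchangeable.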

The GRAC database is a searchable online database of information from the 5th (2011) edition of the BPS Guide to Receptors and Channels (GRAC) [1], which provides a succinct overview of the key properties of over 1600 established or potential pharmacological targets.

SuperTarget is an extensive web resource for analyzing 332,828 drug-target interactions.

Therapeutic Target Database provides information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets. Also included in this database are links to relevant databases containing information about target function, sequence, 3D structure, ligand binding properties, enzyme nomenclature and drug structure, therapeutic class and clinical development status. All information provided is fully referenced.

The Centre for Therapeutic Target Validation platform brings together information on the relationships between potential drug targets and diseases. The core concept is to identify evidence of an association between a target and disease from various data types. The Centre for Therapeutic Target Validation is a pre-competitive public-private venture that aims to provide evidence on the biological validity of therapeutic targets and provide an initial assessment of the likely effectiveness of pharmacological intervention on these targets, using genome-scale experiments and analysis. The platform currently contains 28,931 targets and 3,049,882 associations for 10,053 diseases.

MACiE stands for Mechanism, Annotation and Classification in Enzymes (G. L. Holliday, C. Andreini, J. D. Fischer, S. A. Rahman, D. E. Almonacid, S. T. Williams and W. R. Pearson. Nucleic Acids Research, 40, D783-D789, 2012. Medline ID: 22058127). The current version of MACiE (Version 3.0) contains 335 fully annotated enzyme reaction mechanisms.

ChEBI Release 145 is live with 50089 fully annotated entities. ChEBI stands for ‘Chemical Entities of Biological Interest’. It is a freely available database of ‘small molecular entities’, developed at the EBI. The term ‘molecular entity’ encompasses any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity.

The Chemical Structure Lookup Service allows you to search through 39 million indexed structures from 80 different databases. It is very fast, and again you can create the query using a variety of formats (including SMILES) or an embedded Java applet.

DrugBank is a richly annotated database of drug and drug target information. It contains extensive data on the nomenclature, ontology, chemistry, structure, function, action, pharmacology, pharmacokinetics, metabolism and pharmaceutical properties of both small molecule and large molecule drugs. As the table below shows, the amount of information available in DrugBank has increased considerably since the first version.

CoCoCo is a suite of molecular databases for high throughput virtual screening purposes. CoCoCo collects molecular structural information of commercial compounds from various chemical vendors and provides it in a ready-to-use format. The main characteristic of CoCoCo is that it includes structural information about the conformational states of the compounds.

RxList (www.rxlist.com) provides electronic versions of the FDA’s drug-product data sheets.

SkinSensDB is a curated database for skin sensitization assays. Skin sensitization is an important toxicological endpoint in drug development and regulatory decision making. Chemical sensitizers act as haptens, binding to protein molecules to trigger immune responses that could induce allergic contact dermatitis. To facilitate development of AOP-based computational prediction methods, a novel curated database named SkinSensDB has been constructed by manual curation of published literature. DOI.

DisGeNET is a discovery platform integrating information on gene-disease associations (GDAs) from several public data sources and the literature. DOI.

Reframe.db is a screening library of 12,000 molecules assembled by combining three databases (Clarivate Integrity, GVK Excelra GoStar and Citeline Pharmaprojects) to facilitate drug repurposing.

Superdrug2 is a comprehensive resource for approved/marketed drugs. It contains details of 4,500 active pharmaceutical ingredients annotated with regulatory details, chemical structures (2D and 3D), dosage, biological targets, physicochemical properties, external identifiers, side-effects and pharmacokinetic data.

There is also SwissBioisostere, a web service designed to give ideas about potential bioisosteres, derived from a matched molecular pair (MMP) analysis of ChEMBL 17. Two different queries are possible: you are interested in a range of possible replacements for a single substructure (e.g. replacements for an amide group), or you want to know details about a particular substructural replacement of interest (e.g. carboxylic acid vs. tetrazole). Whilst this is very comprehensive, it contains a lot of transformations that were never intended to be bioisosteric replacements.
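The two query types can be pictured as two lookups over a table of matched-pair records. The sketch below is only an illustration of that idea and not the SwissBioisostere API: the records, the pair counts and the function names `build_index`, `replacements_for` and `pair_details` are all hypothetical, and fragments are named in words rather than encoded as structures.

```python
from collections import defaultdict

# Hypothetical matched-pair records: (fragment_a, fragment_b, number_of_pairs)
RECORDS = [
    ("amide", "sulfonamide", 12),
    ("amide", "ester", 7),
    ("carboxylic acid", "tetrazole", 5),
]


def build_index(records):
    """Index each replacement in both directions for symmetric lookup."""
    index = defaultdict(dict)
    for frag_a, frag_b, count in records:
        index[frag_a][frag_b] = count
        index[frag_b][frag_a] = count
    return index


def replacements_for(index, fragment):
    """Query type 1: all recorded replacements for one fragment, most frequent first."""
    return sorted(index.get(fragment, {}).items(), key=lambda kv: kv[1], reverse=True)


def pair_details(index, frag_a, frag_b):
    """Query type 2: pair count for one specific replacement, or None if unrecorded."""
    return index.get(frag_a, {}).get(frag_b)


INDEX = build_index(RECORDS)
print(replacements_for(INDEX, "amide"))
print(pair_details(INDEX, "carboxylic acid", "tetrazole"))
```

A raw pair count says nothing about whether a replacement was meant to be bioisosteric, so results from a lookup like this still need chemical judgement.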

It is also worth noting the Enamine REAL Database. The current release of the REAL database comprises over 700 million compounds that comply with “rule of 5” and Veber criteria: MW≤500, SlogP≤5, HBA≤10, HBD≤5, rotatable bonds≤10, and TPSA≤140. This is a database of enumerated, synthetically accessible structures.
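The property cut-offs quoted above translate directly into a simple filter. The following is a minimal sketch that assumes the descriptors have already been calculated with a cheminformatics toolkit; the `Descriptors` class, the function `passes_real_criteria` and the example values are hypothetical and are not part of any Enamine tool.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (values assumed to come from a toolkit)."""
    mw: float              # molecular weight
    slogp: float           # calculated logP
    hba: int               # hydrogen-bond acceptors
    hbd: int               # hydrogen-bond donors
    rotatable_bonds: int
    tpsa: float            # topological polar surface area


def passes_real_criteria(d: Descriptors) -> bool:
    """Apply the cut-offs quoted in the text: all six conditions must hold."""
    return (
        d.mw <= 500
        and d.slogp <= 5
        and d.hba <= 10
        and d.hbd <= 5
        and d.rotatable_bonds <= 10
        and d.tpsa <= 140
    )


# Hypothetical descriptor sets for illustration only
small_polar = Descriptors(mw=320.4, slogp=2.1, hba=5, hbd=2, rotatable_bonds=4, tpsa=78.0)
too_flexible = Descriptors(mw=480.0, slogp=4.5, hba=8, hbd=3, rotatable_bonds=14, tpsa=110.0)

for name, d in [("small_polar", small_polar), ("too_flexible", too_flexible)]:
    print(name, "passes" if passes_real_criteria(d) else "fails")
```

The same pattern, with different thresholds, could be used to build other property-defined subsets.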

Structural Databases

The Cambridge Crystallographic Data Centre (CCDC) compiles and distributes the Cambridge Structural Database (CSD), the world’s repository of experimentally determined organic and metal-organic crystal structures.

The Crystallography Open Database (COD) provides an open-access collection of crystal structures of organic, inorganic and metal-organic compounds and minerals. There are currently 214780 entries in COD.

GPCRdb contains data, diagrams and web tools for G protein-coupled receptors (GPCRs). Users can browse all GPCR crystal structures and the largest collection of receptor mutants.

The RCSB Protein Data Bank contains 226,262 biomolecular structures. Of the 210 drugs (NMEs) registered by the FDA between 2010 and 2016, the molecular targets are known for 94%. The PDB contains 5,914 structures containing one of the known targets and/or a new drug, providing structural coverage for 88% of the recently approved NMEs across all therapeutic areas. DOI.

The Protein Data Bank in Europe contains 226,262 entries.

Binding MOAD is a subset of the Protein Data Bank (PDB) containing every high-quality example of ligand-protein binding, hence the name Mother of All Databases (MOAD).

MINICRYST is a crystallographic and crystallochemical database for minerals and their structural analogues.

KLIFS is a kinase database that dissects experimental structures of catalytic kinase domains and the way kinase inhibitors interact with them. The KLIFS structural alignment enables the comparison of all structures and ligands to each other. Moreover, the KLIFS residue numbering scheme capturing the catalytic cleft with 85 residues enables the comparison of the interaction patterns of kinase-inhibitors, for example, to identify crucial interactions determining kinase-inhibitor selectivity. DOI.

Worth reading

The NAR Database issue is an annual update of available databases. The focus is largely on biological databases, but it also includes chemistry, toxicology, and target validation resources.

Cambridge MedChem Consulting