I see that a new version of ChEMBL has been released. ChEMBL 28 contains:
- 2,680,904 compound records
- 2,086,898 compounds (of which 2,066,376 have mol files)
- 17,276,334 activities
- 1,358,549 assays
- 14,347 targets
- 80,480 documents
The latest release of the essential molecule bioactivity dataset has just been announced.
ChEMBL 26 contains:
- 2,425,876 compound records
- 1,950,765 compounds (of which 1,940,733 have mol files)
- 15,996,368 activities
- 1,221,311 assays
- 13,377 targets
- 76,076 documents
A couple of notes:
We are now using RDKit for almost all of our compound-related processing. For the first time, in ChEMBL 26, this includes compound standardization, salt-stripping, generation of canonical SMILES, structural alerts, image depiction, substructure searches and similarity searches (via FPSim2: https://github.com/chembl/FPSim2). All molecules have therefore been reprocessed, and you may notice some differences in molfiles, SMILES and structure search results compared with previous releases. The ChEMBL structure curation pipeline has been released as an open source package (https://github.com/chembl/ChEMBLStructure_Pipeline) and incorporated into our Beaker web services (see below). More information can be found here: http://chembl.blogspot.com/2020/02/chembl-compound-curation-pipeline.html.
We are also now using ChemAxon tools to calculate most acidic and basic pKa, logP and logD (pH 7.4) predictions, rather than ACDLabs software. These properties have therefore been recalculated and renamed in the database.
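To give a feel for what a fingerprint similarity search does, here is a minimal sketch in plain Python. It is not the FPSim2 or RDKit API: the fingerprints are hypothetical sets of "on" bit indices, the molecule names are made up, and the function names `tanimoto` and `similarity_search` are introduced only for illustration. It assumes the Tanimoto coefficient as the similarity measure, which is a common choice for this kind of search.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy fingerprints (real ones have hundreds or thousands of bits).
library = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({7, 8, 9}),
}
query = frozenset({1, 2, 3, 4})

hits = similarity_search(query, library, threshold=0.5)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

In practice the fingerprints would be generated from the molecules themselves by a cheminformatics toolkit, and a dedicated library would handle the search far more efficiently than this linear scan.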
Off-target activity is often ignored and might only be uncovered relatively late in a drug discovery program. Whilst broad-spectrum screening is available, it can be rather expensive. Predicting potential off-target activities is an attractive alternative, and this paper describes the development of a prediction tool that combines nearest neighbours with machine learning.
The Polypharmacology Browser PPB2: Target Prediction Combining Nearest Neighbors with Machine Learning DOI
To build PPB2 we collected a bioactivity dataset of all compounds having at least IC50 < 10 µM on a single protein target in ChEMBL22, considering only high confidence data points as annotated in ChEMBL and only targets for which at least 10 compounds were documented.
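The selection criteria quoted above can be sketched roughly as follows. This is only an illustration and not the authors' code: the `Activity` record, its field names and the `build_training_set` function are hypothetical, the high-confidence annotation is reduced to a simple boolean, and the records are assumed to be already restricted to single-protein targets.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Activity(NamedTuple):
    """Hypothetical, simplified bioactivity record."""
    compound_id: str
    target_id: str
    ic50_um: float          # IC50 in micromolar
    high_confidence: bool   # stand-in for the ChEMBL confidence annotation


def build_training_set(
    activities: Iterable[Activity],
    max_ic50_um: float = 10.0,
    min_compounds: int = 10,
) -> Dict[str, Set[str]]:
    """Map each retained target to the set of its active compounds."""
    by_target: Dict[str, Set[str]] = defaultdict(set)
    for act in activities:
        # Keep only high-confidence points with IC50 below the cutoff.
        if act.high_confidence and act.ic50_um < max_ic50_um:
            by_target[act.target_id].add(act.compound_id)
    # Keep only targets with enough documented active compounds.
    return {t: c for t, c in by_target.items() if len(c) >= min_compounds}


# Toy data: T1 has 10 qualifying compounds, T2 only 3.
records = [Activity(f"C{i}", "T1", 1.0, True) for i in range(10)]
records += [Activity(f"D{i}", "T2", 0.5, True) for i in range(3)]
records.append(Activity("C99", "T1", 50.0, True))   # too weak, dropped
records.append(Activity("C98", "T1", 0.1, False))   # low confidence, dropped

training = build_training_set(records)
print({target: len(compounds) for target, compounds in training.items()})
```

Counting distinct compounds per target, rather than raw data points, keeps repeated measurements of the same compound from pushing a target over the threshold.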
You can try it out here: PPB2. Depending on the model chosen, the results are calculated in a couple of minutes, but don't post your proprietary molecules. Typical results are shown below; clicking on the green "Show NN" button shows the most similar structures.