Deep machine learning successfully captures information about one million bioactive molecules.
August 1, 2021
Using deep machine-learning computational models, the Structural Bioinformatics and Network Biology team, led by ICREA Researcher Dr. Patrick Aloy, has completed the bioactivity information for a million molecules. The team has also developed a method for predicting any molecule’s biological activity in the absence of experimental data.
This new methodology is based on the Chemical Checker, the world’s largest database of bioactivity profiles for pseudopharmaceuticals to date, which the same laboratory produced and published in 2020. For each molecule, the Chemical Checker collects data from 25 bioactivity spaces. These spaces are associated with the molecule’s chemical structure, the targets with which it interacts, and the clinical or cellular changes that the molecule generates. For the majority of molecules, however, this highly detailed information on their mechanism of action is incomplete: a given molecule may have data for one or two of the 25 bioactivity spaces, but not for all of them.
The new development integrates all available experimental data with deep machine learning approaches, allowing researchers to complete the activity profiles of all compounds, from the chemical to the clinical level.
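As a rough illustration of what completing a profile means, the following Python sketch treats a molecule's profile as a mapping from 25 space labels to signatures, some of which are missing. The labels A1 to E5, the three-number signatures, and the `placeholder_predictor` function are illustrative assumptions, not the team's data or models; in the real system a trained neural network would take the predictor's place.

```python
from typing import Callable, Dict, List, Optional

# Illustrative labels for 25 bioactivity spaces (5 levels x 5 sublevels).
SPACES: List[str] = [f"{level}{i}" for level in "ABCDE" for i in range(1, 6)]

Signature = List[float]


def missing_spaces(profile: Dict[str, Optional[Signature]]) -> List[str]:
    """Return the spaces for which no experimental signature is available."""
    return [space for space in SPACES if profile.get(space) is None]


def complete_profile(
    profile: Dict[str, Optional[Signature]],
    predictor: Callable[[str], Signature],
) -> Dict[str, dict]:
    """Keep experimental signatures as they are and fill the gaps with predictions."""
    completed = {}
    for space in SPACES:
        known = profile.get(space)
        if known is not None:
            completed[space] = {"signature": known, "source": "experimental"}
        else:
            completed[space] = {"signature": predictor(space), "source": "predicted"}
    return completed


def placeholder_predictor(space: str) -> Signature:
    """Hypothetical stand-in for a trained model; it returns a dummy signature."""
    return [0.0, 0.0, 0.0]


# A hypothetical molecule with experimental data in only two of the 25 spaces.
profile = {"A1": [0.2, 0.7, 0.1], "A2": [0.9, 0.1, 0.0]}
completed = complete_profile(profile, placeholder_predictor)
n_missing = len(missing_spaces(profile))
```

Recording the source of each signature keeps measured and estimated values distinguishable after the profile has been filled in.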
“The new technique also enables us to estimate the bioactivity spaces of novel compounds, which is critical throughout the drug discovery process since it allows us to choose the most promising candidates and exclude those that would not work for whatever reason,” Dr. Aloy explains.
The software library is available to the scientific community for free at bioactivitysignatures.org, and it will be updated on a regular basis as new biological activity data become available. The artificial neural networks will also be updated with each update of the experimental data in the Chemical Checker, to further refine the estimates.
Reliability of predictions
The reliability of the bioactivity data predicted by the model varies with several factors, including the amount of experimental data available and the molecule’s properties.
Along with predicting biologically relevant properties, the method built by Dr. Aloy’s team also provides a measure of how reliable the prediction is for each molecule. “While all models are incorrect, there are a few that are quite useful! A level of confidence enables a more nuanced interpretation of the results, emphasising which domains of a molecule’s bioactivity are precise and which have a possibility of error,” adds Dr. Martino Bertoni, the work’s first author.
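One way such a confidence measure might be used in practice is to separate the predicted spaces that can be trusted from those that call for caution. The Python sketch below assumes a confidence score between 0 and 1 and a cut-off of 0.5; the scores, the cut-off, and all names are hypothetical and are not taken from the published tool.

```python
from typing import List, NamedTuple, Tuple


class Prediction(NamedTuple):
    space: str         # label of the bioactivity space (illustrative)
    confidence: float  # hypothetical score: 0 = unreliable, 1 = reliable


def split_by_confidence(
    predictions: List[Prediction], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Separate spaces whose prediction meets the threshold from those below it."""
    trusted = [p.space for p in predictions if p.confidence >= threshold]
    uncertain = [p.space for p in predictions if p.confidence < threshold]
    return trusted, uncertain


# Made-up scores for a single molecule.
predictions = [
    Prediction("A1", 0.95),
    Prediction("B4", 0.62),
    Prediction("D1", 0.31),
    Prediction("E5", 0.12),
]
trusted, uncertain = split_by_confidence(predictions)
```

A fixed cut-off is the simplest choice; a real analysis could just as well weight each space by its score instead of discarding the less certain ones.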
Using the IRB Barcelona compound library to validate the system
To validate the tool, the researchers searched the IRB Barcelona compound library for compounds that might be good drug candidates for modulating the activity of a cancer-related transcription factor, SNAIL1. The activity of SNAIL1 is nearly impossible to modulate through direct drug binding, which is why it is considered an ‘undruggable’ target. From a first set of 17,000 compounds, the deep machine learning models predicted properties (in terms of their dynamics, interaction with target cells and proteins, and so on) that matched the target for 131 of them.
The ability of these compounds to degrade SNAIL1 was then tested experimentally. For a large proportion of the compounds, the observed degradation capability was consistent with what the models anticipated, thus validating the system.
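The article does not say how the 131 compounds were matched to the target. One common way to shortlist candidates from predicted signatures is to rank a library by similarity to a reference signature, and the Python sketch below shows that generic approach, not the team's actual procedure. The compound names, the two-number signatures, and the reference `query` are all made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two signatures (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shortlist(
    library: Dict[str, Sequence[float]], query: Sequence[float], top_k: int
) -> List[str]:
    """Return the names of the top_k compounds most similar to the query signature."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine_similarity(item[1], query),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical library of predicted signatures and a hypothetical reference signature.
library = {
    "cmpd_a": [1.0, 0.0],
    "cmpd_b": [0.0, 1.0],
    "cmpd_c": [1.0, 1.0],
}
query = [1.0, 0.0]
candidates = shortlist(library, query, top_k=2)
```

Applied to a library of 17,000 compounds, the same ranking would simply use a larger `library` and a `top_k` chosen to fit the capacity of the experimental follow-up.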
Reference
Martino Bertoni et al., Bioactivity descriptors for uncharacterized chemical compounds, Nature Communications (2021). DOI: 10.1038/s41467-021-24150-4