Reading notes on "Artificial Intelligence in Drug Design"
Historically, assay definitions and upload procedures were often set up to allow direct consumption of the data by the requesting research project, but not with further usage in mind.
In March 2016, a consortium of scientists published a paper outlining four foundational principles (Findability, Accessibility, Interoperability, and Reusability), abbreviated as the FAIR principles, which describe the process of making data FAIR ("FAIRification").
Close communication with the experimentalists is of utmost importance for the data scientist during the data preprocessing stage.
An assay is composed of four components: a biological or physicochemical test system, a detection method, the technical infrastructure, and finally data analysis and processing.
It might be necessary to cleave off the leaving groups of prodrugs in case the experimental property is determined for the pharmacologically active substance.
Inconsistent hydrogen treatment may result in differing descriptor values.
Most of the descriptor packages cannot cope with stereochemistry anyhow, and therefore stereocenters are flattened.
One may, in the case of modeling target affinity profiles, additionally apply structure filters for frequent hitters, such as PAINS or "Hit Dexter", to avoid noise due to unspecific binding.
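A minimal preprocessing sketch with RDKit, illustrating the steps mentioned above (salt stripping, flattening stereocenters, flagging PAINS-type frequent hitters); the choice and order of steps, and the example SMILES, are my own assumptions, not the book's or MELLODDY's pipeline.

```python
# Illustrative preprocessing sketch with RDKit (not a production pipeline):
# strip salts, flatten stereocenters, and flag frequent hitters via PAINS filters.
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from rdkit.Chem import FilterCatalog

remover = SaltRemover()
params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
pains = FilterCatalog.FilterCatalog(params)

def preprocess(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # unparseable structure -> drop
        return None
    mol = remover.StripMol(mol)          # remove common counter-ions
    Chem.RemoveStereochemistry(mol)      # flatten stereocenters (see note above)
    if pains.HasMatch(mol):              # frequent-hitter alert -> flag/drop
        return None
    return Chem.MolToSmiles(mol)         # canonical SMILES of the parent structure

print(preprocess("C[C@H](N)C(=O)O.Cl"))  # e.g. alanine hydrochloride
```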
The European Union-funded MELLODDY consortium, part of the Innovative Medicines Initiative (IMI), has developed and published an end-to-end open-source tool for the process described above, named MELLODDY_tuner. The tool is used to standardize the data needed for the project's goal of federated and privacy-preserving machine learning, leveraging the world's largest collection of small molecules with known biochemical or cellular activity to enable more accurate predictive models and increase efficiency in drug discovery.
Combining data from different sources poses further challenges; one alternative is to establish a multitask ML model that predicts the values of one assay and uses the other assay variant as a helper task.
There are three categories of data that require curation:
Random Forests have long been the method of choice in Bayer’s ADMET platform for several reasons:
The advantages of deep neural networks in Drug Discovery appear to be that
Here is a classification scheme from Wikipedia, which defines five main classes:
1D-descriptors (i.e., lists of structural fragments, fingerprints). Our work-horse descriptors for more than a decade, confirmed by many publications, are circular extended connectivity fingerprints (ECFP), which encode properties of atoms and their neighbors into a bit vector, parameterized by a topological radius (number of bonds from the starting atom) and a feature type (element; function as donor, acceptor, etc.; or atom type); see the sketch after this list.
2D-descriptors (i.e., graph invariants). Graph-invariant 2D-descriptors such as topological or connectivity indices, at least in our hands, often yield overfitted models that work well in cross-validation but are not predictive on external test sets.
3D-descriptors (such as 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, quantum-chemical descriptors, and size, steric, surface, and volume descriptors). The main issue is their dependence on the conformation, which introduces ambiguities and noise.
4D-descriptors (such as those derived from GRID or CoMFA methods, e.g. Volsurf). These approaches have the additional limitation of depending on the alignment of the ligands, which is sometimes not obvious and is only possible for congeneric series.
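To make the ECFP work-horse descriptors from the 1D item above concrete, here is a minimal sketch using RDKit's Morgan fingerprint implementation; radius 2 roughly corresponds to ECFP4, and useFeatures=True switches to the FCFP-style donor/acceptor feature invariants. The molecule and the bit-vector length are arbitrary choices for illustration.

```python
# Sketch: ECFP-like circular fingerprints via RDKit's Morgan algorithm.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")          # aspirin
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
fcfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048,
                                             useFeatures=True)  # feature invariants

x = np.array(ecfp)        # bit vector usable directly as ML input
print(x.shape, x.sum())   # (2048,) and the number of set bits
```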
There are now public databases and model repositories for compound collections, such as QsarDB, the Danish (Q)SAR Models database, the QSAR Toolbox, etc.
Actually, the perception of learning the optimal representation directly from the molecule is not exactly correct, because the SMILES or InChI typically used as structure input is already an abstract representation (i.e., a reduction) of the molecule.
Winter et al. applied an encoder-decoder (autoencoder-like) concept to learn a fixed set of continuous data-driven descriptors, the CDDD, by translating random SMILES into canonical SMILES during training. The resulting descriptor is based on approximately 72 million compounds from the ZINC and PubChem databases. The validity of the approach was tested by model performance on eight QSAR datasets and by application to virtual screening; it showed performance similar to various human-engineered descriptors and graph-convolutional models.
Machine learning problems concerned with the reactivity of atoms, such as reaction rates and regioselectivity, pKa values, the prediction of metabolic fate, or hydrogen-bonding interactions, require encoding the properties of the atoms and their surroundings into specialized atom descriptors. In many applications the descriptor values are retrieved directly from quantum-chemical calculations.
There are also examples of well-performing classical neighborhood-encoding atom descriptors for SoM prediction and regioselectivity in Diels-Alder reactions.
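As a deliberately simple stand-in for such atom descriptors, the sketch below assembles per-atom features with RDKit: Gasteiger partial charges as a cheap surrogate for quantum-chemically derived electron density, plus a few topological counts. The feature selection is my own illustration, not a descriptor set from the book.

```python
# Sketch: simple per-atom descriptors (charge + topological counts).
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")          # phenol
AllChem.ComputeGasteigerCharges(mol)           # stores charges on the atoms

for atom in mol.GetAtoms():
    features = {
        "symbol":   atom.GetSymbol(),
        "charge":   float(atom.GetProp("_GasteigerCharge")),  # electron-density proxy
        "degree":   atom.GetDegree(),          # number of heavy-atom neighbours
        "n_h":      atom.GetTotalNumHs(),      # crude steric/accessibility proxy
        "aromatic": atom.GetIsAromatic(),
    }
    print(atom.GetIdx(), features)
```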
Common metrics for regression models include R², root mean square error (RMSE), and Spearman's rho.
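A quick sketch of these three regression metrics on dummy observed/predicted values, using scikit-learn and SciPy; the numbers are made up.

```python
# Sketch: R2, RMSE and Spearman's rho on toy data.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from scipy.stats import spearmanr

y_obs  = np.array([5.1, 6.3, 4.8, 7.0, 5.9])
y_pred = np.array([5.4, 6.0, 5.0, 6.6, 6.1])

r2   = r2_score(y_obs, y_pred)
rmse = np.sqrt(mean_squared_error(y_obs, y_pred))
rho, _ = spearmanr(y_obs, y_pred)
print(f"R2={r2:.2f}  RMSE={rmse:.2f}  rho={rho:.2f}")
```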
Common metrics for assessing the quality of classification models are derived from the confusion matrix, also called the contingency matrix.
In the case of highly imbalanced datasets, accuracy can be misleading: if the model always predicts the more highly populated class, it will achieve high accuracy without being predictive.
Specificity, or true negative rate, is the proportion of observed negatives that are predicted as such, while sensitivity, also called true positive rate or recall, is the proportion of observed positives that are predicted correctly. Another metric focusing more on the predictions than on the observed values is the positive predictive value, also called precision, which gives the proportion of correctly predicted positives out of all predicted positives. The F-score is the harmonic mean of precision and sensitivity.
The Matthews correlation coefficient (MCC) is the geometric mean of the regression coefficients of the problem and its dual (in essence a correlation coefficient between observed and predicted classes) and is also suitable for classification problems with imbalanced class distributions.
Cohen’s kappa is also a good measure that can handle imbalanced class distributions and shows how much better the classifier is compared to a classifier that would guess randomly according to the frequency of each class.
Another popular tool is the receiver operating characteristic (ROC) graph, used to visualize the performance of a classification algorithm. The area under the ROC curve (ROC AUC) is the numerical metric used to summarize the ROC curve.
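The sketch below derives the metrics discussed above from a toy confusion matrix with scikit-learn; the labels and scores are made up to mimic an imbalanced dataset.

```python
# Sketch: classification metrics derived from a confusion matrix.
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             cohen_kappa_score, roc_auc_score)

y_obs   = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # imbalanced toy labels
y_pred  = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]    # hard class predictions
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.2, 0.3, 0.6, 0.1]  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_obs, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate / recall
specificity = tn / (tn + fp)                 # true negative rate
precision   = tp / (tp + fp)                 # positive predictive value
print(sensitivity, specificity, precision)
print("F1:",      f1_score(y_obs, y_pred))
print("MCC:",     matthews_corrcoef(y_obs, y_pred))
print("kappa:",   cohen_kappa_score(y_obs, y_pred))
print("ROC AUC:", roc_auc_score(y_obs, y_score))  # uses scores, not hard labels
```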
The process accepted as best practice for machine learning, which has developed over the last 20 years and is now generally applied, is described in current reviews and outlined in detail in the respective OECD guideline.
Validation strategies broadly applied are cross-validation, bootstrapping and Y-scrambling.
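A minimal sketch of cross-validation combined with Y-scrambling (response permutation) as a sanity check, assuming a scikit-learn random forest on synthetic data; a sound model should lose its apparent predictivity once the labels are shuffled.

```python
# Sketch: cross-validation plus a Y-scrambling sanity check on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)

cv_real = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

rng = np.random.default_rng(0)
cv_scrambled = cross_val_score(model, X, rng.permutation(y), cv=5,
                               scoring="r2").mean()          # shuffled labels

print(f"cross-validated R2: {cv_real:.2f}, after Y-scrambling: {cv_scrambled:.2f}")
```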
By sharing parameters in (some of) their hidden layers between all tasks, multitask neural networks force the learning of a joint representation of the input that is useful to all tasks.
The main advantages of multitask learning are
Multitask effects are highly dataset-dependent, which suggests the use of dataset-specific models to maximize overall performance.
In the author's work, the multitask graph convolutional network performed on par with or better than the single-task graph convolutional network and outperformed single-task random forests or neural networks with circular fingerprint descriptors, especially in the case of solubility, where the improvement was a breakthrough.
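For illustration, a minimal multitask network with shared hidden layers and one output head per task, written in PyTorch; this is a generic sketch under my own assumptions (fingerprint-sized input, ten tasks), not the architecture used in the cited work.

```python
# Sketch: shared hidden layers across tasks, one output head per task/assay.
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    def __init__(self, n_features=2048, n_hidden=512, n_tasks=10):
        super().__init__()
        self.shared = nn.Sequential(            # joint representation for all tasks
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(             # task-specific output heads
            [nn.Linear(n_hidden, 1) for _ in range(n_tasks)]
        )

    def forward(self, x):
        h = self.shared(x)
        return torch.cat([head(h) for head in self.heads], dim=1)  # (batch, n_tasks)

x = torch.randn(32, 2048)                       # e.g. a batch of fingerprint vectors
print(MultitaskNet()(x).shape)                  # torch.Size([32, 10])
```

In practice the loss is usually masked so that each compound only contributes to the tasks for which it actually has a measurement, since the label matrix is sparse.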
Any oral drug first passes through the liver before entering the rest of the body. Metabolic transformations occur in two phases. In phase I, mostly cytochrome P450 enzymes increase polarity through oxidative and reductive reactions. In phase II, a plethora of enzymes such as UDP-glucuronosyltransferases, sulfotransferases, or glutathione S-transferases conjugate specific fragments onto the phase I metabolites for renal excretion.
The high effort and limited experimental assay capacity for the identification of SoMs have led to many computational approaches over the last 20 years, applying docking, molecular dynamics, quantum chemistry calculations, and machine learning, with and without incorporation of protein target information. The reader is referred to Kirchmair et al. for a comprehensive overview of experimental and computational approaches.
The lability of atoms with regard to metabolic reactions is determined by their chemical reactivity, i.e., the local electron density, and by the steric accessibility of the respective atoms, necessitating atomic descriptors instead of molecular ones, as well as machine learning on atoms instead of molecules.
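To make the atoms-as-instances idea tangible, the sketch below turns every heavy atom of a molecule into one feature row (Gasteiger charge as a crude reactivity proxy, neighbor and hydrogen counts as crude accessibility proxies); the features are illustrative only, and a real SoM model would attach experimentally derived per-atom labels.

```python
# Sketch: atoms, not molecules, as the machine-learning instances for SoM prediction.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def atom_features(smiles):
    mol = Chem.MolFromSmiles(smiles)
    AllChem.ComputeGasteigerCharges(mol)
    rows = []
    for atom in mol.GetAtoms():
        rows.append([
            float(atom.GetProp("_GasteigerCharge")),   # local electron density proxy
            atom.GetDegree(),                          # heavy-atom neighbours
            atom.GetTotalNumHs(),                      # attached hydrogens
            int(atom.GetIsAromatic()),
        ])
    return np.array(rows)                              # one row per heavy atom

X = atom_features("CCOc1ccccc1")
print(X.shape)                                         # (n_atoms, 4)
# A real SoM model would add a 0/1 label per atom (metabolized or not)
# and train a per-atom classifier, e.g. a random forest.
```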
The complex multi-parameter optimization necessary to find the best compromise among many optimization parameters is a key challenge in drug discovery projects.
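One common, simple way to formalize such a compromise is a weighted desirability score that maps each predicted property onto [0, 1] and averages the contributions; the property names, target ranges, and weights below are invented for illustration and do not come from the book.

```python
# Sketch: a weighted desirability score for multi-parameter optimization.
import numpy as np

def desirability(value, low, high):
    """1.0 inside the preferred range, linearly decaying to 0 outside."""
    if low <= value <= high:
        return 1.0
    dist = min(abs(value - low), abs(value - high))
    return max(0.0, 1.0 - dist / (high - low))

def mpo_score(props, targets, weights):
    d = [desirability(props[k], *targets[k]) for k in targets]
    w = [weights[k] for k in targets]
    return float(np.average(d, weights=w))   # a weighted geometric mean is also common

# Hypothetical target ranges and weights for three predicted properties.
targets = {"logD": (1.0, 3.0), "solubility_uM": (50, 1000), "clearance": (0, 30)}
weights = {"logD": 1.0, "solubility_uM": 2.0, "clearance": 1.0}
compound = {"logD": 3.4, "solubility_uM": 120, "clearance": 25}
print(mpo_score(compound, targets, weights))
```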
In this section, we sketch a project situation from 2016, showing by example how to tackle the prioritization of compounds from a large virtual chemical space by combining cheminformatics and physics-based approaches.
The willingness to share more of those data, in conjunction with the use of blockchain technologies, enables the privacy-preserving exchange of data among many pharmaceutical companies, increasing the data basis for models by several orders of magnitude.
Some of those machine learning models have reached a quality that allows experimental measurements to be significantly reduced or even halted. However, this does not hold for all endpoints; despite the availability of large homogeneous datasets, some ADMET endpoints still cannot be modeled with sufficient quality. Phys-chem/ADMET properties as well as chemical synthesizability are mainly modeled with data-based approaches such as machine learning.
For pharmacological endpoints, the data are sparse, and only a smaller fraction (typically <30%) of a larger, diverse drug target portfolio will be covered by ML models with sufficient predictivity. Pharmacological activity is therefore often addressed with protein-structure-based approaches.