Chimeric Molecules as Adversarial Training Examples for Machine Learning
The application of machine learning techniques to cheminformatics data has accelerated tremendously in recent years. In particular, deep learning architectures show considerable promise in extracting patterns and features from large datasets that are not readily obvious to human experts or apparent with hand-crafted features. With this larger modeling capacity comes greater potential for machine learning models to “cheat” by identifying trivial hyperplanes that separate class labels, rather than learning generalizable, physics-based properties of the dataset. Such a hyperplane arises, for example, when a model learns that a particular functional group, such as a terminal sulfonylamide, is always associated with binding affinity, regardless of where that functional group is placed on the molecule. Using such a model in a predictive setting leads to an overabundance of predictions favoring that functional group, without regard to the molecular scaffold or the presence of other substituents. We have devised a general scheme for combating this form of bias. Using a fragment-based genetic algorithm, we take any compound in our training set and form a set of scrambled compounds that contain the parts of this compound plus a fraction of new fragments. These scrambled compounds are used as decoys during model training to mitigate bias arising from any one fragment. We show here that the use of these decoys reduces undesirable functional-group bias in predictions made by the AtomNet® structure-based architecture and by proteochemometric models. They also reduce the occurrence of “favorite” molecules that are predicted to bind strongly regardless of the target protein. The algorithmic approach to forming these chimeric molecules is general and can be readily adapted to condition models in other ligand-based modeling contexts, such as the prediction of ADMET properties.
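To make the decoy-generation idea concrete, the following is a minimal, hypothetical sketch of the scrambling step. It is not the paper's implementation: the fragment representation (plain strings standing in for real molecular fragments), the `swap_fraction` parameter, and the helper name `make_decoys` are all illustrative assumptions. A real system would fragment and reassemble molecules chemically (e.g., with a cheminformatics toolkit) rather than shuffling strings.

```python
import random

def make_decoys(fragments, fragment_pool, n_decoys=5, swap_fraction=0.3, rng=None):
    """Generate scrambled 'chimeric' decoys from a parent molecule's fragments.

    Each decoy keeps most of the parent's fragments but replaces a fraction
    of them with fragments drawn from a training-set-wide pool, then shuffles
    the fragment order (a stand-in for chemically reconnecting fragments).
    """
    rng = rng or random.Random()
    decoys = []
    for _ in range(n_decoys):
        decoy = list(fragments)
        # Swap out a fraction of positions for new fragments from the pool.
        n_swap = max(1, int(swap_fraction * len(decoy)))
        for idx in rng.sample(range(len(decoy)), n_swap):
            decoy[idx] = rng.choice(fragment_pool)
        rng.shuffle(decoy)
        decoys.append(decoy)
    return decoys

# Toy example: strings stand in for molecular fragments.
parent = ["scaffold_A", "amide", "terminal_sulfonylamide"]
pool = ["scaffold_B", "piperidine", "pyridine", "trifluoromethyl"]
decoys = make_decoys(parent, pool, n_decoys=3, swap_fraction=0.34,
                     rng=random.Random(0))
for d in decoys:
    print(d)
```

Because each decoy retains most of the parent's fragments, including the biased functional group, but pairs them with a different context, a model trained against such decoys cannot rely on the presence of any single fragment as a trivial separating feature.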