Motivation
learning applications, and drug discovery is no exception.
• Deep learning models typically rely on large amounts of annotated data for training, and such data is especially hard to obtain in drug discovery.
How can we improve a model's generalization and accuracy given a limited amount of annotated data?
Self-supervised learning is an unsupervised learning technique.
• It does not require annotations (labels).
• It is aimed at leveraging large amounts of unlabeled data.
The goal is to learn useful feature representations that can then be transferred to other downstream tasks.
[Figure labels: Dataset, Model, Self-supervision, Transfer learning, Supervised downstream task]
After being trained with self-supervision, the model can then be fine-tuned on downstream tasks using transfer learning.
Example: Train the model to predict the logP value.
it is possible to find pretext tasks that allow the model to learn useful, general feature representations that can then be used on a variety of downstream tasks. Once the model is trained on the pretext tasks, it can be fine-tuned on the downstream tasks of interest. This transfer learning step should be especially beneficial when the downstream tasks contain only a small number of samples.
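The two-stage idea described above (pretext training, then fine-tuning on a small labeled set) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the two toy features stand in for a molecular representation, a logistic regression stands in for the graph model, and the pretext and downstream labels are hypothetical. It only shows the mechanics of carrying pretrained weights into the downstream task and is not the framework evaluated in the slides.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train(xs, ys, init_weights, lr, epochs):
    """Full-batch gradient descent on the logistic loss, starting from init_weights."""
    w = list(init_weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def accuracy(xs, ys, w):
    return sum((dot(w, x) > 0) == bool(y) for x, y in zip(xs, ys)) / len(ys)


rng = random.Random(0)


def sample(n):
    # Two toy features plus a constant bias term (hypothetical stand-in for molecules).
    return [[rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0] for _ in range(n)]


# Stage 1, pretext task: the label is computed from the input itself, so no annotation is needed.
pretext_x = sample(200)
pretext_y = [int(x[0] > x[1]) for x in pretext_x]
pretrained = train(pretext_x, pretext_y, [0.0, 0.0, 0.0], lr=0.5, epochs=200)

# Stage 2, downstream task: only a handful of annotated samples (hypothetical related property).
downstream_x = sample(8)
downstream_y = [int(x[0] - x[1] > 0.2) for x in downstream_x]

# Fine-tuning starts from the pretrained weights and uses a smaller learning rate.
finetuned = train(downstream_x, downstream_y, pretrained, lr=0.1, epochs=20)

print("pretext accuracy:", accuracy(pretext_x, pretext_y, pretrained))
print("fine-tuned weights:", finetuned)
```

In a real setting the shared part would be the graph encoder and only the task-specific head would be replaced before fine-tuning; the single weight vector here collapses both for brevity.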
Methods
of molecules
• Atom: Classification problem that predicts which type of fragment each atom belongs to (from the 2000 possible fragments used in the SAScore [7] computation).
• Fragment: Binary classification problem in which molecules are broken into fragments and, for every pair of fragments, the model predicts whether both originate from the same molecule.
• Molecule: Multilabel classification problem that predicts which fragments can be found in the molecule (from the same 2000 fragments).
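As a rough illustration of how training targets for the three pretext tasks could be derived, the sketch below builds labels from molecules that have already been decomposed into fragments. The input format (a dict of fragments with a `type` id and a list of `atoms`), the molecule names, and all function names are hypothetical; a real pipeline would obtain the decomposition from a cheminformatics toolkit, and the slides do not specify how this was implemented.

```python
from itertools import combinations

NUM_FRAGMENT_TYPES = 2000  # size of the fragment vocabulary stated in the slides

# Hypothetical pre-fragmented molecules: each fragment has a type id and its atom indices.
molecules = {
    "mol_a": [{"type": 3, "atoms": [0, 1]}, {"type": 17, "atoms": [2, 3, 4]}],
    "mol_b": [{"type": 17, "atoms": [0, 1, 2]}, {"type": 42, "atoms": [3]}],
}


def atom_labels(fragments):
    """Atom task: map each atom index to the type of the fragment it belongs to."""
    return {atom: frag["type"] for frag in fragments for atom in frag["atoms"]}


def fragment_pair_labels(mols):
    """Fragment task: for every pair of fragments, 1 if both come from the same molecule."""
    items = [(name, idx) for name, frags in mols.items() for idx in range(len(frags))]
    return [((a, b), int(a[0] == b[0])) for a, b in combinations(items, 2)]


def molecule_multilabel(fragments, vocab_size=NUM_FRAGMENT_TYPES):
    """Molecule task: multi-hot vector marking which fragment types occur in the molecule."""
    target = [0] * vocab_size
    for frag in fragments:
        target[frag["type"]] = 1
    return target


atom_targets = atom_labels(molecules["mol_a"])
pair_targets = fragment_pair_labels(molecules)
mol_target = molecule_multilabel(molecules["mol_a"])

print(atom_targets)
print(pair_targets)
print(sum(mol_target), "fragment types present in mol_a")
```

None of these targets needs an experimental annotation; each one is computed from the molecular structure alone, which is what makes the tasks self-supervised.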
ADME-T datasets (from MoleculeNet [8])
- BACE: Contains 1,522 samples with binary labels for binding results against human beta-secretase 1 (BACE-1).
- BBBP: Contains 2,053 samples with binary labels for the permeability of the compounds across the blood-brain barrier.
- ClinTox: Contains 1,491 samples with two distinct binary labels. The first label refers to clinical trial toxicity and the second to FDA approval status.
- SIDER: Contains 1,427 samples with binary labels for 27 different drug side-effect categories.
- ToxCast: Contains 8,615 samples with 617 binary labels based on the results of in vitro toxicology experiments.
- Tox21: Contains 8,014 samples with 12 binary labels based on toxicity measurements.
using 10 different train / validation / test splits generated with scaffold splitting. Each model was trained on the training data and a checkpoint was saved after each epoch. The best checkpoint was selected via the validation set and the ROC-AUC was measured on the test set.
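The checkpoint selection step can be sketched as follows. The slides do not say which validation metric was used to pick the best checkpoint, so this sketch assumes validation ROC-AUC; the checkpoint dictionaries, scores, and labels are made-up toy values, and `roc_auc` is a small pairwise implementation written for this example rather than a library call.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best_checkpoint(checkpoints, val_labels):
    """Pick the checkpoint with the highest validation ROC-AUC (assumed selection metric)."""
    return max(checkpoints, key=lambda c: roc_auc(val_labels, c["val_scores"]))


# Toy predictions saved after each epoch (hypothetical values).
val_labels = [1, 0, 1, 0]
test_labels = [1, 0, 0, 1]
checkpoints = [
    {"epoch": 1, "val_scores": [0.6, 0.7, 0.4, 0.3], "test_scores": [0.5, 0.5, 0.5, 0.5]},
    {"epoch": 2, "val_scores": [0.9, 0.2, 0.8, 0.4], "test_scores": [0.8, 0.3, 0.75, 0.7]},
]

best = select_best_checkpoint(checkpoints, val_labels)
test_auc = roc_auc(test_labels, best["test_scores"])
print("best epoch:", best["epoch"], "test ROC-AUC:", test_auc)
```

The test set is touched only once, after the checkpoint has been chosen on the validation set, so the reported ROC-AUC is not biased by the selection step.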
graph-based predictive models for predicting molecular properties. Unfortunately, for the framework tested, performance depends largely on the model and the dataset. The challenge lies in designing pretext tasks that improve performance consistently across model architectures and downstream tasks.
• John J. Irwin and Brian K. Shoichet. ZINC – a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 2005.
• Anna Gaulton, Anne Hersey, Michał Nowotka, A. Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J. Bellis, Elena Cibrián-Uhalte, Mark Davies, Nathan Dedman, Anneli Karlsson, María Paula Magariños, John P. Overington, George Papadatos, Ines Smit, and Andrew R. Leach. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, January 2017. https://doi.org/10.1093/nar/gkw1074
• Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907
• Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826
• Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, August 2019. ISSN 1549-9596. doi: 10.1021/acs.jcim.9b00237. URL https://doi.org/10.1021/acs.jcim.9b00237
• Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1, 8, 2009. https://doi.org/10.1186/1758-2946-1-8
• Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. MoleculeNet: A benchmark for molecular machine learning. CoRR, abs/1703.00564, 2017.
• Chang-Ying Ma, Sheng-Yong Yang, Hui Zhang, Ming-Li Xiang, Qi Huang, and Yu-Quan Wei. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. Journal of Pharmaceutical and Biomedical Analysis, 47(4–5):677–682, 2008.
• Miriam Carbon-Mangels and Michael C. Hutter. Selecting relevant descriptors for classification by Bayesian estimates: A comparison with decision trees and support vector machines approaches for disparate data sets. Molecular Informatics, 30:885–895, 2011. https://doi.org/10.1002/minf.201100069
• Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou. ADMET evaluation in drug discovery. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Molecular Pharmaceutics, 13(8):2855–2866, 2016. PMID: 27379394.
• Youjun Xu, Ziwei Dai, Fangjin Chen, Shuaishi Gao, Jianfeng Pei, and Luhua Lai. Deep learning for drug-induced liver injury. Journal of Chemical Information and Modeling, 55(10):2085–2093, 2015. PMID: 26437739.
• Mark Wenlock and Nicholas Tomkinson. Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. DOI: 10.6019/CHEMBL3301361
• Dominique Douguet. Data sets representative of the structures and experimental properties of FDA-approved drugs. ACS Medicinal Chemistry Letters, 9(3):204–209, 2018.