
Improving Molecular Property Prediction using Self-supervised Learning, Elix, CBI 2021

Elix
October 27, 2021


Transcript

  1. Table of Contents • Motivation • Self-supervised learning • Methods • Results • Conclusion • Appendix • References
  2. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception.
  3. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception. • Deep learning models typically require large amounts of annotated data for training, which is especially challenging in the context of drug discovery.
  4. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception. • Deep learning models typically require large amounts of annotated data for training, which is especially challenging in the context of drug discovery. How can we improve a model's generalization and accuracy given a limited amount of annotated data?
  5. Motivation • Annotated data is hard to come by, but large databases [1,2,3] of chemical structures are growing at unprecedented rates.
  6. Motivation • Annotated data is hard to come by, but large databases [1,2,3] of chemical structures are growing at unprecedented rates. These databases can be leveraged using unsupervised learning techniques.
  7. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels).
  8. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels). • It is aimed at leveraging large amounts of unlabeled data.
  9. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels). • It is aimed at leveraging large amounts of unlabeled data. The goal is to learn useful feature representations that can then be transferred to other downstream tasks.
  10. Self-supervised learning [Diagram] Self-supervision: unlabeled dataset → model trained on a self-supervised pretext task (example: train the model to predict the logP value). Transfer learning: labeled dataset → model trained on a supervised downstream task. After being trained with self-supervision, the model can then be fine-tuned on downstream tasks using transfer learning.
  11. Self-supervised learning The assumption of self-supervised learning is that it is possible to find pretext tasks that allow the model to learn useful, general feature representations that can then be used on a variety of downstream tasks. Once the model is trained on the pretext tasks, it can be fine-tuned on the downstream tasks of interest. This transfer-learning step should be especially beneficial when the downstream tasks contain only a low number of samples.
  12. Methods Our work focused on Graph Neural Network (GNN) architectures, which have proved to be very effective for molecular applications. We considered three different types of GNNs: GCN [4], GIN [5], and DMPNN [6].
  13. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation).
  14. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation). • Fragment: binary classification problem based on breaking molecules into fragments and predicting, for all pairwise comparisons, whether the fragments originate from the same molecule.
  15. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation). • Fragment: binary classification problem based on breaking molecules into fragments and predicting, for all pairwise comparisons, whether the fragments originate from the same molecule. • Molecule: multilabel classification problem predicting which fragments can be found in the molecule (from the same 2,000 fragments).
  16. Methods Self-supervision dataset: a subset of 250,000 molecules from ZINC15. Downstream datasets: ADME-T datasets (from MoleculeNet [8]):
    - BACE: 1,522 samples with binary labels on binding results against human beta-secretase 1 (BACE-1).
    - BBBP: 2,053 samples with binary labels on blood-brain barrier permeability of the compounds.
    - ClinTox: 1,491 samples with two distinct binary labels; the first refers to clinical trial toxicity and the second to FDA approval status.
    - SIDER: 1,427 samples with binary labels for 27 different drug side-effect categories.
    - ToxCast: 8,615 samples with 617 binary labels based on the results of in vitro toxicology experiments.
    - Tox21: 8,014 samples with 12 binary labels based on toxicity measurements.
  17. Methods Evaluation procedure: for each downstream task, the models were evaluated using 10 different train/validation/test splits generated with scaffold splitting. Each model was trained on the training data and a checkpoint was saved after each epoch. The best checkpoint was selected on the validation set, and the ROC-AUC was measured on the test set.
  18. Results Test ROC-AUC (mean ± std over the 10 scaffold splits):

| Model / Dataset | BBBP (2039) | SIDER (1427) | ClinTox (1478) | BACE (1513) | Tox21 (7831) | ToxCast (8575) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.898 ± 0.033 | 0.605 ± 0.027 | 0.909 ± 0.038 | 0.831 ± 0.027 | 0.769 ± 0.013 | 0.651 ± 0.014 | 0.777 ± 0.025 |
| GCN SSL | 0.904 ± 0.035 | 0.615 ± 0.017 | 0.927 ± 0.026 | 0.858 ± 0.023 | 0.761 ± 0.024 | 0.653 ± 0.012 | 0.786 ± 0.023 |
| GIN (no pretrain) | 0.885 ± 0.038 | 0.602 ± 0.023 | 0.902 ± 0.045 | 0.844 ± 0.028 | 0.773 ± 0.012 | 0.625 ± 0.013 | 0.772 ± 0.027 |
| GIN SSL | 0.891 ± 0.032 | 0.614 ± 0.017 | 0.904 ± 0.033 | 0.854 ± 0.025 | 0.773 ± 0.018 | 0.630 ± 0.013 | 0.778 ± 0.023 |
| DMPNN (no pretrain) | 0.894 ± 0.038 | 0.620 ± 0.021 | 0.908 ± 0.029 | 0.800 ± 0.034 | 0.762 ± 0.018 | 0.637 ± 0.012 | 0.770 ± 0.025 |
| DMPNN SSL | 0.898 ± 0.034 | 0.605 ± 0.025 | 0.910 ± 0.040 | 0.855 ± 0.027 | 0.757 ± 0.020 | 0.618 ± 0.013 | 0.774 ± 0.026 |
  19. Results Ablation study (GCN, test ROC-AUC, mean ± std over the 10 scaffold splits):

| Model / Dataset | BBBP (2039) | SIDER (1427) | ClinTox (1478) | BACE (1513) | Tox21 (7831) | ToxCast (8575) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.898 ± 0.033 | 0.605 ± 0.027 | 0.909 ± 0.038 | 0.831 ± 0.027 | 0.769 ± 0.013 | 0.651 ± 0.014 | 0.777 ± 0.025 |
| GCN SSL Atom only | 0.905 ± 0.033 | 0.614 ± 0.024 | 0.894 ± 0.042 | 0.848 ± 0.019 | 0.766 ± 0.016 | 0.651 ± 0.008 | 0.780 ± 0.024 |
| GCN SSL Molecule only | 0.898 ± 0.037 | 0.610 ± 0.024 | 0.905 ± 0.041 | 0.861 ± 0.028 | 0.752 ± 0.023 | 0.635 ± 0.008 | 0.777 ± 0.027 |
| GCN SSL Fragment only | 0.908 ± 0.039 | 0.611 ± 0.028 | 0.900 ± 0.040 | 0.837 ± 0.028 | 0.769 ± 0.014 | 0.651 ± 0.010 | 0.779 ± 0.027 |
| GCN SSL Atom + Fragment | 0.904 ± 0.037 | 0.621 ± 0.022 | 0.899 ± 0.040 | 0.850 ± 0.024 | 0.769 ± 0.20 | 0.652 ± 0.010 | 0.783 ± 0.025 |
| GCN SSL | 0.904 ± 0.035 | 0.615 ± 0.017 | 0.927 ± 0.026 | 0.858 ± 0.023 | 0.761 ± 0.024 | 0.653 ± 0.012 | 0.786 ± 0.023 |
  20. Conclusion Self-supervised learning can effectively improve the performance of graph-based predictive models for molecular properties. Unfortunately, for the framework tested, the performance gain largely depends on the model and the dataset. The challenge lies in designing pretext tasks that improve performance consistently across model architectures and downstream tasks.
  21. References
    • [1]: PubChem. URL https://pubchem.ncbi.nlm.nih.gov/
    • [2]: John J. Irwin and Brian K. Shoichet. ZINC – a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 2005.
    • [3]: Anna Gaulton, Anne Hersey, Michał Nowotka, A. Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J. Bellis, Elena Cibrián-Uhalte, Mark Davies, Nathan Dedman, Anneli Karlsson, María Paula Magariños, John P. Overington, George Papadatos, Ines Smit, and Andrew R. Leach. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, January 2017. https://doi.org/10.1093/nar/gkw1074
    • [4]: Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907
    • [5]: Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826
    • [6]: Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, August 2019. doi: 10.1021/acs.jcim.9b00237. URL https://doi.org/10.1021/acs.jcim.9b00237
    • [7]: P. Ertl and A. Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1:8, 2009. https://doi.org/10.1186/1758-2946-1-8
    • [8]: Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. MoleculeNet: a benchmark for molecular machine learning. CoRR, abs/1703.00564, 2017.
    • [9]: Chang-Ying Ma, Sheng-Yong Yang, Hui Zhang, Ming-Li Xiang, Qi Huang, and Yu-Quan Wei. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. Journal of Pharmaceutical and Biomedical Analysis, 47(4–5):677–682, 2008.
    • [10]: M. Carbon-Mangels and M. C. Hutter. Selecting relevant descriptors for classification by Bayesian estimates: a comparison with decision trees and support vector machines approaches for disparate data sets. Molecular Informatics, 30:885–895, 2011. https://doi.org/10.1002/minf.201100069
    • [11]: Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou. ADMET evaluation in drug discovery. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Molecular Pharmaceutics, 13(8):2855–2866, 2016. PMID: 27379394.
    • [12]: Youjun Xu, Ziwei Dai, Fangjin Chen, Shuaishi Gao, Jianfeng Pei, and Luhua Lai. Deep learning for drug-induced liver injury. Journal of Chemical Information and Modeling, 55(10):2085–2093, 2015. PMID: 26437739.
    • [13]: Mark Wenlock and Nicholas Tomkinson. Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. DOI: 10.6019/CHEMBL3301361
    • [14]: Dominique Douguet. Data sets representative of the structures and experimental properties of FDA-approved drugs. ACS Medicinal Chemistry Letters, 9(3):204–209, 2018.
  22. Results Additional experiments: increasing the pretraining set size from 250k to 2M or 10M molecules did not yield a significant change in performance. The performance on the pretext tasks saturates in all cases; increasing the difficulty of the pretext tasks could change that.
  23. Appendix Additional experiments on other ADME-T datasets:

| Model / Dataset | Bioavailability MA (640), ROC-AUC | CYP 2C9 Substrate (669), ROC-AUC | CYP 2D6 Substrate (667), ROC-AUC | hERG Blockers (665), ROC-AUC | DILI (475), ROC-AUC | Clearance Microsome AZ (1102), R² | PPBR EDrug3d (828), R² |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.698 | 0.618 | 0.649 | 0.870 | 0.856 | 0.043 | 0.345 |
| GCN SSL | 0.675 | 0.606 | 0.593 | 0.878 | 0.861 | 0.112 | 0.381 |

Bioavailability MA [9] is a dataset of 640 samples measuring absorption properties. The CYP 2C9 and CYP 2D6 [10] datasets contain 669 and 667 samples respectively, measuring metabolic properties. hERG Blockers [11] and DILI [12] contain 665 and 475 samples respectively, measuring toxicity properties. Clearance Microsome AZ [13] contains 1,102 samples measuring excretion properties. PPBR EDrug3d [14] contains 828 samples and measures distribution properties.