Improving Molecular Property Prediction using Self-supervised Learning, Elix, CBI 2021

Elix
October 27, 2021


Transcript

  1. Improving Molecular Property Prediction
    using Self-supervised Learning
    27th October 2021
    Laurent Dillard


  2. 2
    Table of Contents
    • Motivation
    • Self-supervised learning
    • Methods
    • Results
    • Conclusion
    • Appendix
    • References


  3. 3
    • Data scarcity is a common bane of deep learning applications and drug discovery is no exception to it.
    Motivation


  4. 4
    • Data scarcity is a common bane of deep learning applications and drug discovery is no exception to it.
    • Deep learning models typically require large amounts of annotated data for training, which is especially
    challenging in the context of drug discovery.
    Motivation


  5. 5
    • Data scarcity is a common bane of deep learning applications and drug discovery is no exception to it.
    • Deep learning models typically require large amounts of annotated data for training, which is especially
    challenging in the context of drug discovery.
    How can we improve a model's generalization and accuracy given a limited amount of annotated data?
    Motivation


  6. 6
    Annotated data is hard to come by but…
    Motivation


  7. 7
    Annotated data is hard to come by but…
    Large databases1,2,3 of chemical structures are growing at unprecedented rates.
    Motivation


  8. 8
    Annotated data is hard to come by but…
    Large databases1,2,3 of chemical structures are growing at unprecedented rates.
    These databases can be leveraged using unsupervised learning techniques.
    Motivation


  9. 9
    Self-supervised learning
    What is self-supervised learning?
    • Self-supervised learning is an unsupervised learning technique.


  10. 10
    Self-supervised learning
    What is self-supervised learning?
    • Self-supervised learning is an unsupervised learning technique.
    • It does not require annotations (labels).


  11. 11
    Self-supervised learning
    What is self-supervised learning?
    • Self-supervised learning is an unsupervised learning technique.
    • It does not require annotations (labels).
    • It is aimed at leveraging large amounts of unlabeled data.


  12. 12
    Self-supervised learning
    What is self-supervised learning?
    • Self-supervised learning is an unsupervised learning technique.
    • It does not require annotations (labels).
    • It is aimed at leveraging large amounts of unlabeled data.
    The goal is to learn useful feature representations that can then be transferred to other downstream tasks.


  13. 13
    Self-supervised learning
    Unlabeled
    Dataset
    Model


  14. 14
    Self-supervised learning
    Unlabeled
    Dataset
    Model
    Self-supervised pretext task Self-supervision
    Example: Train model to
    predict logP value.


  15. 15
    Self-supervised learning
    Unlabeled
    Dataset
    Model
    Self-supervised pretext task
    Labeled
    Dataset
    Model
    Supervised downstream task
    Transfer learning
    After being trained with
    self-supervision, the
    model can then be fine-tuned on downstream
    tasks using transfer
    learning.
    Self-supervision
    Example: Train model to
    predict logP value.


  16. 16
    Self-supervised learning
    The assumption of self-supervised learning is that it is possible to find pretext tasks that allow the model to
    learn useful, general feature representations that can then be used on a variety of different downstream
    tasks.
    Once the model is trained on the pretext tasks, it can be fine-tuned on downstream tasks of interest. This
    transfer learning step should be especially beneficial when the downstream tasks contain only a small
    number of samples.
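
The two-stage workflow described above can be sketched in a few lines. Everything below is a toy stand-in (a scalar "encoder" and ad-hoc update rules chosen purely for illustration), not the actual GNN training code from this work:

```python
class Encoder:
    """Shared feature extractor (a GNN in the talk; a scalar stub here)."""
    def __init__(self):
        self.weight = 0.0

    def encode(self, x):
        return self.weight + x


def pretrain(encoder, unlabeled):
    # Pretext task: the target is derived from the data itself
    # (e.g. a computed property such as logP), so no annotation is needed.
    for x in unlabeled:
        pseudo_label = 0.5 * x            # stand-in for a computed property
        encoder.weight += 0.01 * (pseudo_label - encoder.encode(x))
    return encoder


def finetune(encoder, labeled):
    # Downstream task: reuse the pretrained encoder, train a small head
    # on the (scarce) labeled data.
    head = 0.0
    for x, y in labeled:
        z = encoder.encode(x)
        head += 0.01 * (y - head * z) * z
    return head


encoder = pretrain(Encoder(), unlabeled=[0.1, 0.4, 0.9])
head = finetune(encoder, labeled=[(0.2, 1.0), (0.8, 0.0)])
```

The point is structural: the pretext labels are computed from the unlabeled data itself, and the fine-tuning stage reuses the pretrained encoder's parameters rather than starting from scratch.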


  17. 17
    Methods
    Our work focused on Graph Neural Network (GNN) architectures, which have proved very effective for
    molecular applications. We considered three different types of GNNs: GCN4, GIN5 and DMPNN6.
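
As a rough intuition for how these architectures operate (the real GCN, GIN and DMPNN layers use learned weight matrices and differ in their aggregation schemes), a single message-passing step can be sketched with plain Python lists standing in for tensors:

```python
def gcn_layer(features, adjacency):
    """Simplified mean-aggregation message-passing step: each atom's
    features are averaged with its neighbours' (plus a self-loop).
    No learned weights or nonlinearity, for illustration only."""
    n = len(features)
    dim = len(features[0])
    out = []
    for i in range(n):
        # neighbours of atom i, plus atom i itself (self-loop)
        neigh = [j for j in range(n) if adjacency[i][j]] + [i]
        out.append([sum(features[j][k] for j in neigh) / len(neigh)
                    for k in range(dim)])
    return out
```

Stacking several such layers lets information propagate across the molecular graph, after which atom features are pooled into a molecule-level representation.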


  18. Self-supervision pretext tasks
    Focus on 3 distinct scales of molecules
    18
    Methods


  19. 19
    Methods
    Self-supervision pretext tasks
    Focus on 3 distinct scales of molecules
    • Atom: Classification problem to predict what type of fragment the atom belongs to (from 2000 possible
    fragments used in SAScore7 computation).


  20. Self-supervision pretext tasks
    Focus on 3 distinct scales of molecules
    • Atom: Classification problem to predict what type of fragment the atom belongs to (from 2000 possible
    fragments used in SAScore7 computation).
    • Fragment: Binary classification problem based on breaking molecules into fragments and predicting, for
    all pairwise comparisons, whether the fragments originate from the same molecule.
    20
    Methods


  21. 21
    Methods
    Self-supervision pretext tasks
    Focus on 3 distinct scales of molecules
    • Atom: Classification problem to predict what type of fragment the atom belongs to (from 2000 possible
    fragments used in SAScore7 computation).
    • Fragment: Binary classification problem based on breaking molecules into fragments and predicting, for
    all pairwise comparisons, whether the fragments originate from the same molecule.
    • Molecule: Multilabel classification problem to predict which fragments can be found in the molecule (from
    the same 2000 fragments).
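
Constructing the targets for the fragment- and molecule-scale tasks can be sketched as follows. The fragment strings and the tiny vocabulary below are hypothetical toy examples; the talk uses the 2000 fragments from the SAScore computation and a real fragmentation of each molecule:

```python
from itertools import combinations

# Toy fragment vocabulary and per-molecule fragment lists (hypothetical).
FRAGMENT_VOCAB = ["c1ccccc1", "C(=O)O", "CN", "CCl"]

molecule_fragments = {
    "mol_a": ["c1ccccc1", "C(=O)O"],
    "mol_b": ["CN", "CCl"],
}


def fragment_pair_labels(molecule_fragments):
    """Fragment scale: label each fragment pair 1 if both fragments
    originate from the same molecule, else 0."""
    tagged = [(frag, mol)
              for mol, frags in molecule_fragments.items()
              for frag in frags]
    return [((f1, f2), int(m1 == m2))
            for (f1, m1), (f2, m2) in combinations(tagged, 2)]


def molecule_multihot(frags, vocab=FRAGMENT_VOCAB):
    """Molecule scale: multi-hot target over the fragment vocabulary,
    for multilabel classification of which fragments are present."""
    present = set(frags)
    return [int(v in present) for v in vocab]
```

With the two toy molecules above, `fragment_pair_labels` yields six pairs, of which exactly two (the within-molecule ones) are positive.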


  22. 22
    Methods


  23. Self-supervision dataset: subset of 250,000 molecules from ZINC15
    Downstream datasets: ADME-T datasets (from MoleculeNet8)
    - BACE: Contains 1,522 samples with binary labels on binding results with human Beta-secretase 1 (BACE-1).
    - BBBP: Contains 2,053 samples with binary labels about permeability of the compounds with the blood-brain barrier.
    - ClinTox: Contains 1,491 samples with two distinct binary labels associated. The first label refers to clinical trial toxicity
    and the second one to the FDA approval status.
    - SIDER: Contains 1,427 samples with binary labels for 27 different drug side-effect categories.
    - ToxCast: Contains 8,615 samples with 617 binary labels based on the results of in vitro toxicology experiments.
    - Tox21: Contains 8,014 samples with 12 binary labels based on toxicity measurements.
    23
    Methods


  24. Evaluation procedure
    For each downstream task, the models were evaluated using 10 different train / validation / test splits
    generated with scaffold splitting.
    Each model was trained on the training data and a checkpoint was saved after each epoch. The best
    checkpoint was selected via the validation set and the ROC-AUC was measured on the test set.
    24
    Methods
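
A minimal sketch of scaffold splitting, assuming scaffold keys (in practice Bemis-Murcko scaffold SMILES, computed e.g. with RDKit) have already been extracted for each molecule. The greedy fill strategy shown here is one common variant, not necessarily the exact implementation used in this work:

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: molecules sharing a scaffold stay in the
    same subset, so test scaffolds are unseen during training.
    `scaffolds` maps a molecule id to its scaffold key."""
    groups = defaultdict(list)
    for mol_id, scaffold in scaffolds.items():
        groups[scaffold].append(mol_id)
    # Fill with the largest scaffold groups first, so small, rare
    # scaffolds tend to end up in validation / test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because all molecules sharing a scaffold land in the same subset, the test set contains scaffolds never seen during training, which gives a more realistic estimate of generalization than a random split.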


  25. 25
    Results
    Model / Dataset BBBP (2039) SIDER (1427) Clintox (1478) BACE (1513) Tox21 (7831) Toxcast (8575) Average
    GCN (no pretrain) 0.898 ± 0.033 0.605 ± 0.027 0.909 ± 0.038 0.831 ± 0.027 0.769 ± 0.013 0.651 ± 0.014 0.777 ± 0.025
    GCN SSL 0.904 ± 0.035 0.615 ± 0.017 0.927 ± 0.026 0.858 ± 0.023 0.761 ± 0.024 0.653 ± 0.012 0.786 ± 0.023
    GIN (no pretrain) 0.885 ± 0.038 0.602 ± 0.023 0.902 ± 0.045 0.844 ± 0.028 0.773 ± 0.012 0.625 ± 0.013 0.772 ± 0.027
    GIN SSL 0.891 ± 0.032 0.614 ± 0.017 0.904 ± 0.033 0.854 ± 0.025 0.773 ± 0.018 0.630 ± 0.013 0.778 ± 0.023
    DMPNN (no pretrain) 0.894 ± 0.038 0.620 ± 0.021 0.908 ± 0.029 0.800 ± 0.034 0.762 ± 0.018 0.637 ± 0.012 0.770 ± 0.025
    DMPNN SSL 0.898 ± 0.034 0.605 ± 0.025 0.910 ± 0.040 0.855 ± 0.027 0.757 ± 0.020 0.618 ± 0.013 0.774 ± 0.026


  26. 26
    Results
    Ablation study
    Model / Dataset BBBP (2039) SIDER (1427) Clintox (1478) BACE (1513) Tox21 (7831) Toxcast (8575) Average
    GCN (no pretrain) 0.898 ± 0.033 0.605 ± 0.027 0.909 ± 0.038 0.831 ± 0.027 0.769 ± 0.013 0.651 ± 0.014 0.777 ± 0.025
    GCN SSL Atom only 0.905 ± 0.033 0.614 ± 0.024 0.894 ± 0.042 0.848 ± 0.019 0.766 ± 0.016 0.651 ± 0.008 0.780 ± 0.024
    GCN SSL Molecule only 0.898 ± 0.037 0.610 ± 0.024 0.905 ± 0.041 0.861 ± 0.028 0.752 ± 0.023 0.635 ± 0.008 0.777 ± 0.027
    GCN SSL Fragment only 0.908 ± 0.039 0.611 ± 0.028 0.900 ± 0.040 0.837 ± 0.028 0.769 ± 0.014 0.651 ± 0.010 0.779 ± 0.027
    GCN SSL Atom + Fragment 0.904 ± 0.037 0.621 ± 0.022 0.899 ± 0.040 0.850 ± 0.024 0.769 ± 0.020 0.652 ± 0.010 0.783 ± 0.025
    GCN SSL 0.904 ± 0.035 0.615 ± 0.017 0.927 ± 0.026 0.858 ± 0.023 0.761 ± 0.024 0.653 ± 0.012 0.786 ± 0.023


  27. 27
    Conclusion
    Self-supervised learning can effectively improve the performance of graph based predictive models for
    predicting molecular properties.
    Unfortunately, for the framework tested, the performance gains depend largely on the model and the dataset.
    The challenge lies in designing pretext tasks that improve performance consistently across model
    architectures and downstream tasks.


  28. 28
    References
    • [1]: PubChem. URL: https://pubchem.ncbi.nlm.nih.gov/
    • [2]: Irwin, John J. and Shoichet, Brian K. ZINC - a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 2005.
    • [3]: Anna Gaulton, Anne Hersey, Michał Nowotka, A. Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J. Bellis, Elena Cibrián-Uhalte, Mark Davies, Nathan Dedman, Anneli Karlsson, María Paula Magariños,
    John P. Overington, George Papadatos, Ines Smit, Andrew R. Leach. The ChEMBL database in 2017. Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D945–D954. https://doi.org/10.1093/nar/gkw1074
    • [4]: Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL: http://arxiv.org/abs/1609.02907
    • [5]: Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL: http://arxiv.org/abs/1810.00826
    • [6]: Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay.
    Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, Aug 2019. ISSN 1549-9596. doi: 10.1021/acs.jcim.9b00237. URL:
    https://doi.org/10.1021/acs.jcim.9b00237
    • [7]: Ertl, P. and Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform 1, 8 (2009). https://doi.org/10.1186/1758-2946-1-8
    • [8]: Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. MoleculeNet: a benchmark for molecular machine learning. CoRR, abs/1703.00564, 2017.
    • [9]: Chang-Ying Ma, Sheng-Yong Yang, Hui Zhang, Ming-Li Xiang, Qi Huang, Yu-Quan Wei. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. Journal of Pharmaceutical and
    Biomedical Analysis, Volume 47, Issues 4–5, 2008, Pages 677–682.
    • [10]: Carbon-Mangels, M. and Hutter, M.C. (2011). Selecting relevant descriptors for classification by Bayesian estimates: a comparison with decision trees and support vector machines approaches for disparate data sets. Mol. Inf., 30: 885–895.
    https://doi.org/10.1002/minf.201100069
    • [11]: Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou. ADMET evaluation in drug discovery: predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Molecular Pharmaceutics,
    13(8):2855–2866, 2016. PMID: 27379394.
    • [12]: Youjun Xu, Ziwei Dai, Fangjin Chen, Shuaishi Gao, Jianfeng Pei, and Luhua Lai. Deep learning for drug-induced liver injury. Journal of Chemical Information and Modeling, 55(10):2085–2093, 2015. PMID: 26437739.
    • [13]: Mark Wenlock and Nicholas Tomkinson. Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. DOI: 10.6019/CHEMBL3301361
    • [14]: Dominique Douguet. Data sets representative of the structures and experimental properties of FDA-approved drugs. ACS Medicinal Chemistry Letters, 9(3):204–209, 2018.


  29. 29
    Results
    Additional experiments
    Increasing the pretraining set size from 250k to 2M or 10M molecules did not yield a significant change in
    performance. The performance on the pretext tasks saturates in all cases; increasing the difficulty of the
    pretext tasks could change that.


  30. 30
    Appendix
    Model / Dataset | Bioavailability MA (640) ROC-AUC | CYP 2C9 Substrate (669) ROC-AUC | CYP 2D6 Substrate (667) ROC-AUC | hERG Blockers (665) ROC-AUC | DILI (475) ROC-AUC | Clearance Microsome AZ (1102) R2 | PPBR EDrug3d (828) R2
    GCN (no pretrain) | 0.698 | 0.618 | 0.649 | 0.870 | 0.856 | 0.043 | 0.345
    GCN SSL | 0.675 | 0.606 | 0.593 | 0.878 | 0.861 | 0.112 | 0.381
    Additional experiments on other ADME-T datasets
    Bioavailability MA9 is a dataset of 640 samples measuring absorption properties.
    The CYP 2C9 and CYP 2D610 substrate datasets contain 669 and 667 samples respectively, measuring metabolic properties.
    hERG Blockers11 and DILI12 contain 665 and 475 samples respectively, measuring toxicity properties.
    Clearance Microsome AZ13 contains 1102 samples measuring excretion properties.
    PPBR EDrug3d14 contains 828 samples and measures distribution properties.
