
Improving Molecular Property Prediction using Self-supervised Learning, Elix, CBI 2021

Elix
October 27, 2021


Transcript

  1. Table of Contents • Motivation • Self-supervised learning • Methods • Results • Conclusion • Appendix • References
  2. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception.
  3. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception. • Deep learning models typically require large amounts of annotated data for training, which is especially challenging in the context of drug discovery.
  4. Motivation • Data scarcity is a common bane of deep learning applications, and drug discovery is no exception. • Deep learning models typically require large amounts of annotated data for training, which is especially challenging in the context of drug discovery. How can we improve a model's generalization and accuracy given a limited amount of annotated data?
  5. Motivation • Annotated data is hard to come by, but large databases [1,2,3] of chemical structures are growing at unprecedented rates.
  6. Motivation • Annotated data is hard to come by, but large databases [1,2,3] of chemical structures are growing at unprecedented rates. These databases can be leveraged using unsupervised learning techniques.
  7. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels).
  8. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels). • It is aimed at leveraging large amounts of unlabeled data.
  9. Self-supervised learning: What is self-supervised learning? • Self-supervised learning is an unsupervised learning technique. • It does not require annotations (labels). • It is aimed at leveraging large amounts of unlabeled data. The goal is to learn useful feature representations that can then be transferred to other downstream tasks.
  10. Self-supervised learning [Diagram] Self-supervision: unlabeled dataset → model trained on a self-supervised pretext task (example: train the model to predict the logP value). Transfer learning: labeled dataset → model trained on a supervised downstream task. After being trained with self-supervision, the model can then be fine-tuned on downstream tasks using transfer learning.
  11. Self-supervised learning The assumption of self-supervised learning is that it is possible to find pretext tasks that allow the model to learn useful, general feature representations that can then be used on a variety of downstream tasks. Once the model is trained on the pretext tasks, it can be fine-tuned on the downstream tasks of interest. This transfer-learning step should be especially beneficial when the downstream tasks contain only a low number of samples.
  12. Methods Our work focused on Graph Neural Network (GNN) architectures, which have proved to be very effective for molecular applications. We considered three different types of GNNs: GCN [4], GIN [5], and DMPNN [6].
  13. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation).
  14. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation). • Fragment: binary classification problem based on breaking molecules into fragments and predicting, for all pairwise comparisons, whether the fragments originate from the same molecule.
  15. Methods Self-supervision pretext tasks focus on three distinct scales of molecules: • Atom: classification problem predicting which type of fragment the atom belongs to (from the 2,000 possible fragments used in the SAScore [7] computation). • Fragment: binary classification problem based on breaking molecules into fragments and predicting, for all pairwise comparisons, whether the fragments originate from the same molecule. • Molecule: multilabel classification problem predicting which fragments can be found in the molecule (from the same 2,000 fragments).
  16. Methods Self-supervision dataset: a subset of 250,000 molecules from ZINC15. Downstream datasets: ADME-T datasets (from MoleculeNet [8]):
    - BACE: 1,522 samples with binary labels on binding results against human beta-secretase 1 (BACE-1).
    - BBBP: 2,053 samples with binary labels on blood-brain barrier permeability of the compounds.
    - ClinTox: 1,491 samples with two distinct binary labels; the first refers to clinical trial toxicity and the second to FDA approval status.
    - SIDER: 1,427 samples with binary labels for 27 different drug side-effect categories.
    - ToxCast: 8,615 samples with 617 binary labels based on the results of in vitro toxicology experiments.
    - Tox21: 8,014 samples with 12 binary labels based on toxicity measurements.
  17. Methods Evaluation procedure: for each downstream task, the models were evaluated using 10 different train/validation/test splits generated with scaffold splitting. Each model was trained on the training data and a checkpoint was saved after each epoch. The best checkpoint was selected on the validation set, and the ROC-AUC was measured on the test set.
  18. Results Test ROC-AUC (mean ± std over the 10 scaffold splits):

| Model / Dataset | BBBP (2039) | SIDER (1427) | ClinTox (1478) | BACE (1513) | Tox21 (7831) | ToxCast (8575) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.898 ± 0.033 | 0.605 ± 0.027 | 0.909 ± 0.038 | 0.831 ± 0.027 | 0.769 ± 0.013 | 0.651 ± 0.014 | 0.777 ± 0.025 |
| GCN SSL | 0.904 ± 0.035 | 0.615 ± 0.017 | 0.927 ± 0.026 | 0.858 ± 0.023 | 0.761 ± 0.024 | 0.653 ± 0.012 | 0.786 ± 0.023 |
| GIN (no pretrain) | 0.885 ± 0.038 | 0.602 ± 0.023 | 0.902 ± 0.045 | 0.844 ± 0.028 | 0.773 ± 0.012 | 0.625 ± 0.013 | 0.772 ± 0.027 |
| GIN SSL | 0.891 ± 0.032 | 0.614 ± 0.017 | 0.904 ± 0.033 | 0.854 ± 0.025 | 0.773 ± 0.018 | 0.630 ± 0.013 | 0.778 ± 0.023 |
| DMPNN (no pretrain) | 0.894 ± 0.038 | 0.620 ± 0.021 | 0.908 ± 0.029 | 0.800 ± 0.034 | 0.762 ± 0.018 | 0.637 ± 0.012 | 0.770 ± 0.025 |
| DMPNN SSL | 0.898 ± 0.034 | 0.605 ± 0.025 | 0.910 ± 0.040 | 0.855 ± 0.027 | 0.757 ± 0.020 | 0.618 ± 0.013 | 0.774 ± 0.026 |
  19. Results Ablation study (GCN, test ROC-AUC, mean ± std over the 10 scaffold splits):

| Model / Dataset | BBBP (2039) | SIDER (1427) | ClinTox (1478) | BACE (1513) | Tox21 (7831) | ToxCast (8575) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.898 ± 0.033 | 0.605 ± 0.027 | 0.909 ± 0.038 | 0.831 ± 0.027 | 0.769 ± 0.013 | 0.651 ± 0.014 | 0.777 ± 0.025 |
| GCN SSL Atom only | 0.905 ± 0.033 | 0.614 ± 0.024 | 0.894 ± 0.042 | 0.848 ± 0.019 | 0.766 ± 0.016 | 0.651 ± 0.008 | 0.780 ± 0.024 |
| GCN SSL Molecule only | 0.898 ± 0.037 | 0.610 ± 0.024 | 0.905 ± 0.041 | 0.861 ± 0.028 | 0.752 ± 0.023 | 0.635 ± 0.008 | 0.777 ± 0.027 |
| GCN SSL Fragment only | 0.908 ± 0.039 | 0.611 ± 0.028 | 0.900 ± 0.040 | 0.837 ± 0.028 | 0.769 ± 0.014 | 0.651 ± 0.010 | 0.779 ± 0.027 |
| GCN SSL Atom + Fragment | 0.904 ± 0.037 | 0.621 ± 0.022 | 0.899 ± 0.040 | 0.850 ± 0.024 | 0.769 ± 0.20 | 0.652 ± 0.010 | 0.783 ± 0.025 |
| GCN SSL | 0.904 ± 0.035 | 0.615 ± 0.017 | 0.927 ± 0.026 | 0.858 ± 0.023 | 0.761 ± 0.024 | 0.653 ± 0.012 | 0.786 ± 0.023 |
  20. Conclusion Self-supervised learning can effectively improve the performance of graph-based predictive models for molecular properties. Unfortunately, for the framework tested, the performance gain largely depends on the model and the dataset. The challenge lies in designing pretext tasks that improve performance consistently across model architectures and downstream tasks.
  21. References
    • [1]: PubChem. URL https://pubchem.ncbi.nlm.nih.gov/
    • [2]: John J. Irwin and Brian K. Shoichet. ZINC – a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 2005.
    • [3]: Anna Gaulton, Anne Hersey, Michał Nowotka, A. Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J. Bellis, Elena Cibrián-Uhalte, Mark Davies, Nathan Dedman, Anneli Karlsson, María Paula Magariños, John P. Overington, George Papadatos, Ines Smit, and Andrew R. Leach. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, January 2017. https://doi.org/10.1093/nar/gkw1074
    • [4]: Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907
    • [5]: Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? CoRR, abs/1810.00826, 2018. URL http://arxiv.org/abs/1810.00826
    • [6]: Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, August 2019. doi: 10.1021/acs.jcim.9b00237. URL https://doi.org/10.1021/acs.jcim.9b00237
    • [7]: P. Ertl and A. Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1:8, 2009. https://doi.org/10.1186/1758-2946-1-8
    • [8]: Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay S. Pande. MoleculeNet: a benchmark for molecular machine learning. CoRR, abs/1703.00564, 2017.
    • [9]: Chang-Ying Ma, Sheng-Yong Yang, Hui Zhang, Ming-Li Xiang, Qi Huang, and Yu-Quan Wei. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. Journal of Pharmaceutical and Biomedical Analysis, 47(4–5):677–682, 2008.
    • [10]: M. Carbon-Mangels and M. C. Hutter. Selecting relevant descriptors for classification by Bayesian estimates: a comparison with decision trees and support vector machines approaches for disparate data sets. Molecular Informatics, 30:885–895, 2011. https://doi.org/10.1002/minf.201100069
    • [11]: Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou. ADMET evaluation in drug discovery. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Molecular Pharmaceutics, 13(8):2855–2866, 2016. PMID: 27379394.
    • [12]: Youjun Xu, Ziwei Dai, Fangjin Chen, Shuaishi Gao, Jianfeng Pei, and Luhua Lai. Deep learning for drug-induced liver injury. Journal of Chemical Information and Modeling, 55(10):2085–2093, 2015. PMID: 26437739.
    • [13]: Mark Wenlock and Nicholas Tomkinson. Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. DOI: 10.6019/CHEMBL3301361
    • [14]: Dominique Douguet. Data sets representative of the structures and experimental properties of FDA-approved drugs. ACS Medicinal Chemistry Letters, 9(3):204–209, 2018.
  22. Results Additional experiments: increasing the pretraining set size from 250k to 2M or 10M molecules did not yield a significant change in performance. The performance on the pretext tasks saturates in all cases; increasing the difficulty of the pretext tasks could change that.
  23. Appendix Additional experiments on other ADME-T datasets:

| Model / Dataset | Bioavailability MA (640), ROC-AUC | CYP 2C9 Substrate (669), ROC-AUC | CYP 2D6 Substrate (667), ROC-AUC | hERG Blockers (665), ROC-AUC | DILI (475), ROC-AUC | Clearance Microsome AZ (1102), R² | PPBR EDrug3d (828), R² |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN (no pretrain) | 0.698 | 0.618 | 0.649 | 0.870 | 0.856 | 0.043 | 0.345 |
| GCN SSL | 0.675 | 0.606 | 0.593 | 0.878 | 0.861 | 0.112 | 0.381 |

Bioavailability MA [9] is a dataset of 640 samples measuring absorption properties. The CYP 2C9 and CYP 2D6 [10] datasets contain 669 and 667 samples respectively, measuring metabolic properties. hERG Blockers [11] and DILI [12] contain 665 and 475 samples respectively, measuring toxicity properties. Clearance Microsome AZ [13] contains 1,102 samples measuring excretion properties. PPBR EDrug3d [14] contains 828 samples and measures distribution properties.