Slide 1

Slide 1 text

SmilesFormer: Language Model for Molecular Design
Elix, Inc.
Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo, Japan | October 26, 2022
Joshua Owoyemi, Ph.D. & Nazim Medzhidov, Ph.D.

Slide 2

Slide 2 text

2 Introduction
● Generative models play a major role in discovering and designing new molecules, which is key to innovation in in silico drug discovery.
● A vast amount of molecular data is available, so generative models should be able to learn the concepts of valid and desirable molecules through a data-centric approach.
Challenges:
● A model that can take advantage of the vast available datasets to learn efficient molecular representations.
● Methods to efficiently traverse the possible chemical space in order to generate molecules that satisfy multiple desired objectives.
● A demonstration of how the proposed method can be utilized in a practical use case.

Slide 3

Slide 3 text

● Language models can fill in an incomplete sentence.
● Recent language models can scale to vast amounts of data.
3 Proposed Method: Language model for molecule generation
"The quick brown fox ___ __ the lazy dog" → Language Model → "The quick brown fox walks around the lazy dog" / "The quick brown fox jumps over the lazy dog" / "The quick brown fox bumps into the lazy dog"
* The Common Crawl dataset contains nearly 1 trillion words[+]
[+] T. Brown et al., "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

Slide 4

Slide 4 text

● Language models can fill in an incomplete sentence.
● Language models can scale to vast amounts of data.
4 Proposed Method: Language model for molecule generation
(The same fill-in-the-blank analogy as the previous slide, applied to SMILES strings: the language model completes an incomplete molecular sequence.)
* The ZINC dataset contains over 200 million compounds[^]
[^] J. J. Irwin et al., "ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery," Journal of Chemical Information and Modeling, vol. 60, no. 12, pp. 6065–6073, 2020, doi: 10.1021/acs.jcim.0c00675.

Slide 5

Slide 5 text

● We formulate the problem as a language-modeling problem, where the model generates a full sequence (sentence) from an incomplete one or from some prior knowledge.
● We employ a state-of-the-art language model to encode a molecular latent space by generating valid molecules (sequences) from fragments and fragment compositions.
5 Proposed Method: SmilesFormer Model
(Architecture diagram: Transformer[1] + VAE[2]; encoder (E), decoder (D), inputs, query sequence, reconstruction, backprop.)
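The bottleneck of such an encoder-decoder VAE can be illustrated with the standard reparameterization trick. This is a minimal NumPy sketch, not SmilesFormer's actual implementation; the shapes simply reuse the latent dimension of 288 stated in the appendix, and all function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, eps ~ N(0, I): the VAE reparameterization trick,
    which keeps sampling differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 288))            # batch of 4, latent dim 288 (appendix value)
log_var = rng.standard_normal((4, 288)) * 0.1
z = reparameterize(mu, log_var, rng)          # latent codes fed to the decoder
kl = kl_divergence(mu, log_var)               # regularizer added to the reconstruction loss
```

In training, the decoder's reconstruction loss plus this KL term is what backpropagates through the encoder, which is what shapes the latent space that the optimization slides later traverse.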

Slide 6

Slide 6 text

● Molecules are broken down into fragments using simple strategies.
● Model inputs can be (i) full sequences, (ii) fragments, (iii) fragment compositions, and (iv) dummy fragments ("*").
● Fragments can be 'composed' using the [COMP] token.
6 Proposed Method: Model Inputs
RECAP[+] fragmentation | Fragment-on-bonds | Full sequence transformation
Fragment composition: [*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1
[+] X. Q. Lewell, D. B. Judd, S. P. Watson, and M. M. Hann, "RECAP: Retrosynthetic Combinatorial Analysis Procedure. A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry," J. Chem. Inf. Comput. Sci., vol. 38, no. 3, pp. 511–522, 1998.
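The [COMP]-joined input format shown above can be sketched in a few lines. This only illustrates how fragments are strung together into one input sequence; the actual fragmentation (RECAP, fragment-on-bonds) would be done with a cheminformatics toolkit such as RDKit and is not reimplemented here.

```python
COMP = "[COMP]"  # composition token from the slide's input format

def compose(fragments):
    """Join attachment-point-marked fragments ([*]...) into one composition string."""
    return COMP.join(fragments)

def decompose(composition):
    """Split a composition string back into its constituent fragments."""
    return composition.split(COMP)

# The three fragments from the slide's example composition
frags = ["[*]c1ccc(S(N)(=O)=O)cc1", "[*]c1cc(C(F)(F)F)nn1[*]", "[*]c1ccc(C)cc1"]
comp = compose(frags)
```

`compose(frags)` reproduces exactly the composition string shown on the slide, and `decompose` recovers the fragments, so the format is trivially invertible at the string level.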

Slide 7

Slide 7 text

7 Proposed Method: Molecule Optimization
● Molecules with desired properties can be generated by exploring the latent space through a gradient-based strategy that guides the latent variables toward the region of chemical space satisfying the desired objectives (latent space revision).
Multi-objective targets: PLogP > 1.0, QED > 0.8, SA > 8.0, ...
(Diagram: the encoder output V_t-1 is revised to V_t by α ᐧ PropNN[+] gradients over optimization steps 0–6; a history of good latent variables allows backtracking from V_t to V_t-h; the decoder maps V_t and query tokens (e.g. "CC1=") to proposed samples.)
[+] J. Mueller, D. Gifford, and T. Jaakkola, "Sequence to Better Sequence: Continuous Revision of Combinatorial Structures," in Proceedings of the 34th International Conference on Machine Learning, Aug. 2017, vol. 70, pp. 2536–2544. [Online]. Available: https://proceedings.mlr.press/v70/mueller17a.html
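The revision loop on this slide — gradient steps scaled by α, a history of good latents, and backtracking — can be sketched as follows. This is a toy NumPy version under stated assumptions: `score_fn` stands in for the PropNN property predictor, the backtracking/step-shrinking policy is a plausible reading of the diagram rather than the published algorithm, and the quadratic objective is purely illustrative.

```python
import numpy as np

def optimize_latent(z0, score_fn, grad_fn, alpha=0.1, steps=50):
    """Gradient-ascent revision of a latent vector with a history of good
    latents; when a step lowers the score, backtrack to the last good latent
    and shrink the step size (sketch of the 'latent space revision' loop)."""
    history = [z0]                       # history of good latent variables
    z, best = z0, score_fn(z0)
    for _ in range(steps):
        z_new = z + alpha * grad_fn(z)   # move toward higher predicted property
        s = score_fn(z_new)
        if s > best:
            best, z = s, z_new
            history.append(z_new)        # remember the improved latent
        else:
            z = history[-1]              # backtrack (V_t -> V_t-h on the slide)
            alpha *= 0.5                 # and take more cautious steps
    return z, history

# Toy differentiable "property": maximized at z = 1 in every dimension
score = lambda z: -np.sum((z - 1.0) ** 2)
grad = lambda z: -2.0 * (z - 1.0)
z_opt, hist = optimize_latent(np.zeros(8), score, grad)
```

After the loop, `z_opt` would be handed to the decoder (together with any query tokens) to propose concrete molecules.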

Slide 8

Slide 8 text

8 Experiments
• Distribution-directed benchmarks: GuacaMol, MOSES
• Goal-directed benchmarks (multi-objective optimization):
• JNK3 inhibition: inhibition of c-Jun N-terminal kinase 3 (JNK3), a member of the mitogen-activated protein kinase family that is responsive to stress stimuli such as cytokines, ultraviolet irradiation, heat shock, and osmotic shock.
• GSK3β inhibition: an enzyme associated with increased susceptibility to bipolar disorder.
• Donepezil rediscovery: a real use-case demonstration of a drug-rediscovery pipeline for donepezil, a known acetylcholinesterase (AChE) inhibitor.
• Backbone-constrained molecule optimization: useful in the lead-optimization stage.
• Experiments with model parameters: model stochasticity, fragmentation strategies, full-sequence suggestions, start-token constraints, fragment analysis.
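A multi-objective task such as QED+SA+JNK3+GSK3β needs the per-property scores collapsed into one number. The source does not state SmilesFormer's scalarization, so this is a hedged sketch using a geometric mean (a common choice in GuacaMol-style scoring, where one bad objective drags the whole score down); the threshold values and property names are made up for illustration.

```python
def multi_objective_score(props, objectives):
    """Combine per-property scores in [0, 1] into one value via their geometric
    mean, so a molecule must do reasonably well on *every* objective.
    `objectives` maps a property name to a scoring function."""
    scores = [fn(props[name]) for name, fn in objectives.items()]
    prod = 1.0
    for s in scores:
        prod *= s
    return prod ** (1.0 / len(scores))

# Hypothetical thresholds in the spirit of the QED/JNK3/GSK3β task
objectives = {
    "qed":   lambda v: min(v / 0.8, 1.0),  # saturate once QED >= 0.8
    "jnk3":  lambda v: min(v / 0.5, 1.0),  # predicted JNK3 activity
    "gsk3b": lambda v: min(v / 0.5, 1.0),  # predicted GSK3β activity
}
score = multi_objective_score({"qed": 0.9, "jnk3": 0.6, "gsk3b": 0.4}, objectives)
```

Here only GSK3β misses its threshold (0.4/0.5 = 0.8), so the combined score is 0.8^(1/3) ≈ 0.93 rather than 1.0.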

Slide 9

Slide 9 text

9 Results: Distribution Benchmarks

GuacaMol[+] distribution benchmarks:

| Metric | AAE | Graph MCTS | Random Sampler | SMILES LSTM | VAE | ORGAN | SmilesFormer |
|---|---|---|---|---|---|---|---|
| Validity | 0.822 | 1 | 1 | 0.959 | 0.87 | 0.379 | 1 |
| Uniqueness | 1 | 1 | 0.997 | 1 | 0.999 | 0.841 | 1 |
| Novelty | 0.998 | 0.994 | 0 | 0.912 | 0.974 | 0.686 | 0.9958 |
| KL divergence | 0.886 | 0.522 | 0.998 | 0.991 | 0.982 | 0.267 | 0.8722 |
| Frechet ChemNet Distance | 0.529 | 0.015 | 0.929 | 0.913 | 0.863 | 0 | 0.1537 |

MOSES[^] benchmarks (FCD, SNN, Frag, and Scaf are reported against the Test and TestSF sets):

| Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | FCD Test (↓) | FCD TestSF (↓) | SNN Test (↑) | SNN TestSF (↑) | Frag Test (↑) | Frag TestSF (↑) | Scaf Test (↑) | Scaf TestSF (↑) | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 1 | 1 | 1 | 0.008 | 0.4755 | 0.6419 | 0.5859 | 1 | 0.9986 | 0.9907 | 0 | 0.8567 | 0.8508 | 1 | 1 |
| HMM | 0.076±0.0322 | 0.623±0.1224 | 0.5671±0.1424 | 24.4661±2.5251 | 25.4312±2.5599 | 0.3876±0.0107 | 0.3795±0.0107 | 0.5754±0.1224 | 0.5681±0.1218 | 0.2065±0.0481 | 0.049±0.018 | 0.8466±0.0403 | 0.8104±0.0507 | 0.9024±0.0489 | 0.9994±0.001 |
| CharRNN | 0.9748±0.0264 | 1.0±0.0 | 0.9994±0.0003 | 0.0732±0.0247 | 0.5204±0.0379 | 0.6015±0.0206 | 0.5649±0.0142 | 0.9998±0.0002 | 0.9983±0.0003 | 0.9242±0.0058 | 0.1101±0.0081 | 0.8562±0.0005 | 0.8503±0.0005 | 0.9943±0.0034 | 0.8419±0.0509 |
| AAE | 0.9368±0.0341 | 1.0±0.0 | 0.9973±0.002 | 0.5555±0.2033 | 1.0572±0.2375 | 0.6081±0.0043 | 0.5677±0.0045 | 0.991±0.0051 | 0.9905±0.0039 | 0.9022±0.0375 | 0.0789±0.009 | 0.8557±0.0031 | 0.8499±0.003 | 0.996±0.0006 | 0.7931±0.0285 |
| VAE | 0.9767±0.0012 | 1.0±0.0 | 0.9984±0.0005 | 0.099±0.0125 | 0.567±0.0338 | 0.6257±0.0005 | 0.5783±0.0008 | 0.9994±0.0001 | 0.9984±0.0003 | 0.9386±0.0021 | 0.0588±0.0095 | 0.8558±0.0004 | 0.8498±0.0004 | 0.997±0.0002 | 0.6949±0.0069 |
| JTN-VAE | 1.0±0.0 | 1.0±0.0 | 0.9996±0.0003 | 0.3954±0.0234 | 0.9382±0.0531 | 0.5477±0.0076 | 0.5194±0.007 | 0.9965±0.0003 | 0.9947±0.0002 | 0.8964±0.0039 | 0.1009±0.0105 | 0.8551±0.0034 | 0.8493±0.0035 | 0.976±0.0016 | 0.9143±0.0058 |
| LatentGAN | 0.8966±0.0029 | 1.0±0.0 | 0.9968±0.0002 | 0.2968±0.0087 | 0.8281±0.0117 | 0.5371±0.0004 | 0.5132±0.0002 | 0.9986±0.0004 | 0.9972±0.0007 | 0.8867±0.0009 | 0.1072±0.0098 | 0.8565±0.0007 | 0.8505±0.0006 | 0.9735±0.0006 | 0.9498±0.0006 |
| SmilesFormer | 1.0±0.0 | 1.0±0.0 | 1.0±0.0 | 15.665±0.04 | 16.467±0.001 | 0.4025±0.003 | 0.3903±0.005 | 0.8373±0.2 | 0.8583±0.0002 | 0.1438±0.004 | 0.06336±0.01 | 0.9144±0.0 | 0.9020±0.0 | 0.4947±0.003 | 0.99994±0.00001 |

[+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, "GuacaMol: Benchmarking Models for de Novo Molecular Design," J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi: 10.1021/acs.jcim.8b00839.
[^] D. Polykovskiy et al., "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models," arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available: http://arxiv.org/abs/1811.12823
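The headline metrics in these tables have simple set-based definitions. Below is a sketch of validity, uniqueness, and novelty in the spirit of the GuacaMol/MOSES benchmarks; the `is_valid` callable stands in for a real parseability check (e.g. RDKit sanitization), and the toy strings are made up.

```python
def distribution_metrics(generated, training_set, is_valid):
    """Validity = fraction of generated strings that parse; uniqueness =
    fraction of valid strings that are distinct; novelty = fraction of
    distinct valid strings not seen in training."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy run: "X" marks the one invalid string in this sketch
v, u, n = distribution_metrics(["CCO", "CCO", "CCN", "X"], ["CCO"], lambda s: s != "X")
```

This makes the trade-offs in the tables concrete: a model can reach validity 1.0 while scoring poorly on distribution-matching metrics like FCD, since those compare learned feature statistics rather than set membership.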

Slide 10

Slide 10 text

10 Results: Goal-directed Benchmarks (Multi-objective Optimization)
Single-objective optimization | Multi-objective optimization | Oracle efficiency during molecule optimization
Top molecules for QED+SA+JNK3+GSK3β; the scores shown are SA, QED, JNK3, and GSK3β, respectively.
[+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, "GuacaMol: Benchmarking Models for de Novo Molecular Design," J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi: 10.1021/acs.jcim.8b00839.
[^] D. Polykovskiy et al., "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models," arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available: http://arxiv.org/abs/1811.12823

Slide 11

Slide 11 text

11 Results: Donepezil Rediscovery
Stage 1: Model pretraining
● Training on a large-scale dataset; objective: learn the SMILES vocabulary of the possible chemical space.
● Dataset: ChEMBL (2.1M mols); post-donepezil-discovery AChE-active molecules were excluded.
Stage 2: Model fine-tuning on a selected dataset
● Dataset curation for molecules active on AChE.
● Exclusion of the donepezil scaffold and related molecules.
Molecule generation and optimization
● Generate molecules and optimize with multi-property objectives: (1) SA < 1, target = 0; (2) QED > 0, target = 1; (3) PLogP > 0, target = 40; (4) molecular weight between 250 and 750; (5) number of hydrogen bonds between 5 and 10; (6) topological polar surface area (TPSA) between 75 and 150; (7) number of rotatable bonds between 5 and 10; (8) similarity to tacrine Ci < 1, target = 0; (9) similarity to physostigmine < 1, target = 0; (10) similarity to rivastigmine Ci < 1, target = 0; and (11) AChE inhibitory activity pIC50 (negative log IC50), determined by a predictive model trained on AChE-inhibitor-active compounds, Ci > 1, target = 1.
Molecule post-processing and analysis
● Novelty ranking ● Activity ranking ● Scaffold analysis
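Several of the objectives above are range constraints (MW 250-750, TPSA 75-150, rotatable bonds 5-10). One simple way to turn such a range into a [0, 1] score is sketched below; the linear-decay shaping and the sample property values are assumptions for illustration — the slide only fixes the target ranges, not the scoring function.

```python
def in_range_score(value, low, high):
    """Score 1.0 inside [low, high], decaying linearly to 0.0 outside
    (hypothetical shaping; distance is normalized by the range width)."""
    if low <= value <= high:
        return 1.0
    dist = (low - value) if value < low else (value - high)
    return max(0.0, 1.0 - dist / (high - low))

# Ranges from the donepezil objectives; the candidate's property values are made up.
props = {"mw": 380.0, "tpsa": 60.0, "rot_bonds": 7}
score = (in_range_score(props["mw"], 250, 750)
         * in_range_score(props["tpsa"], 75, 150)
         * in_range_score(props["rot_bonds"], 5, 10))
```

In this toy candidate, MW and rotatable bonds are in range (score 1.0 each) while TPSA falls 15 units below its window, so the combined score drops to 0.8 and the optimizer is pulled back toward the allowed region.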

Slide 12

Slide 12 text

12 Results: Donepezil Rediscovery
Top 10 scaffolds from the generated molecules. The numbers signify how many times each scaffold appeared across the 10 experiment runs.
(Figure: donepezil, the donepezil scaffold, and the identified molecule — compound 14 from the original donepezil paper[+].)
[+] H. Sugimoto, H. Ogura, Y. Arai, Y. Iimura, and Y. Yamanishi, "Research and Development of Donepezil Hydrochloride, a New Type of Acetylcholinesterase Inhibitor," Japanese Journal of Pharmacology, vol. 89, no. 1, pp. 7–20, 2002, doi: 10.1254/jjp.89.7.

Slide 13

Slide 13 text

13 Results: Backbone-constrained molecule generation
Our model learns to generate molecules with constrained backbones. The molecule in the green box is the input backbone; the generated SMILES string and its Tanimoto similarity score to the input backbone are shown.

Slide 14

Slide 14 text

14 Results: Backbone-constrained molecule generation
Our model learns to generate molecules with constrained backbones. The molecule in the green box is the input backbone; the generated SMILES string and its Tanimoto similarity score to the input backbone are shown.
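The Tanimoto similarity reported on these slides has a compact set-based definition: for two fingerprints viewed as sets of on-bits A and B, it is |A ∩ B| / |A ∪ B|. A minimal sketch (real fingerprints would come from a toolkit such as RDKit's Morgan fingerprints; the bit sets here are made up):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit
    indices: |A ∩ B| / |A ∪ B|. Two empty fingerprints count as identical."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy on-bit sets standing in for a generated molecule and the input backbone
fp_mol = {1, 5, 9, 12}
fp_backbone = {1, 5, 9}
sim = tanimoto(fp_mol, fp_backbone)  # 3 shared bits out of 4 total -> 0.75
```

A score near 1.0 means the generated molecule largely preserves the backbone's substructural bits, which is exactly what the constrained-generation slides are demonstrating.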

Slide 15

Slide 15 text

15 Results: Experiments with model parameters
Molecule generation without fragmentation | Molecule generation with fragmentation

Slide 16

Slide 16 text

16 Conclusions
● We proposed and demonstrated a language model that learns a SMILES vocabulary in order to generate novel compounds.
● The trained model can be used for molecular optimization tasks with customizable parameters.
● We demonstrated the application of our model in various de novo molecular design tasks, including a real use case of donepezil rediscovery.
Future directions:
● Property-aware molecule generation
● Matched molecular pair generation

Slide 17

Slide 17 text

www.elix-inc.com

Slide 18

Slide 18 text

18 Appendix

Slide 19

Slide 19 text

19 Input Tokenization
• SMILES strings are tokenized based on a predefined dictionary.[+]
• A dictionary of 591 tokens is used.
• Tokens are characters or groups of characters that form a single unit.

| Token | Dict ID | Note |
|---|---|---|
| C | 16 | |
| [C@H] | 33 | |
| [N+] | 41 | |
| [SiH2] | 127 | |
| [PAD] | 0 | special, for padding |
| [COMP] | 6 | special, for composition |
| [BOC] | 9 | special, for beginning of compound |
| [EOC] | 10 | special, for end of compound |
| [UNK] | 11 | special, for unknown tokens |

[+] P. Schwaller et al., "Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction," ACS Cent. Sci., vol. 5, no. 9, pp. 1572–1583, Sep. 2019, doi: 10.1021/acscentsci.9b00576.
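A regex tokenizer in the style of the cited Molecular Transformer work keeps bracket atoms like [C@H] and two-letter elements like Cl as single tokens. This is a sketch of that tokenization step, not the exact 591-token SmilesFormer dictionary; special tokens ([PAD], [COMP], ...) would be added around the output.

```python
import re

# Atom- and bond-level SMILES tokens; bracket expressions match first so
# [C@H] stays one unit, and Br/Cl beat single-letter B/C.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into tokens; raises if any character is unmatched."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens
```

Each token would then be mapped to its dictionary ID (e.g. C → 16, [C@H] → 33 in the table above) before being fed to the model.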

Slide 20

Slide 20 text

20 Model Training and Evaluation Details
Model parameters: 12 layers, 12 heads, latent-space embedding dimension 288, dropout 0.15
Training dataset constraints: max molecular weight 900; max token length 100

Dataset collection:
| Name | Size | Description |
|---|---|---|
| Small Datasets | 872,557 | Small datasets from predictive-model benchmarks |
| ChEMBL 3.0 | 3,429,071 | Elix Small + Guacamol (train) + Molsets (train) |
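The stated hyperparameters can be gathered into a single configuration with a basic consistency check. The parameter names here are hypothetical (the slide gives only the values); the check that the latent dimension divides evenly across attention heads is a standard Transformer requirement.

```python
# Hypothetical parameter names; values are the ones stated on the slide.
CONFIG = {
    "n_layers": 12,
    "n_heads": 12,
    "latent_dim": 288,
    "dropout": 0.15,
    "max_mol_weight": 900,    # training dataset constraint
    "max_token_length": 100,  # training dataset constraint
}

def validate_config(cfg):
    """Sanity-check the hyperparameters and return the per-head dimension."""
    assert cfg["latent_dim"] % cfg["n_heads"] == 0, "head dim must divide evenly"
    assert 0.0 <= cfg["dropout"] < 1.0
    return cfg["latent_dim"] // cfg["n_heads"]

head_dim = validate_config(CONFIG)  # 288 / 12 heads -> 24 dims per head
```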

Slide 21

Slide 21 text

21 Results: Experiments with model parameters