
SmilesFormer: Language Model for Molecular Design, Elix, CBI 2022

Elix
October 27, 2022


  1. SmilesFormer: Language Model for
    Molecular Design
    Elix, Inc.
    Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo Japan | October 26, 2022
    Joshua Owoyemi, Ph.D & Nazim Medzhidov, Ph.D


  2. 2
    Introduction
● Generative models play a major role in discovering and designing new molecules, a key step in in silico drug discovery.
    ● A vast amount of molecular data is available, so generative models should be able to learn the concepts of valid and desirable molecules using a data-centric approach.
    Challenges:
    ● A model that can take advantage of the vast available datasets to learn efficient molecular representations.
    ● Methods to efficiently traverse the possible chemical space and generate molecules that satisfy multiple desired objectives.
    ● A demonstration of how the proposed method can be used in a practical use case.


  3. ● Language models can fill in an incomplete sentence.
    ● Recent language models scale to vast amounts of data.
    3
    Proposed Method: Language model for molecule generation
    The quick brown fox ___ __ the lazy dog
    Language
    Model
    The quick brown fox walks around the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox bumps into the lazy dog
    *
    Common Crawl Dataset contains nearly 1 trillion words[+]
    [+] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems,
    2020, vol. 33, pp. 1877–1901. [Online]. Available:
    https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf


  4. ● Language models can fill in an incomplete sentence.
    ● Language models scale to vast amounts of data.
    4
    Proposed Method: Language model for molecule generation
    The sentence-completion example on the previous slide carries over directly to molecules: a language model trained on SMILES strings can complete partial molecules in the same way.
    *
    ZINC Dataset contains over 200 million compounds[^]
    [^] J. J. Irwin et al., “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery,” Journal of Chemical
    Information and Modeling, vol. 60, no. 12, pp. 6065–6073, 2020, doi: 10.1021/acs.jcim.0c00675.


  5. ● We formulate the problem as a language modeling problem, where the model generates a full
    sequence (sentence) from incomplete ones, or from some prior knowledge.
    ● We employ a state-of-the-art language model to encode a molecular latent space by generating valid
    molecules (sequences) from fragments and fragment compositions.
    5
    Proposed Method: SmilesFormer Model
    [Figure: encoder (E) / decoder (D) architecture, a Transformer[1] combined with a VAE[2]. Inputs and a query sequence are encoded, the molecule is reconstructed, and the reconstruction loss is backpropagated.]
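The VAE half of the architecture samples latent variables via the standard reparameterization trick. A minimal sketch (function and variable names are illustrative, not from the SmilesFormer code):

```python
import math
import random

def reparameterize(mu, logvar, eps=None):
    """VAE reparameterization: z = mu + sigma * eps, with sigma = exp(0.5 * logvar).

    mu, logvar: per-dimension mean and log-variance of the latent posterior.
    eps: optional noise vector; drawn from N(0, 1) when omitted.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

# With zero noise the sample collapses to the posterior mean.
z = reparameterize([0.5, -1.0], [0.0, 0.0], eps=[0.0, 0.0])
```

Sampling through this deterministic transform is what lets the reconstruction loss backpropagate into the encoder.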


  6. ● Molecules are broken down into fragments using simple strategies.
    ● Model inputs can be (i) full sequences, (ii) fragments, (iii) fragment compositions, and (iv) dummy
    fragments: “*”.
    ● Fragments can be ‘composed’ using the [COMP] token.
    6
    Proposed Method: Model Inputs
    Fragmentation strategies: RECAP[+] fragmentation, fragment-on-bonds, and full-sequence transformation.
    [*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1 (example fragment composition)
    [+] X. Q. Lewell, D. B. Judd, S. P. Watson, and M. M. Hann, “RECAP - Retrosynthetic Combinatorial Analysis Procedure: A
    Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial
    Chemistry,” J. Chem. Inf. Comput. Sci., vol. 38, no. 3, pp. 511–522, 1998.
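The [COMP] token acts as a separator between fragments in an input string. A minimal parsing sketch over the composition shown above (splitting only; actual attachment-point chemistry at the “[*]” sites would need a cheminformatics toolkit such as RDKit, which is not shown here):

```python
COMP = "[COMP]"

def split_composition(s):
    """Split a fragment-composition input string on the [COMP] separator token."""
    return s.split(COMP)

# Fragment composition taken verbatim from the slide; "[*]" marks attachment points.
comp = "[*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1"
frags = split_composition(comp)
```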


  7. 7
    Proposed Method: Molecule Optimization
    ● Molecules with desired properties can then be generated by exploring the latent space through a
    gradient-based strategy that guides the latent variables toward the region of chemical space satisfying the desired
    objectives.
    [Figure: latent space revision. A property network (PropNN[+]) scores the encoder output against multi-objective targets (e.g. PLogP > 1.0, QED > 0.8, SA > 8.0); its gradients move the latent variable from V_t-1 to V_t over successive optimization steps, with backtracking to a history of good latent variables (V_t-h) when a step degrades the score. The decoder then proposes samples from the query tokens (e.g. CC1=).]
    [+] J. Mueller, D. Gifford, and T. Jaakkola, “Sequence to Better Sequence: Continuous Revision of Combinatorial
    Structures,” in Proceedings of the 34th International Conference on Machine Learning, Aug. 2017, vol. 70, pp.
    2536–2544. [Online]. Available: https://proceedings.mlr.press/v70/mueller17a.html
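The revision loop can be sketched in one dimension with a toy property function standing in for the trained PropNN, and a step-acceptance rule standing in for the backtracking to previously good latent variables (all names and the learning rate here are illustrative, not from the actual implementation):

```python
def prop_score(v):
    # Toy stand-in for PropNN: a smooth property score peaking at v = 3.0.
    return -(v - 3.0) ** 2

def grad(v):
    # Analytic gradient of the toy score with respect to the latent variable.
    return -2.0 * (v - 3.0)

def optimize_latent(v, lr=0.4, steps=20):
    history = [v]                       # history of good latent variables
    best = prop_score(v)
    for _ in range(steps):
        v_new = v + lr * grad(v)        # gradient ascent on the property score
        if prop_score(v_new) > best:    # accept only improving steps
            best = prop_score(v_new)
            history.append(v_new)
            v = v_new
        else:
            v = history[-1]             # backtrack to the last good latent
    return v

v_final = optimize_latent(0.0)
```

In the real model the latent variable is a vector, the gradient comes from backpropagating through the property network, and the decoder turns each revised latent into candidate molecules.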


  8. 8
    Experiments
    • Distribution-directed Benchmarks:
    • Guacamol, MOSES
    • Goal-directed Benchmarks (Multi-objective optimization)
    • JNK3 Inhibition: Inhibition against c-Jun N-terminal Kinase-3 (JNK3), a member of the mitogen-activated
    protein kinase family that is responsive to stress stimuli such as cytokines, ultraviolet irradiation, heat
    shock, and osmotic shock.
    • GSK3β Inhibition: An enzyme associated with increased susceptibility to bipolar disorder.
    • Donepezil Rediscovery: Real use-case demonstration of drug rediscovery pipeline for Donepezil, a known
    acetylcholinesterase (AChE) inhibitor.
    • Backbone-constrained molecule optimization: Useful in lead optimization stages
    • Experiments with model parameters:
    • Model stochasticity
    • Fragmentation strategies
    • Full sequence suggestions
    • Start tokens constraints
    • Fragments analysis


  9. 9
    Results: Distribution Benchmarks
    Guacamol distribution benchmarks:
    | Metric | AAE | Graph MCTS | Random Sampler | SMILES LSTM | VAE | ORGAN | SmilesFormer |
    | Validity | 0.822 | 1 | 1 | 0.959 | 0.87 | 0.379 | 1 |
    | Uniqueness | 1 | 1 | 0.997 | 1 | 0.999 | 0.841 | 1 |
    | Novelty | 0.998 | 0.994 | 0 | 0.912 | 0.974 | 0.686 | 0.9958 |
    | KL divergence | 0.886 | 0.522 | 0.998 | 0.991 | 0.982 | 0.267 | 0.8722 |
    | Fréchet ChemNet Distance | 0.529 | 0.015 | 0.929 | 0.913 | 0.863 | 0 | 0.1537 |
    MOSES benchmarks:
    | Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | FCD (↓) Test / TestSF | SNN (↑) Test / TestSF | Frag (↑) Test / TestSF | Scaf (↑) Test / TestSF | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
    | Train | 1 | 1 | 1 | 0.008 / 0.4755 | 0.6419 / 0.5859 | 1 / 0.9986 | 0.9907 / 0 | 0.8567 | 0.8508 | 1 | 1 |
    | HMM | 0.076±0.0322 | 0.623±0.1224 | 0.5671±0.1424 | 24.4661±2.5251 / 25.4312±2.5599 | 0.3876±0.0107 / 0.3795±0.0107 | 0.5754±0.1224 / 0.5681±0.1218 | 0.2065±0.0481 / 0.049±0.018 | 0.8466±0.0403 | 0.8104±0.0507 | 0.9024±0.0489 | 0.9994±0.001 |
    | CharRNN | 0.9748±0.0264 | 1.0±0.0 | 0.9994±0.0003 | 0.0732±0.0247 / 0.5204±0.0379 | 0.6015±0.0206 / 0.5649±0.0142 | 0.9998±0.0002 / 0.9983±0.0003 | 0.9242±0.0058 / 0.1101±0.0081 | 0.8562±0.0005 | 0.8503±0.0005 | 0.9943±0.0034 | 0.8419±0.0509 |
    | AAE | 0.9368±0.0341 | 1.0±0.0 | 0.9973±0.002 | 0.5555±0.2033 / 1.0572±0.2375 | 0.6081±0.0043 / 0.5677±0.0045 | 0.991±0.0051 / 0.9905±0.0039 | 0.9022±0.0375 / 0.0789±0.009 | 0.8557±0.0031 | 0.8499±0.003 | 0.996±0.0006 | 0.7931±0.0285 |
    | VAE | 0.9767±0.0012 | 1.0±0.0 | 0.9984±0.0005 | 0.099±0.0125 / 0.567±0.0338 | 0.6257±0.0005 / 0.5783±0.0008 | 0.9994±0.0001 / 0.9984±0.0003 | 0.9386±0.0021 / 0.0588±0.0095 | 0.8558±0.0004 | 0.8498±0.0004 | 0.997±0.0002 | 0.6949±0.0069 |
    | JTN-VAE | 1.0±0.0 | 1.0±0.0 | 0.9996±0.0003 | 0.3954±0.0234 / 0.9382±0.0531 | 0.5477±0.0076 / 0.5194±0.007 | 0.9965±0.0003 / 0.9947±0.0002 | 0.8964±0.0039 / 0.1009±0.0105 | 0.8551±0.0034 | 0.8493±0.0035 | 0.976±0.0016 | 0.9143±0.0058 |
    | LatentGAN | 0.8966±0.0029 | 1.0±0.0 | 0.9968±0.0002 | 0.2968±0.0087 / 0.8281±0.0117 | 0.5371±0.0004 / 0.5132±0.0002 | 0.9986±0.0004 / 0.9972±0.0007 | 0.8867±0.0009 / 0.1072±0.0098 | 0.8565±0.0007 | 0.8505±0.0006 | 0.9735±0.0006 | 0.9498±0.0006 |
    | SmilesFormer | 1.0±0.0 | 1.0±0.0 | 1.0±0.0 | 15.665±0.04 / 16.467±0.001 | 0.4025±0.003 / 0.3903±0.005 | 0.8373±0.2 / 0.8583±0.0002 | 0.1438±0.004 / 0.06336±0.01 | 0.9144±0.0 | 0.9020±0.0 | 0.4947±0.003 | 0.99994±0.00001 |
    • Guacamol[+]
    • MOSES[^]
    [+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo
    Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi:
    10.1021/acs.jcim.8b00839.
    [^] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation
    Models,” arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available:
    http://arxiv.org/abs/1811.12823
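The distribution metrics in the tables above reduce to simple set ratios over the generated SMILES. A minimal sketch (the `is_valid` checker here is a hypothetical stand-in; the actual benchmarks validate molecules by parsing them with a chemistry toolkit such as RDKit):

```python
def distribution_metrics(generated, train_set, is_valid):
    """Validity, uniqueness and novelty as ratios over a list of generated SMILES."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

# Toy checker for illustration only; not a real SMILES validator.
metrics = distribution_metrics(
    generated=["CCO", "CCO", "CCN", "xx"],
    train_set=["CCO"],
    is_valid=lambda s: s.isalnum() and s[0].isupper(),
)
```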


  10. 10
    Results: Goal-directed Benchmarks (Multi-objective optimization)
    [+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo
    Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi:
    10.1021/acs.jcim.8b00839.
    [^] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation
    Models,” arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available:
    http://arxiv.org/abs/1811.12823
    Single Objective Optimization | Multi Objective Optimization
    Oracle efficiency during molecule optimization.
    [Figure: top molecules for QED+SA+JNK3+GSK3β; the scores shown are SA, QED, JNK3, and GSK3β, respectively.]
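Goal-directed benchmarks of this kind combine several per-property scores, each normalized to [0, 1], into one objective. A common aggregation, and the one assumed in this sketch, is the geometric mean (as used in the GuacaMol goal-directed suite), so a single failed objective pulls the whole score toward zero:

```python
import math

def multi_objective_score(scores):
    """Geometric mean of per-objective scores, each assumed to lie in [0, 1]."""
    assert all(0.0 <= s <= 1.0 for s in scores)
    return math.prod(scores) ** (1.0 / len(scores))

# Illustrative per-objective scores for one candidate: QED, SA, JNK3, GSK3B.
combined = multi_objective_score([0.8, 0.9, 0.5, 0.5])
```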


  11. 11
    Results: Donepezil Rediscovery
    ● Model pretraining
    ○ Training on a large-scale dataset
    ○ Objective: learn the SMILES vocabulary of the possible chemical space
    ○ Dataset: ChEMBL (2.1M mols); post-donepezil-discovery AChE-active molecules were excluded
    ● Model fine-tuning on a selected dataset
    ○ Dataset curated for molecules active on AChE
    ○ Exclusion of the donepezil scaffold and related molecules
    ● Molecule generation and optimization: generate molecules and optimize them against multi-property objectives:
    (1) SA < 1, target = 0; (2) QED > 0, target = 1; (3) PLogP > 0, target = 40;
    (4) molecular weight between 250 and 750; (5) number of hydrogen bonds between 5 and 10;
    (6) topological polar surface area (TPSA) between 75 and 150; (7) number of rotatable bonds between 5 and 10;
    (8) similarity to tacrine Ci < 1, target = 0; (9) similarity to physostigmine < 1, target = 0;
    (10) similarity to rivastigmine Ci < 1, target = 0; and (11) AChE inhibitor activity pIC50
    (negative log IC50), determined by a predictive model trained on AChE-inhibitor-active compounds, Ci > 1, target = 1
    ● Molecule post-processing and analysis
    ○ Novelty ranking
    ○ Activity ranking
    ○ Scaffold analysis
    [Figure: four-step pipeline diagram, grouped into Stage 1 and Stage 2.]
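Several of the stage-2 objectives above are range constraints (e.g. molecular weight between 250 and 750). A minimal filtering sketch; the property names, the helper functions, and the example values (roughly donepezil-like) are illustrative, not taken from the actual pipeline:

```python
def in_range(value, lo, hi):
    return lo <= value <= hi

def passes_constraints(props, constraints):
    """props: property name -> value; constraints: property name -> (lo, hi)."""
    return all(in_range(props[name], lo, hi) for name, (lo, hi) in constraints.items())

constraints = {
    "mol_weight": (250, 750),       # molecular weight between 250 and 750
    "h_bonds": (5, 10),             # number of hydrogen bonds
    "tpsa": (75, 150),              # topological polar surface area
    "rotatable_bonds": (5, 10),     # number of rotatable bonds
}
ok = passes_constraints(
    {"mol_weight": 379.5, "h_bonds": 6, "tpsa": 90.1, "rotatable_bonds": 6},
    constraints,
)
```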


  12. 12
    Results: Donepezil Rediscovery
    Top 10 scaffolds from the generated molecules. The numbers signify how many times each scaffold appears out of 10 experiment runs.
    [Figure: Donepezil | Donepezil scaffold | Identified molecule (Compound 14 from the original donepezil paper[+])]
    [+] H. Sugimoto, H. Ogura, Y. Arai, Y. Iimura, and Y. Yamanishi, “Research and Development of Donepezil
    Hydrochloride, a New Type of Acetylcholinesterase Inhibitor,” Japanese Journal of Pharmacology, vol. 89, no.
    1, pp. 7–20, 2002, doi: 10.1254/jjp.89.7.
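Scaffold-frequency counts like the “appears out of 10 runs” numbers above can be produced with a counter over per-run scaffold sets (the scaffold extraction itself, e.g. Murcko scaffolds via a chemistry toolkit, is assumed to have happened upstream; the SMILES below are illustrative):

```python
from collections import Counter

def scaffold_run_counts(runs):
    """runs: one set of scaffold SMILES per experiment run.

    Returns scaffold -> number of runs in which it appeared at least once.
    """
    counts = Counter()
    for scaffolds in runs:
        counts.update(set(scaffolds))  # count each scaffold once per run
    return counts

# Three toy runs; benzene appears in all of them.
runs = [{"c1ccccc1", "c1ccncc1"}, {"c1ccccc1"}, {"c1ccccc1", "C1CCCCC1"}]
top = scaffold_run_counts(runs).most_common(2)
```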


  13. 13
    Results: Backbone-constrained molecule generation
    Our model is able to learn to generate molecules with constrained backbones.
    The molecule in the green box is the input backbone. The generated SMILES
    string and Tanimoto similarity score to the input backbone are shown.
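The Tanimoto similarity reported alongside each generated molecule is the standard set overlap over fingerprint on-bits, |A ∩ B| / |A ∪ B|. A minimal sketch (fingerprint generation is assumed upstream; the bit sets below are illustrative):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy fingerprints: 3 shared bits out of 5 total bits -> similarity 0.6.
sim = tanimoto({1, 2, 3, 5}, {2, 3, 5, 8})
```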


  14. 14
    Results: Backbone-constrained molecule generation
    (Additional examples; same layout as the previous slide.)


  15. 15
    Results: Experiments with model parameters
    Molecule generation without fragmentation
    Molecule generation with fragmentation


  16. 16
    Conclusions
    ● We proposed and demonstrated a language model that learns the SMILES vocabulary in order to generate novel
    compounds.
    ● The trained model can be used in molecular optimization tasks with customizable parameters.
    ● We demonstrated the application of our model in various de novo molecular design tasks, including a real use
    case of donepezil rediscovery.
    Future Directions:
    ● Property-aware molecule generation
    ● Matched molecular pairs generation


  17. www.elix-inc.com


  18. 18
    Appendix


  19. 19
    Input Tokenization
    • SMILES strings are tokenized based on a predefined dictionary.[+]
    • A dictionary of 591 tokens is used.
    • Tokens are characters or groups of characters that form a single unit.
    | Token | Dict ID | Note |
    | C | 16 | |
    | [[email protected]] | 33 | |
    | [N+] | 41 | |
    | [SiH2] | 127 | |
    | [PAD] | 0 | special, for padding |
    | [COMP] | 6 | special, for composition |
    | [BOC] | 9 | special, for beginning of compound |
    | [EOC] | 10 | special, for end of compound |
    | [UNK] | 11 | special, for unknown tokens |
    [+] P. Schwaller et al., “Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction,” ACS
    Cent. Sci., vol. 5, no. 9, pp. 1572–1583, Sep. 2019, doi: 10.1021/acscentsci.9b00576.
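Dictionary-based SMILES tokenization of this kind is typically implemented with a single regex that keeps bracket atoms, two-letter elements (Br, Cl), and two-digit ring bonds as single tokens. The pattern below is the commonly used one popularized by Schwaller et al.; the actual SmilesFormer dictionary (591 tokens, including the special tokens above) may differ in detail:

```python
import re

# Each alternative is one token class: bracket atoms, Br/Cl, organic-subset
# atoms, bonds/branches, and %nn ring-bond labels.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # Round-trip check: every character must belong to exactly one token.
    assert "".join(tokens) == smiles, "input contains unrecognized characters"
    return tokens

toks = tokenize("C[N+](C)(C)Cc1ccccc1Br")
```

Note how "[N+]" and "Br" each survive as a single token rather than being split character by character.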


  20. 20
    Model Training and Evaluation Details
    Model parameters:
    ● 12 layers
    ● 12 heads
    ● Latent space embedding dimension: 288
    ● 0.15 dropout
    Training dataset constraints:
    ● Max molecular weight 900
    ● Max token length 100
    Datasets collection:
    | Name | Size | Description |
    | Small Datasets | 872,557 | Small datasets from predictive model benchmarks |
    | ChEMBL 3.0 | 3,429,071 | Elix Small + Guacamol (train) + Molsets (train) |


  21. 21
    Results: Experiments with model parameters
