discovering and designing new molecules, which is key to innovation in in-silico drug discovery
• There is a vast amount of molecular data available, so generative models should be able to learn the concepts of valid and desired molecules using a data-centric approach.
Challenges:
• A model that can take advantage of the vast available datasets to learn efficient molecular representations.
• Methods to efficiently traverse the possible chemical space in order to generate molecules that satisfy multiple desired objectives.
• A demonstration of how the proposed method can be utilized in a practical use case.
Proposed Method: Language model for molecule generation
Language models can scale to vast amounts of data.
Language Model (text):
  Input: The quick brown fox ___ __ the lazy dog
  Outputs:
    The quick brown fox walks around the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox bumps into the lazy dog
* Common Crawl Dataset contains nearly 1 trillion words[+]
Language Model (molecules):
* ZINC Dataset contains over 200 million compounds[^]
[+] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[^] J. J. Irwin et al., “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery,” Journal of Chemical Information and Modeling, vol. 60, no. 12, pp. 6065–6073, 2020, doi: 10.1021/acs.jcim.0c00675.
where the model is able to generate a full sequence (sentence) from incomplete ones, or from some prior knowledge.
• Employ a state-of-the-art language model to encode a molecular latent space by generating valid molecules (sequences) from fragments and fragment compositions.
Proposed Method: SmilesFormer Model
[Figure: Transformer[1] + VAE[2] with encoder (E) and decoder (D). Labels: Inputs, *, Query Sequence, Reconstruction, Backprop.]
Proposed Method: Model Inputs
• Model inputs can be (i) full sequences, (ii) fragments, (iii) fragment compositions, and (iv) dummy fragments ("*").
• Fragments can be ‘composed’ using the [COMP] token.
[Figure panels: RECAP[+] Fragmentation; Fragment-on-bonds; Full sequence transformation; Fragment Composition]
Fragment composition example: [*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1
[+] X. Q. Lewell, D. B. Judd, S. P. Watson, and M. M. Hann, “RECAP–Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry,” p. 12.
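The composition format can be illustrated with plain string handling. This is a minimal sketch, not the authors' preprocessing code: the helper names `compose_fragments` and `split_composition` are hypothetical, and the only details taken from the slide are the `[COMP]` separator, the `[*]` attachment points, and the example fragments.

```python
COMP_TOKEN = "[COMP]"  # separator token named on the slide


def compose_fragments(fragments):
    """Join fragment SMILES strings into one composition string (hypothetical helper)."""
    return COMP_TOKEN.join(fragments)


def split_composition(composition):
    """Recover the individual fragments from a composition string (hypothetical helper)."""
    return composition.split(COMP_TOKEN)


# The three fragments from the slide's example; "[*]" marks an attachment point.
fragments = [
    "[*]c1ccc(S(N)(=O)=O)cc1",
    "[*]c1cc(C(F)(F)F)nn1[*]",
    "[*]c1ccc(C)cc1",
]
composition = compose_fragments(fragments)
print(composition)
```

Splitting on the same token returns the original fragments, so the composition is a lossless way to pass several fragments to the model as one sequence.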
Latent space revision
• Molecules with desired properties can then be generated by exploring the latent space through a gradient-based strategy that guides the latent variables toward regions of chemical space that satisfy the desired objectives.
[Figure: latent space revision loop. Labels: Multi-objective (QED > 0.8, SA > 8.0, ...); PropNN[+]; Gradients scaled by 𝛼; latent variables V t-1, V t, V t-h; History of good latent variables; Backtrack; Decoder; Query Tokens ("CC1="); Proposed Samples; Optimization Steps (0 to 6).]
[+] J. Mueller, D. Gifford, and T. Jaakkola, “Sequence to Better Sequence: Continuous Revision of Combinatorial Structures,” in Proceedings of the 34th International Conference on Machine Learning, Aug. 2017, vol. 70, pp. 2536–2544. [Online]. Available: https://proceedings.mlr.press/v70/mueller17a.html
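The revision loop can be sketched on a toy problem. Everything below is an illustrative assumption rather than the SmilesFormer implementation: `toy_property` and `toy_gradient` stand in for a trained property network and its gradients, the step size `alpha`, the noise level, and the history length are arbitrary, and no decoder is involved. The sketch only shows the pattern of gradient steps with a history of good latent variables to backtrack to.

```python
import random


def toy_property(v):
    """Hypothetical stand-in for a property predictor: highest at v = (1, ..., 1)."""
    return -sum((x - 1.0) ** 2 for x in v)


def toy_gradient(v):
    """Analytical gradient of toy_property with respect to the latent vector."""
    return [-2.0 * (x - 1.0) for x in v]


def revise_latent(v0, alpha=0.1, steps=50, history_size=5, noise=0.05, seed=0):
    """Gradient ascent in latent space with backtracking to earlier good latents."""
    rng = random.Random(seed)
    v = list(v0)
    history = [(toy_property(v), list(v))]  # scores in history never decrease
    for _ in range(steps):
        grad = toy_gradient(v)
        # Noisy gradient step: V_t = V_{t-1} + alpha * grad + noise.
        candidate = [x + alpha * g + rng.gauss(0.0, noise) for x, g in zip(v, grad)]
        score = toy_property(candidate)
        if score >= history[-1][0]:
            # Accept the step and remember it as a good latent variable.
            v = candidate
            history.append((score, list(v)))
            history = history[-history_size:]
        else:
            # Backtrack: restart from a randomly chosen earlier good latent.
            _, earlier = rng.choice(history)
            v = list(earlier)
    return max(history, key=lambda item: item[0])


start = [0.0, 0.0, 0.0]
best_score, best_latent = revise_latent(start)
print(toy_property(start), best_score)
```

In a real pipeline the accepted latent variables would be passed to the decoder together with the query tokens to produce candidate molecules, and the score would come from the multi-objective property model.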
Benchmarks (Multi-objective optimization)
• JNK3 Inhibition: Inhibition of c-Jun N-terminal kinase 3 (JNK3), a member of the mitogen-activated protein kinase family that responds to stress stimuli such as cytokines, ultraviolet irradiation, heat shock, and osmotic shock.
• GSK3β Inhibition: Inhibition of glycogen synthase kinase 3 beta (GSK3β), an enzyme associated with an increased susceptibility to bipolar disorder.
• Donepezil Rediscovery: Real use-case demonstration of a drug rediscovery pipeline for Donepezil, a known acetylcholinesterase (AChE) inhibitor.
• Backbone-constrained molecule optimization: Useful in lead optimization stages.
• Experiments with model parameters:
  • Model stochasticity
  • Fragmentation strategies
  • Full sequence suggestions
  • Start token constraints
  • Fragment analysis
Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi: 10.1021/acs.jcim.8b00839.
[^] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models,” arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available: http://arxiv.org/abs/1811.12823
[Result panels: Single Objective Optimization; Multi Objective Optimization; Oracle Efficiency during molecule optimization]
Top molecules for QED+SA+JNK3+GSK3β. The scores are SA, QED, JNK3, and GSK3β respectively.
molecules. The numbers signify how many times the scaffold appears out of 10 experiment runs.
[Figure labels: Donepezil; Donepezil Scaffold; Identified Molecule (Compound 14 from the original Donepezil paper[+])]
[+] H. Sugimoto, H. Ogura, Y. Arai, Y. Iimura, and Y. Yamanishi, “Research and Development of Donepezil Hydrochloride, a New Type of Acetylcholinesterase Inhibitor,” Japanese Journal of Pharmacology, vol. 89, no. 1, pp. 7–20, 2002, doi: 10.1254/jjp.89.7.
learn the generation of molecules with constrained backbones. The molecule in the green box is the input backbone. The generated SMILES string and its Tanimoto similarity score to the input backbone are shown.
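For reference, the Tanimoto similarity between two fingerprints A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it on sets of "on" bit indices. The bit sets are made-up illustrative values, not fingerprints of the molecules in the figure, and how the slides computed fingerprints is not stated here; in practice a cheminformatics toolkit would supply them.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices for an input backbone and a generated molecule.
backbone_bits = {1, 4, 7, 9}
generated_bits = {1, 4, 9, 12, 15}
similarity = tanimoto(backbone_bits, generated_bits)
print(similarity)
```

Here the two sets share 3 bits out of 6 distinct bits, so the similarity is 0.5; a generated molecule identical to the backbone would score 1.0.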
that learns SMILES vocabulary in order to generate novel compounds
• The trained model can be used in molecular optimization tasks using customizable parameters.
• We demonstrated the application of our model in various de novo molecular design tasks, including a real use case of donepezil rediscovery.
Future Directions:
• Property-aware molecule generation
• Matching molecular pairs generation
a predefined dictionary.[+]
• A dictionary of 591 tokens is used.
• Tokens are characters or groups of characters that form a single unit.

Token     Dict ID   Note
C         16
[C@H]     33
[N+]      41
[SiH2]    127
[PAD]     0         special, for padding
[COMP]    6         special, for composition
[BOC]     9         special, for Beginning of Compound
[EOC]     10        special, for End of Compound
[UNK]     11        special, for unknown tokens

[+] P. Schwaller et al., “Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction,” ACS Cent. Sci., vol. 5, no. 9, pp. 1572–1583, Sep. 2019, doi: 10.1021/acscentsci.9b00576.
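A dictionary-based tokenizer of this kind can be sketched as follows. The regular expression is the SMILES tokenization pattern commonly attributed to the Molecular Transformer work cited above. The vocabulary holds only the nine entries listed in the table, so any other token maps to [UNK] in this sketch even though the full 591-token dictionary would cover it. Wrapping sequences in [BOC] and [EOC] and right-padding with [PAD] are assumptions about how the special tokens are used, and `tokenize` and `encode` are hypothetical names.

```python
import re

# SMILES tokenization pattern: bracket atoms first, then two-letter halogens, etc.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

# Partial vocabulary: only the entries shown on the slide (the full dictionary has 591).
VOCAB = {
    "[PAD]": 0, "[COMP]": 6, "[BOC]": 9, "[EOC]": 10, "[UNK]": 11,
    "C": 16, "[C@H]": 33, "[N+]": 41, "[SiH2]": 127,
}


def tokenize(smiles):
    """Split a SMILES string into tokens; bracketed groups stay as single units."""
    return SMILES_PATTERN.findall(smiles)


def encode(smiles, max_len):
    """Map tokens to dictionary IDs, add [BOC]/[EOC], and pad to max_len (assumed layout)."""
    tokens = ["[BOC]"] + tokenize(smiles) + ["[EOC]"]
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))


tokens = tokenize("CC[SiH2]C")
ids = encode("CC[SiH2]C", max_len=8)
print(tokens, ids)
```

Because bracketed groups are matched before single characters, tokens such as [SiH2] and [COMP] are kept whole instead of being split into individual characters.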