
SmilesFormer: Language Model for Molecular Design, Elix, CBI 2022

Elix
October 27, 2022


  1. SmilesFormer: Language Model for
    Molecular Design
    Elix, Inc.
    Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo Japan | October 26, 2022
    Joshua Owoyemi, Ph.D & Nazim Medzhidov, Ph.D


  2. 2
    Introduction
● Generative models play a major role in discovering and designing new molecules, a key step in in silico drug discovery.
    ● A vast amount of molecular data is available, so generative models should be able to learn the concepts of valid and desirable molecules using a data-centric approach.
    Challenges:
    ● A model that can take advantage of the vast available datasets to learn efficient molecular representations.
    ● Methods to efficiently traverse the possible chemical space and generate molecules that satisfy multiple desired objectives.
    ● A demonstration of how the proposed method can be used in a practical use case.


  3. ● Language models can fill in an incomplete sentence.
    ● Recent language models scale to vast amounts of data.
    3
    Proposed Method: Language model for molecule generation
    The quick brown fox ___ __ the lazy dog
    Language
    Model
    The quick brown fox walks around the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox bumps into the lazy dog
    *
    Common Crawl Dataset contains nearly 1 trillion words[+]
    [+] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems,
    2020, vol. 33, pp. 1877–1901. [Online]. Available:
    https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf


  4. ● Language models can fill in an incomplete sentence.
    ● Language models scale to vast amounts of data.
    4
    Proposed Method: Language model for molecule generation
    The sentence-completion example on the previous slide carries over directly to molecules: a language model trained on SMILES strings can complete partial molecules in the same way.
    *
    ZINC Dataset contains over 200 million compounds[^]
    [^] J. J. Irwin et al., “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery,” Journal of Chemical
    Information and Modeling, vol. 60, no. 12, pp. 6065–6073, 2020, doi: 10.1021/acs.jcim.0c00675.


  5. ● We formulate the problem as a language modeling problem, where the model generates a full
    sequence (sentence) from incomplete ones, or from some prior knowledge.
    ● We employ a state-of-the-art language model to encode a molecular latent space by generating valid
    molecules (sequences) from fragments and fragment compositions.
    5
    Proposed Method: SmilesFormer Model
    [Figure: encoder (E) / decoder (D) architecture, a Transformer[1] combined with a VAE[2]. Inputs and a query sequence are encoded, the molecule is reconstructed, and the reconstruction loss is backpropagated.]
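The VAE half of the architecture samples latent variables via the standard reparameterization trick. A minimal sketch (function and variable names are illustrative, not from the SmilesFormer code):

```python
import math
import random

def reparameterize(mu, logvar, eps=None):
    """VAE reparameterization: z = mu + sigma * eps, with sigma = exp(0.5 * logvar).

    mu, logvar: per-dimension mean and log-variance of the latent posterior.
    eps: optional noise vector; drawn from N(0, 1) when omitted.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

# With zero noise the sample collapses to the posterior mean.
z = reparameterize([0.5, -1.0], [0.0, 0.0], eps=[0.0, 0.0])
```

Sampling through this deterministic transform is what lets the reconstruction loss backpropagate into the encoder.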


  6. ● Molecules are broken down into fragments using simple strategies.
    ● Model inputs can be (i) full sequences, (ii) fragments, (iii) fragment compositions, and (iv) dummy
    fragments: “*”.
    ● Fragments can be ‘composed’ using the [COMP] token.
    6
    Proposed Method: Model Inputs
    Fragmentation strategies: RECAP[+] fragmentation, fragment-on-bonds, and full-sequence transformation.
    [*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1 (example fragment composition)
    [+] X. Q. Lewell, D. B. Judd, S. P. Watson, and M. M. Hann, “RECAP - Retrosynthetic Combinatorial Analysis Procedure: A
    Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial
    Chemistry,” J. Chem. Inf. Comput. Sci., vol. 38, no. 3, pp. 511–522, 1998.
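The [COMP] token acts as a separator between fragments in an input string. A minimal parsing sketch over the composition shown above (splitting only; actual attachment-point chemistry at the “[*]” sites would need a cheminformatics toolkit such as RDKit, which is not shown here):

```python
COMP = "[COMP]"

def split_composition(s):
    """Split a fragment-composition input string on the [COMP] separator token."""
    return s.split(COMP)

# Fragment composition taken verbatim from the slide; "[*]" marks attachment points.
comp = "[*]c1ccc(S(N)(=O)=O)cc1[COMP][*]c1cc(C(F)(F)F)nn1[*][COMP][*]c1ccc(C)cc1"
frags = split_composition(comp)
```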


  7. 7
    Proposed Method: Molecule Optimization
    ● Molecules with desired properties can then be generated by exploring the latent space through a
    gradient-based strategy that guides the latent variables toward the region of chemical space satisfying the desired
    objectives.
    [Figure: latent space revision. A property network (PropNN[+]) scores the encoder output against multi-objective targets (e.g. PLogP > 1.0, QED > 0.8, SA > 8.0); its gradients move the latent variable from V_t-1 to V_t over successive optimization steps, with backtracking to a history of good latent variables (V_t-h) when a step degrades the score. The decoder then proposes samples from the query tokens (e.g. CC1=).]
    [+] J. Mueller, D. Gifford, and T. Jaakkola, “Sequence to Better Sequence: Continuous Revision of Combinatorial
    Structures,” in Proceedings of the 34th International Conference on Machine Learning, Aug. 2017, vol. 70, pp.
    2536–2544. [Online]. Available: https://proceedings.mlr.press/v70/mueller17a.html
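The revision loop can be sketched in one dimension with a toy property function standing in for the trained PropNN, and a step-acceptance rule standing in for the backtracking to previously good latent variables (all names and the learning rate here are illustrative, not from the actual implementation):

```python
def prop_score(v):
    # Toy stand-in for PropNN: a smooth property score peaking at v = 3.0.
    return -(v - 3.0) ** 2

def grad(v):
    # Analytic gradient of the toy score with respect to the latent variable.
    return -2.0 * (v - 3.0)

def optimize_latent(v, lr=0.4, steps=20):
    history = [v]                       # history of good latent variables
    best = prop_score(v)
    for _ in range(steps):
        v_new = v + lr * grad(v)        # gradient ascent on the property score
        if prop_score(v_new) > best:    # accept only improving steps
            best = prop_score(v_new)
            history.append(v_new)
            v = v_new
        else:
            v = history[-1]             # backtrack to the last good latent
    return v

v_final = optimize_latent(0.0)
```

In the real model the latent variable is a vector, the gradient comes from backpropagating through the property network, and the decoder turns each revised latent into candidate molecules.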


  8. 8
    Experiments
    • Distribution-directed Benchmarks:
    • Guacamol, MOSES
    • Goal-directed Benchmarks (Multi-objective optimization)
    • JNK3 Inhibition: Inhibition against c-Jun N-terminal Kinase-3 (JNK3), a member of the mitogen-activated
    protein kinase family that is responsive to stress stimuli such as cytokines, ultraviolet irradiation, heat
    shock, and osmotic shock.
    • GSK3β Inhibition: An enzyme associated with increased susceptibility to bipolar disorder.
    • Donepezil Rediscovery: Real use-case demonstration of drug rediscovery pipeline for Donepezil, a known
    acetylcholinesterase (AChE) inhibitor.
    • Backbone-constrained molecule optimization: Useful in lead optimization stages
    • Experiments with model parameters:
    • Model stochasticity
    • Fragmentation strategies
    • Full sequence suggestions
    • Start tokens constraints
    • Fragments analysis


  9. 9
    Results: Distribution Benchmarks
    Guacamol distribution benchmarks:
    | Metric | AAE | Graph MCTS | Random Sampler | SMILES LSTM | VAE | ORGAN | SmilesFormer |
    | Validity | 0.822 | 1 | 1 | 0.959 | 0.87 | 0.379 | 1 |
    | Uniqueness | 1 | 1 | 0.997 | 1 | 0.999 | 0.841 | 1 |
    | Novelty | 0.998 | 0.994 | 0 | 0.912 | 0.974 | 0.686 | 0.9958 |
    | KL divergence | 0.886 | 0.522 | 0.998 | 0.991 | 0.982 | 0.267 | 0.8722 |
    | Fréchet ChemNet Distance | 0.529 | 0.015 | 0.929 | 0.913 | 0.863 | 0 | 0.1537 |
    MOSES benchmarks:
    | Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | FCD (↓) Test / TestSF | SNN (↑) Test / TestSF | Frag (↑) Test / TestSF | Scaf (↑) Test / TestSF | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
    | Train | 1 | 1 | 1 | 0.008 / 0.4755 | 0.6419 / 0.5859 | 1 / 0.9986 | 0.9907 / 0 | 0.8567 | 0.8508 | 1 | 1 |
    | HMM | 0.076±0.0322 | 0.623±0.1224 | 0.5671±0.1424 | 24.4661±2.5251 / 25.4312±2.5599 | 0.3876±0.0107 / 0.3795±0.0107 | 0.5754±0.1224 / 0.5681±0.1218 | 0.2065±0.0481 / 0.049±0.018 | 0.8466±0.0403 | 0.8104±0.0507 | 0.9024±0.0489 | 0.9994±0.001 |
    | CharRNN | 0.9748±0.0264 | 1.0±0.0 | 0.9994±0.0003 | 0.0732±0.0247 / 0.5204±0.0379 | 0.6015±0.0206 / 0.5649±0.0142 | 0.9998±0.0002 / 0.9983±0.0003 | 0.9242±0.0058 / 0.1101±0.0081 | 0.8562±0.0005 | 0.8503±0.0005 | 0.9943±0.0034 | 0.8419±0.0509 |
    | AAE | 0.9368±0.0341 | 1.0±0.0 | 0.9973±0.002 | 0.5555±0.2033 / 1.0572±0.2375 | 0.6081±0.0043 / 0.5677±0.0045 | 0.991±0.0051 / 0.9905±0.0039 | 0.9022±0.0375 / 0.0789±0.009 | 0.8557±0.0031 | 0.8499±0.003 | 0.996±0.0006 | 0.7931±0.0285 |
    | VAE | 0.9767±0.0012 | 1.0±0.0 | 0.9984±0.0005 | 0.099±0.0125 / 0.567±0.0338 | 0.6257±0.0005 / 0.5783±0.0008 | 0.9994±0.0001 / 0.9984±0.0003 | 0.9386±0.0021 / 0.0588±0.0095 | 0.8558±0.0004 | 0.8498±0.0004 | 0.997±0.0002 | 0.6949±0.0069 |
    | JTN-VAE | 1.0±0.0 | 1.0±0.0 | 0.9996±0.0003 | 0.3954±0.0234 / 0.9382±0.0531 | 0.5477±0.0076 / 0.5194±0.007 | 0.9965±0.0003 / 0.9947±0.0002 | 0.8964±0.0039 / 0.1009±0.0105 | 0.8551±0.0034 | 0.8493±0.0035 | 0.976±0.0016 | 0.9143±0.0058 |
    | LatentGAN | 0.8966±0.0029 | 1.0±0.0 | 0.9968±0.0002 | 0.2968±0.0087 / 0.8281±0.0117 | 0.5371±0.0004 / 0.5132±0.0002 | 0.9986±0.0004 / 0.9972±0.0007 | 0.8867±0.0009 / 0.1072±0.0098 | 0.8565±0.0007 | 0.8505±0.0006 | 0.9735±0.0006 | 0.9498±0.0006 |
    | SmilesFormer | 1.0±0.0 | 1.0±0.0 | 1.0±0.0 | 15.665±0.04 / 16.467±0.001 | 0.4025±0.003 / 0.3903±0.005 | 0.8373±0.2 / 0.8583±0.0002 | 0.1438±0.004 / 0.06336±0.01 | 0.9144±0.0 | 0.9020±0.0 | 0.4947±0.003 | 0.99994±0.00001 |
    • Guacamol[+]
    • MOSES[^]
    [+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo
    Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi:
    10.1021/acs.jcim.8b00839.
    [^] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation
    Models,” arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available:
    http://arxiv.org/abs/1811.12823
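The distribution metrics in the tables above reduce to simple set ratios over the generated SMILES. A minimal sketch (the `is_valid` checker here is a hypothetical stand-in; the actual benchmarks validate molecules by parsing them with a chemistry toolkit such as RDKit):

```python
def distribution_metrics(generated, train_set, is_valid):
    """Validity, uniqueness and novelty as ratios over a list of generated SMILES."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

# Toy checker for illustration only; not a real SMILES validator.
metrics = distribution_metrics(
    generated=["CCO", "CCO", "CCN", "xx"],
    train_set=["CCO"],
    is_valid=lambda s: s.isalnum() and s[0].isupper(),
)
```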


  10. 10
    Results: Goal-directed Benchmarks (Multi-objective optimization)
    [+] N. Brown, M. Fiscato, M. H. S. Segler, and A. C. Vaucher, “GuacaMol: Benchmarking Models for de Novo
    Molecular Design,” J. Chem. Inf. Model., vol. 59, no. 3, pp. 1096–1108, Mar. 2019, doi:
    10.1021/acs.jcim.8b00839.
    [^] D. Polykovskiy et al., “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation
    Models,” arXiv:1811.12823 [cs, stat], Oct. 2020, Accessed: Nov. 21, 2021. [Online]. Available:
    http://arxiv.org/abs/1811.12823
    Single Objective Optimization | Multi Objective Optimization
    Oracle efficiency during molecule optimization.
    [Figure: top molecules for QED+SA+JNK3+GSK3β; the scores shown are SA, QED, JNK3, and GSK3β, respectively.]
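Goal-directed benchmarks of this kind combine several per-property scores, each normalized to [0, 1], into one objective. A common aggregation, and the one assumed in this sketch, is the geometric mean (as used in the GuacaMol goal-directed suite), so a single failed objective pulls the whole score toward zero:

```python
import math

def multi_objective_score(scores):
    """Geometric mean of per-objective scores, each assumed to lie in [0, 1]."""
    assert all(0.0 <= s <= 1.0 for s in scores)
    return math.prod(scores) ** (1.0 / len(scores))

# Illustrative per-objective scores for one candidate: QED, SA, JNK3, GSK3B.
combined = multi_objective_score([0.8, 0.9, 0.5, 0.5])
```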


  11. 11
    Results: Donepezil Rediscovery
    ● Model pretraining
    ○ Training on a large-scale dataset
    ○ Objective: learn the SMILES vocabulary of the possible chemical space
    ○ Dataset: ChEMBL (2.1M mols); post-donepezil-discovery AChE-active molecules were excluded
    ● Model fine-tuning on a selected dataset
    ○ Dataset curated for molecules active on AChE
    ○ Exclusion of the donepezil scaffold and related molecules
    ● Molecule generation and optimization: generate molecules and optimize them against multi-property objectives:
    (1) SA < 1, target = 0; (2) QED > 0, target = 1; (3) PLogP > 0, target = 40;
    (4) molecular weight between 250 and 750; (5) number of hydrogen bonds between 5 and 10;
    (6) topological polar surface area (TPSA) between 75 and 150; (7) number of rotatable bonds between 5 and 10;
    (8) similarity to tacrine Ci < 1, target = 0; (9) similarity to physostigmine < 1, target = 0;
    (10) similarity to rivastigmine Ci < 1, target = 0; and (11) AChE inhibitor activity pIC50
    (negative log IC50), determined by a predictive model trained on AChE-inhibitor-active compounds, Ci > 1, target = 1
    ● Molecule post-processing and analysis
    ○ Novelty ranking
    ○ Activity ranking
    ○ Scaffold analysis
    [Figure: four-step pipeline diagram, grouped into Stage 1 and Stage 2.]
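Several of the stage-2 objectives above are range constraints (e.g. molecular weight between 250 and 750). A minimal filtering sketch; the property names, the helper functions, and the example values (roughly donepezil-like) are illustrative, not taken from the actual pipeline:

```python
def in_range(value, lo, hi):
    return lo <= value <= hi

def passes_constraints(props, constraints):
    """props: property name -> value; constraints: property name -> (lo, hi)."""
    return all(in_range(props[name], lo, hi) for name, (lo, hi) in constraints.items())

constraints = {
    "mol_weight": (250, 750),       # molecular weight between 250 and 750
    "h_bonds": (5, 10),             # number of hydrogen bonds
    "tpsa": (75, 150),              # topological polar surface area
    "rotatable_bonds": (5, 10),     # number of rotatable bonds
}
ok = passes_constraints(
    {"mol_weight": 379.5, "h_bonds": 6, "tpsa": 90.1, "rotatable_bonds": 6},
    constraints,
)
```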


  12. 12
    Results: Donepezil Rediscovery
    Top 10 scaffolds from the generated molecules. The numbers signify how many times each scaffold appears out of 10 experiment runs.
    [Figure: Donepezil | Donepezil scaffold | Identified molecule (Compound 14 from the original donepezil paper[+])]
    [+] H. Sugimoto, H. Ogura, Y. Arai, Y. Iimura, and Y. Yamanishi, “Research and Development of Donepezil
    Hydrochloride, a New Type of Acetylcholinesterase Inhibitor,” Japanese Journal of Pharmacology, vol. 89, no.
    1, pp. 7–20, 2002, doi: 10.1254/jjp.89.7.
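Scaffold-frequency counts like the “appears out of 10 runs” numbers above can be produced with a counter over per-run scaffold sets (the scaffold extraction itself, e.g. Murcko scaffolds via a chemistry toolkit, is assumed to have happened upstream; the SMILES below are illustrative):

```python
from collections import Counter

def scaffold_run_counts(runs):
    """runs: one set of scaffold SMILES per experiment run.

    Returns scaffold -> number of runs in which it appeared at least once.
    """
    counts = Counter()
    for scaffolds in runs:
        counts.update(set(scaffolds))  # count each scaffold once per run
    return counts

# Three toy runs; benzene appears in all of them.
runs = [{"c1ccccc1", "c1ccncc1"}, {"c1ccccc1"}, {"c1ccccc1", "C1CCCCC1"}]
top = scaffold_run_counts(runs).most_common(2)
```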


  13. 13
    Results: Backbone-constrained molecule generation
    Our model is able to learn to generate molecules with constrained backbones.
    The molecule in the green box is the input backbone. The generated SMILES
    string and Tanimoto similarity score to the input backbone are shown.
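The Tanimoto similarity reported alongside each generated molecule is the standard set overlap over fingerprint on-bits, |A ∩ B| / |A ∪ B|. A minimal sketch (fingerprint generation is assumed upstream; the bit sets below are illustrative):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy fingerprints: 3 shared bits out of 5 total bits -> similarity 0.6.
sim = tanimoto({1, 2, 3, 5}, {2, 3, 5, 8})
```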


  14. 14
    Results: Backbone-constrained molecule generation
    (Additional examples; same layout as the previous slide.)


  15. 15
    Results: Experiments with model parameters
    Molecule generation without fragmentation
    Molecule generation with fragmentation


  16. 16
    Conclusions
    ● We proposed and demonstrated a language model that learns the SMILES vocabulary in order to generate novel
    compounds.
    ● The trained model can be used in molecular optimization tasks with customizable parameters.
    ● We demonstrated the application of our model in various de novo molecular design tasks, including a real use
    case of donepezil rediscovery.
    Future Directions:
    ● Property-aware molecule generation
    ● Matched molecular pairs generation


  17. www.elix-inc.com


  18. 18
    Appendix


  19. 19
    Input Tokenization
    • SMILES strings are tokenized based on a predefined dictionary.[+]
    • A dictionary of 591 tokens is used.
    • Tokens are characters or groups of characters that form a single unit.
    | Token | Dict ID | Note |
    | C | 16 | |
    | [[email protected]] | 33 | |
    | [N+] | 41 | |
    | [SiH2] | 127 | |
    | [PAD] | 0 | special, for padding |
    | [COMP] | 6 | special, for composition |
    | [BOC] | 9 | special, for beginning of compound |
    | [EOC] | 10 | special, for end of compound |
    | [UNK] | 11 | special, for unknown tokens |
    [+] P. Schwaller et al., “Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction,” ACS
    Cent. Sci., vol. 5, no. 9, pp. 1572–1583, Sep. 2019, doi: 10.1021/acscentsci.9b00576.
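Dictionary-based SMILES tokenization of this kind is typically implemented with a single regex that keeps bracket atoms, two-letter elements (Br, Cl), and two-digit ring bonds as single tokens. The pattern below is the commonly used one popularized by Schwaller et al.; the actual SmilesFormer dictionary (591 tokens, including the special tokens above) may differ in detail:

```python
import re

# Each alternative is one token class: bracket atoms, Br/Cl, organic-subset
# atoms, bonds/branches, and %nn ring-bond labels.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # Round-trip check: every character must belong to exactly one token.
    assert "".join(tokens) == smiles, "input contains unrecognized characters"
    return tokens

toks = tokenize("C[N+](C)(C)Cc1ccccc1Br")
```

Note how "[N+]" and "Br" each survive as a single token rather than being split character by character.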


  20. 20
    Model Training and Evaluation Details
    Model parameters:
    ● 12 layers
    ● 12 heads
    ● Latent space embedding dimension: 288
    ● 0.15 dropout
    Training dataset constraints:
    ● Max molecular weight 900
    ● Max token length 100
    Datasets collection:
    | Name | Size | Description |
    | Small Datasets | 872,557 | Small datasets from predictive model benchmarks |
    | ChEMBL 3.0 | 3,429,071 | Elix Small + Guacamol (train) + Molsets (train) |


  21. 21
    Results: Experiments with model parameters
