Towards Generating Synthesizable de novo small ...

Elix
October 26, 2021

Towards Generating Synthesizable de novo small Molecules, Elix, CBI 2021


Transcript

  1. Outline

    1. Introduction - Generative Models
    2. Generation Paradigm
      a. Generation by Bias Training Set
      b. Generation by Heuristics
      c. Generation by Oracle
      d. Generation by Explicit Constraints
    3. Synthesizability vs. Novelty - The Challenges
      a. Retrosynthesis
      b. Forward Prediction
    4. Conclusion
  2. Introduction: Generative Models

    A machine learning, probabilistic perspective.

    [Figure labels: Graph Representation of a Molecule, Encoder, Model Estimation, Decoder, Decoded Molecule]

    By definition, a generative model learns the joint distribution p(x, z), where x is the data and z is the latent variable.
  3. Generation by Bias Training Set

    Force the model to learn from synthesizable starting molecules.

    [Figure: Molecular Space, divided into Synthesizable Molecules and Non-Synthesizable Molecules]

    Advantages:
    - We explore molecules that are synthesizable from the start, which gives us a higher chance of success
    - Fast and easy to train

    Disadvantages:
    - Local optima problem
    - Limited novelty of molecules
  4. Generation by Heuristics

    A common approach: train a generative model by teaching it what is synthesizable and what is not.

    Example: change the loss function to incorporate a synthesizability loss, then backpropagate the combined loss.

    Models that utilize rewards (reinforcement learning) may treat these reward signals as additional constraints (e.g. REINVENT).

    Advantages:
    - Easy to implement

    Disadvantages:
    - Difficult to train
    - Heuristic may be naive
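As a minimal sketch of the "incorporate synthesizability loss" idea, the snippet below adds a weighted heuristic penalty to an existing loss value. The names `combined_loss`, `penalty_fn`, and `weight` are hypothetical and do not come from the slides. In a real training loop the penalty would have to be differentiable, or be used as a reinforcement learning reward instead.

```python
from typing import Callable, Sequence


def combined_loss(
    base_loss: float,
    generated: Sequence[str],
    penalty_fn: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Return base_loss plus weight times the mean heuristic penalty.

    penalty_fn is a hypothetical synthesizability heuristic that maps a
    SMILES string to a penalty (higher means harder to synthesize).
    """
    if not generated:
        return base_loss
    mean_penalty = sum(penalty_fn(s) for s in generated) / len(generated)
    return base_loss + weight * mean_penalty


if __name__ == "__main__":
    # Toy penalties, made up purely for illustration.
    toy_penalties = {"CCO": 0.2, "c1ccccc1": 0.4}
    loss = combined_loss(1.0, ["CCO", "c1ccccc1"], toy_penalties.__getitem__)
    print(loss)
```

The `weight` parameter controls how strongly the heuristic steers generation, which is one reason this setup can be difficult to tune.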
  5. Generation by Oracle

    A common approach: train a generative model by teaching it what is synthesizable and what is not.

    Keep the loss function the same, but replace the heuristic with a proxy/oracle (computer-aided synthesis planning, CASP).

    Advantages:
    - Easy to implement
    - Less likely to fall into local optima

    Disadvantages:
    - Accuracy is subject to the quality of the CASP tool
    - Computationally expensive
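The oracle variant can be sketched as follows, assuming the oracle is any callable that answers "was a synthesis route found?" for a SMILES string. `CachedOracle` and `oracle_reward` are hypothetical names, and the stand-in oracle in the usage lines is a toy, not a real CASP tool. Because oracle calls are expensive, the sketch memoizes the answers.

```python
from typing import Callable, Dict


class CachedOracle:
    """Wrap an expensive synthesizability oracle and memoize its answers."""

    def __init__(self, oracle: Callable[[str], bool]) -> None:
        self._oracle = oracle
        self._cache: Dict[str, bool] = {}
        self.calls = 0  # number of real (uncached) oracle invocations

    def __call__(self, smiles: str) -> bool:
        if smiles not in self._cache:
            self.calls += 1
            self._cache[smiles] = self._oracle(smiles)
        return self._cache[smiles]


def oracle_reward(property_score: float, smiles: str,
                  oracle: Callable[[str], bool]) -> float:
    """Keep the property score only if the oracle finds a route, else 0."""
    return property_score if oracle(smiles) else 0.0


if __name__ == "__main__":
    # Toy stand-in for a CASP call: pretend only ethanol has a route.
    oracle = CachedOracle(lambda smiles: smiles == "CCO")
    print(oracle_reward(0.8, "CCO", oracle))
    print(oracle_reward(0.8, "CCN", oracle))
    print(oracle.calls)
```

Caching only helps when the generator revisits molecules. The per-molecule cost of the oracle itself remains.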
  6. Generation by Explicit Constraints

    Synthesizable compounds are not necessarily purchasable or easily accessible.

    [Figure: Molecular Space, divided into Synthesizable Molecules and Non-Synthesizable Molecules, with Purchasable Compounds marked]

    Advantages:
    - Multi-step synthesizability can be considered
    - Generated molecules are guaranteed to be more accessible
    - Most purchasable compounds are public knowledge

    Disadvantages:
    - Search space becomes smaller
    - Computationally more expensive
    - More dependencies
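One way to picture the purchasability constraint is a check that every leaf of a multi-step route is in a catalog of purchasable compounds. The sketch below assumes a route is given as a tree. `RouteNode` and `is_accessible` are hypothetical names, and the catalog is a toy set rather than a real vendor list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RouteNode:
    """A molecule in a synthesis route, with the precursors used to make it."""
    smiles: str
    precursors: List["RouteNode"] = field(default_factory=list)


def is_accessible(node: RouteNode, purchasable: Set[str]) -> bool:
    """True if the molecule can be bought, or made from accessible precursors."""
    if node.smiles in purchasable:
        return True
    if not node.precursors:
        # A leaf that cannot be bought ends the route unsuccessfully.
        return False
    return all(is_accessible(p, purchasable) for p in node.precursors)


if __name__ == "__main__":
    # Aspirin from salicylic acid and acetic anhydride (one-step toy route).
    salicylic_acid = RouteNode("O=C(O)c1ccccc1O")
    acetic_anhydride = RouteNode("CC(=O)OC(C)=O")
    aspirin = RouteNode("CC(=O)Oc1ccccc1C(=O)O",
                        [salicylic_acid, acetic_anhydride])
    catalog = {"O=C(O)c1ccccc1O", "CC(=O)OC(C)=O"}  # toy catalog
    print(is_accessible(aspirin, catalog))
```

The exact string comparison assumes both the route and the catalog use the same canonical SMILES form, which is an extra dependency in practice.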
  7. General Takeaway

    The more constraints (synthesizability), the smaller the search space, leaving fewer chances of discovering novel molecules.

    The more novel a molecule gets, the more difficult it is to synthesize. Such molecules tend to be unrealistic (no viable synthesis pathway, or too expensive to synthesize).

    Novelty (exploration):
    - Discover unknown molecules from a large chemical search space
    - Identify the reverse synthesis pathway of such novel compounds
    - Training for novelty requires an equally powerful retrosynthesis model

    Synthesizability (exploitation):
    - Predict the next molecule from a synthesizable compound
    - Perform combinatorial matching for new compound discovery
    - Training for synthesizability requires an equally powerful forward reaction prediction model

    The compromise? Exploitation vs. exploration.
  8. Challenges - Retrosynthesis

    Training for novelty requires an equally powerful retrosynthesis model.

    - Retrosynthesis prediction has room for error
    - Training a retrosynthesis model is challenging
    - Requires reaction prediction data
    - Limited to datasets like USPTO or proprietary datasets like Reaxys
    - These datasets are sometimes inaccurate

    [Figure: Example of the AiZynthFinder GUI]
  9. Challenges - Forward Reaction Predictive Model

    [Figure: Example of the RexGen framework]

    - In its simplest form, a template is applied to certain reactants to obtain the resulting product.
    - More complex forward prediction models require large training data
    - The datasets are similar to those used for retrosynthesis
    - They share the same problems as retrosynthesis datasets
    - Ideal conditions and mixture ratios are often absent
    - Some models like RexGen are available to the public, but model training is not replicable
  10. Conclusion

    Synthesizable molecules are crucial for drug development, but they come with several caveats:
    - They restrict the search space considerably
    - They rely on forward reaction predictive models for multi-step generation

    The advantages are:
    - More realistic
    - Starting compounds could be more procurable

    Generating novel de novo molecules is useful for discovering new molecules, but it comes with several caveats:
    - They are usually one-step generations (reactants -> end product)
    - They rely on retrosynthesis for realistic applications

    The advantages are:
    - More room for creativity
    - Molecules may have better properties

    Alternative considerations:
    - Design better heuristic methods for assessing synthesizability
  11. What about Heuristics?

    Heuristics are fast, but often simplified:
    • SA Score measures the complexity of compounds using a fragment-contribution approach; rarer fragments are taken as an indication of lower synthesizability
    • SC Score is a learned synthetic complexity score computed by a neural network trained on reaction data (Reaxys)
    • SMILES length is a naive heuristic that associates longer SMILES strings with greater synthetic difficulty

    We should consider alternatives or design better heuristics, answering these questions:
    • What specifically contributes to synthetic complexity?
    • How does one classify the synthesizability of compounds that are easily obtainable (off-the-shelf), but difficult to synthesize?
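The SMILES length heuristic is simple enough to sketch directly. The function names below are hypothetical. The last lines of the usage example illustrate one reason the heuristic is naive: two valid SMILES strings for the same molecule (ethanol) get different scores.

```python
from typing import List


def smiles_length_score(smiles: str) -> int:
    """Naive heuristic: a longer SMILES string is assumed harder to synthesize."""
    return len(smiles)


def rank_by_smiles_length(smiles_list: List[str]) -> List[str]:
    """Order molecules from 'easiest' to 'hardest' under the naive heuristic."""
    return sorted(smiles_list, key=smiles_length_score)


if __name__ == "__main__":
    molecules = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1"]
    print(rank_by_smiles_length(molecules))
    # Same molecule (ethanol), written two ways, scored differently.
    print(smiles_length_score("OCC"), smiles_length_score("C(O)C"))
```

Canonicalizing the SMILES first would remove this particular inconsistency, but not the underlying weakness that string length says little about the chemistry involved.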