Towards Generating Synthesizable de novo small Molecules, Elix, CBI 2021

Towards Generating Synthesizable de novo small Molecules
Jun Jin Choong, PhD
Research Engineer, Elix Inc.
2021/10/26

2 Outline
1. Introduction - Generative Models
2. Generation Paradigm
   a. Generation by Bias Training Set
   b. Generation by Heuristics
   c. Generation by Oracle
   d. Generation by Explicit Constraints
3. Synthesizability vs. Novelty - The Challenges
   a. Retrosynthesis
   b. Forward Prediction
4. Conclusion

3 Introduction - Generative Models
[Figure: graph representation of a molecule -> encoder -> model estimation -> decoder -> decoded molecule]
A machine learning, probabilistic perspective: by definition, a generative model learns the joint distribution $p(x, z)$, where $x$ is the data and $z$ is the latent variable.
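
As a hedged illustration of the statement above: assuming the encoder/decoder pair on this slide corresponds to a standard latent-variable model such as a variational autoencoder (the slide does not name one), the joint distribution over data $x$ and latent variable $z$ factorizes into a prior and a decoder, and the encoder approximates the posterior. The parameter symbols $\theta$ and $\phi$ are notation introduced here, not taken from the slides.

```latex
p_\theta(x, z) = p(z)\, p_\theta(x \mid z),
\qquad
q_\phi(z \mid x) \approx p_\theta(z \mid x)
```

Under this reading, generating a new molecule means sampling $z \sim p(z)$ and decoding it with $p_\theta(x \mid z)$.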

4 Generation Paradigm
[Figure: various combinations of synthesizability configurations. Adapted from Gao et al. (2020).]

5 Generation by Bias Training Set
Force the model to learn from synthesizable starting molecules.
[Figure: Molecular Space, with regions labeled Synthesizable Molecules and Non-Synthesizable Molecules]
Advantages:
- We explore molecules that are synthesizable from the start, which gives a better chance of success
- Fast and easy to train
Disadvantages:
- Local optima problem
- Limited novelty of molecules
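
A minimal sketch of biasing the training set, assuming molecules are given as SMILES strings and that some synthesizability predicate is available. The function name `bias_training_set` and the lookup set `KNOWN_SYNTHESIZED` are hypothetical stand-ins, not part of the talk.

```python
from typing import Callable, Iterable, List


def bias_training_set(
    smiles: Iterable[str], is_synthesizable: Callable[[str], bool]
) -> List[str]:
    """Keep only molecules that the supplied predicate marks as synthesizable."""
    return [s for s in smiles if is_synthesizable(s)]


# Hypothetical predicate: membership in a set of molecules known to have been made.
KNOWN_SYNTHESIZED = {"CCO", "CC(=O)O", "c1ccccc1"}

candidates = ["CCO", "C1=CC=C1", "CC(=O)O", "c1ccccc1"]
training_set = bias_training_set(candidates, KNOWN_SYNTHESIZED.__contains__)
print(training_set)
```

The generative model is then trained on `training_set` only, which is why this approach is cheap but limits how far generated molecules can stray from known chemistry.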

6 Generation by Heuristics
A common approach: train a generative model by teaching it what is synthesizable and what is not.
Example: change the loss function to incorporate a synthesizability loss, then backpropagate the loss.
Models that utilize rewards (reinforcement learning), e.g. REINVENT, may treat these reward signals as additional constraints.
Advantages:
- Easy to implement
Disadvantages:
- Difficult to train
- The heuristic may be naive
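
The idea of adding a synthesizability term to the loss can be sketched as follows. This is a plain-Python illustration under stated assumptions: each generated molecule has a heuristic score in [0, 1] where 1 means "easy to synthesize", and the weight of 0.5 is arbitrary. The function names are hypothetical, and a real model would need a differentiable penalty or would use the score as a reinforcement learning reward.

```python
from typing import Sequence


def synthesizability_penalty(scores: Sequence[float]) -> float:
    """Mean penalty over a batch; scores are assumed to lie in [0, 1], 1 = easy."""
    return sum(1.0 - s for s in scores) / len(scores)


def total_loss(base_loss: float, scores: Sequence[float], weight: float = 0.5) -> float:
    """Original model loss plus a weighted synthesizability penalty."""
    return base_loss + weight * synthesizability_penalty(scores)


# Two generated molecules: one scored as easy (1.0), one as moderately hard (0.5).
loss = total_loss(base_loss=2.0, scores=[1.0, 0.5], weight=0.5)
print(loss)
```

The weight controls the trade-off: a larger value pushes the model towards molecules the heuristic likes, which is only as good as the heuristic itself.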

7 Generation by Oracle
A common approach: train a generative model by teaching it what is synthesizable and what is not.
Keep the loss function the same, but replace the heuristic with a proxy/oracle (computer-aided synthesis planning, CASP).
Advantages:
- Easy to implement
- Less likely to fall into local optima
Disadvantages:
- Accuracy is subject to the quality of the CASP tool
- Computationally expensive
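
Because oracle calls are expensive, one plausible way to plug an oracle into training is to wrap it as a cached reward function. The sketch below assumes the oracle is any callable that says whether a route was found for a SMILES string; `toy_casp` and `ROUTE_FOUND` are hypothetical placeholders and do not call a real CASP tool.

```python
from functools import lru_cache
from typing import Callable, List


def make_oracle_reward(casp_oracle: Callable[[str], bool]) -> Callable[[str], float]:
    """Wrap an expensive oracle as a reward function that caches repeated queries."""

    @lru_cache(maxsize=None)
    def reward(smiles: str) -> float:
        return 1.0 if casp_oracle(smiles) else 0.0

    return reward


# Hypothetical stand-in for a CASP tool: a lookup of molecules with a known route.
ROUTE_FOUND = {"CCO"}
calls: List[str] = []


def toy_casp(smiles: str) -> bool:
    calls.append(smiles)  # record each real oracle invocation
    return smiles in ROUTE_FOUND


reward = make_oracle_reward(toy_casp)
scores = [reward(s) for s in ["CCO", "CCO", "C1=CC=C1"]]
print(scores, len(calls))
```

The repeated query for `"CCO"` is served from the cache, so the oracle runs only twice for three reward evaluations.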

8 Generation by Explicit Constraints
Synthesizable compounds are not necessarily purchasable or easily accessible.
[Figure: Molecular Space, with regions labeled Synthesizable Molecules, Non-Synthesizable Molecules, and Purchasable Compounds]
Advantages:
- Multi-step synthesizability can be considered
- Generated molecules are guaranteed to be more accessible
- Most purchasable compounds are public knowledge
Disadvantages:
- Search space becomes smaller
- Computationally more expensive
- More dependencies
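
One way to read "explicit constraints" is that molecules are only ever produced by reacting purchasable building blocks. The toy sketch below illustrates that idea with a single string-level amide coupling rule. The rule, the building-block lists, and the function names are all assumptions made for illustration; a real system would apply reaction templates with a cheminformatics toolkit rather than string matching.

```python
from itertools import product
from typing import List, Optional


def amide_coupling(acid: str, amine: str) -> Optional[str]:
    """Toy template: SMILES ending in C(=O)O plus SMILES starting with N -> amide.

    This is string matching only and does not validate the chemistry.
    """
    if acid.endswith("C(=O)O") and amine.startswith("N"):
        return acid[:-1] + amine  # drop the hydroxyl O, attach the amine N
    return None


def enumerate_products(acids: List[str], amines: List[str]) -> List[str]:
    """Generate only molecules reachable in one step from the given building blocks."""
    results = []
    for acid, amine in product(acids, amines):
        coupled = amide_coupling(acid, amine)
        if coupled is not None:
            results.append(coupled)
    return results


# Hypothetical purchasable building blocks.
PURCHASABLE_ACIDS = ["CC(=O)O", "c1ccccc1C(=O)O"]
PURCHASABLE_AMINES = ["NCC", "Nc1ccccc1"]

products = enumerate_products(PURCHASABLE_ACIDS, PURCHASABLE_AMINES)
print(products)
```

Every generated molecule comes with its own one-step recipe, which is the appeal of this paradigm; the cost is that nothing outside the reach of the catalogue and the rules can ever be proposed.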

9 General Takeaway
The more constraints (synthesizability) we impose, the smaller the search space, and the lower the chance of discovering novel molecules.
The more novel a molecule gets, the more difficult it is to synthesize. Such molecules tend to be unrealistic (no viable synthesis pathway, or too expensive to synthesize).
Novelty (exploration):
- Discover unknown molecules from a large chemical search space
- Identify the reverse synthesis pathway of such novel compounds
- Training for consideration of novelty requires an equally powerful retrosynthesis model
Synthesizability (exploitation):
- Predict the next molecule from a synthesizable compound
- Perform combinatorial matching for new compound discovery
- Training for synthesizability requires an equally powerful predictive model for forward reactions
The compromise? Exploitation vs. exploration

10 Challenges - Retrosynthesis
Training for consideration of novelty requires an equally powerful retrosynthesis model.
- Retrosynthesis prediction has room for error
- Training a retrosynthesis model is challenging
- Requires reaction prediction data
- Limited to datasets like USPTO or proprietary datasets like Reaxys
- These datasets are sometimes inaccurate
[Figure: example of the AiZynthFinder GUI]

11 Challenges - Forward Reaction Predictive Model
[Figure: example of the RexGen framework]
- In its simplest form, a template is applied to certain reactants to obtain the resulting reaction product
- More complex forward prediction models require large amounts of training data
- Datasets are similar to those used for retrosynthesis
- They share the same problems as retrosynthesis datasets
- Ideal conditions and mixture ratios are often absent
- Some models like RexGen are publicly available, but their training is not reproducible

12 Conclusion
Synthesizable molecules are crucial for drug development, but they come with several caveats:
- They restrict the search space considerably
- They rely on forward reaction predictive models for multi-step generation
The advantages are:
- More realistic
- Starting compounds could be more procurable
Generating novel de novo molecules is useful for discovering new molecules, but it comes with several caveats:
- They are usually one-step generations (reactants -> end product)
- They rely on retrosynthesis for realistic applications
The advantages are:
- More room for creativity
- Molecules may have better properties
Alternative considerations:
- Design better heuristic methods for assessing synthesizability

13 What about Heuristics?
Heuristics are fast, but often simplified:
• SA Score measures the complexity of compounds using a fragment-contribution approach; rarer fragments are taken as an indication of lower synthesizability
• SC Score is a learned synthetic complexity score computed by a neural network trained on reaction data (Reaxys)
• SMILES length is a naive heuristic that associates longer SMILES strings with greater synthetic difficulty
We should consider alternatives or design better heuristics, answering the questions:
• What specifically contributes to synthetic complexity?
• How does one classify the synthesizability of compounds that are easily obtainable (off-the-shelf), but difficult to synthesize?
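
The SMILES length heuristic is simple enough to sketch directly. The function names below are hypothetical; the example also hints at why the heuristic is naive, since two valid SMILES strings for the same molecule (ethanol) receive different scores.

```python
from typing import List


def smiles_length_score(smiles: str) -> int:
    """Naive complexity proxy: longer SMILES string = assumed harder to synthesize."""
    return len(smiles)


def rank_by_smiles_length(molecules: List[str]) -> List[str]:
    """Order molecules from 'easiest' to 'hardest' under the length heuristic."""
    return sorted(molecules, key=smiles_length_score)


ranked = rank_by_smiles_length(["CC(C)Cc1ccc(cc1)C(C)C(=O)O", "CCO", "c1ccccc1"])
print(ranked)

# Two valid SMILES strings for ethanol get different scores.
print(smiles_length_score("CCO"), smiles_length_score("C(C)O"))
```

Canonicalizing the SMILES first would remove that particular inconsistency, but the score would still say nothing about which bonds are actually hard to form.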

14 Thank you for your attention!

15 References
[1] W. Gao and C. W. Coley, “The Synthesizability of Molecules Proposed by Generative Models,” J. Chem. Inf. Model., Apr. 2020.
[2] J. Bradshaw, B. Paige, M. J. Kusner, M. Segler, and J. M. Hernández-Lobato, “A Model to Search for Synthesizable Molecules,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 7937–7949.
[3] J. Bradshaw, B. Paige, M. J. Kusner, M. Segler, and J. M. Hernández-Lobato, “Barking up the right tree: an approach to search over molecule synthesis DAGs,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[4] K. Korovina et al., “ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations,” in International Conference on Artificial Intelligence and Statistics, Jun. 2020, pp. 3393–3403.

16 Elix Inc. http://ja.elix-inc.com/
