Generation by Bias Training Set
   b. Generation by Heuristics
   c. Generation by Oracle
   d. Generation by Explicit Constraints
3. Synthesizability vs. Novelty - The Challenges
   a. Retrosynthesis
   b. Forward Prediction
4. Conclusion
machine learning, probabilistic perspective
[Figure labels: Encoder, Model Estimation, Decoder, Decoded Molecule]
By definition, a generative model learns the joint distribution $p(x, z)$, where $x$ is the data and $z$ is the latent variable.
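Writing $x$ for the data (a molecule) and $z$ for the latent variable (notation assumed here, since the slide's symbols did not survive), the joint distribution can be factorized as below. Reading the encoder as an approximation of the posterior and the decoder as the likelihood $p(x \mid z)$ is an assumption based on the figure labels, not something the slide states.

```latex
p(x, z) = p(z)\, p(x \mid z),
\qquad
p(x) = \int p(x \mid z)\, p(z)\, dz
```

Under this reading, generating a molecule amounts to sampling $z \sim p(z)$ and decoding it through $p(x \mid z)$.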
learn from synthesizable starting molecules.
[Figure: molecular space divided into synthesizable and non-synthesizable molecules]
Advantages:
- We explore molecules that are synthesizable from the start, which raises the chance of success.
- Fast and easy to train
Disadvantages:
- Local optima problem
- Limited novelty of molecules
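Biasing the training set can be sketched as a plain filtering step before training. The predicate `toy_is_synthesizable` and the set `KNOWN_SYNTHESIZABLE` are hypothetical stand-ins for whatever synthesizability check is actually used. They are not taken from the slides.

```python
from typing import Callable, Iterable, List

def filter_training_set(
    smiles: Iterable[str],
    is_synthesizable: Callable[[str], bool],
) -> List[str]:
    """Keep only the molecules that the given predicate accepts."""
    return [s for s in smiles if is_synthesizable(s)]

# Hypothetical lookup standing in for a real synthesizability check.
KNOWN_SYNTHESIZABLE = {"CCO", "CC(=O)O", "c1ccccc1"}

def toy_is_synthesizable(smiles: str) -> bool:
    return smiles in KNOWN_SYNTHESIZABLE

candidates = ["CCO", "C1=CC=C1", "c1ccccc1", "CC(=O)O"]
train = filter_training_set(candidates, toy_is_synthesizable)
print(train)
```

The generative model is then trained on `train` only, which is also why novelty is limited: the model never sees anything outside the filtered region.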
model by teaching it what is synthesizable and what is not.
Example: Change the loss function to incorporate a synthesizability loss, then backpropagate the combined loss.
Models that utilize rewards (reinforcement learning) may consider these reward signals as additional constraints (e.g., REINVENT).
Advantages:
- easy to implement
Disadvantages:
- difficult to train
- heuristic may be naive
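A minimal sketch of adding a heuristic synthesizability term to a loss, assuming a weighted sum. The function names, the weight, and the length cap are all illustrative assumptions. As written the penalty acts on discrete SMILES strings and is not differentiable, so in practice it would serve as a reward signal or be replaced by a differentiable proxy.

```python
from typing import Callable, Sequence

def smiles_length_penalty(smiles: str, max_len: int = 40) -> float:
    """Naive heuristic: penalty in [0, 1] that grows with SMILES length."""
    return min(len(smiles) / max_len, 1.0)

def combined_loss(
    base_loss: float,
    smiles_batch: Sequence[str],
    weight: float = 0.5,
    penalty: Callable[[str], float] = smiles_length_penalty,
) -> float:
    """Base loss plus a weighted mean synthesizability penalty (assumed form)."""
    mean_penalty = sum(penalty(s) for s in smiles_batch) / len(smiles_batch)
    return base_loss + weight * mean_penalty

# One short and one long (40-character) SMILES string.
loss = combined_loss(1.0, ["CCO", "C" * 40])
print(loss)
```

Swapping `penalty` for another heuristic changes what the model is pushed toward, which is where the "heuristic may be naive" caveat comes from.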
model by teaching it what is synthesizable and what is not.
Keep the loss function the same, but replace the heuristic with a proxy/oracle (computer-aided synthesis planning, CASP).
Advantages:
- easy to implement
- less likely to fall into local optima
Disadvantages:
- accuracy depends on the quality of the CASP tool
- computationally expensive
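Because each oracle call is expensive, one common mitigation is to cache results per molecule. The sketch below assumes a binary "route found" oracle. `toy_casp_finds_route` is a hypothetical placeholder, not the API of any real CASP tool.

```python
from functools import lru_cache

# Hypothetical set of molecules for which the stand-in planner "finds" a route.
_TOY_SOLVED = {"CCO", "CC(=O)NCC"}
casp_calls = 0  # counts how often the expensive planner is actually invoked

def toy_casp_finds_route(smiles: str) -> bool:
    """Placeholder for an expensive synthesis-planning call."""
    global casp_calls
    casp_calls += 1
    return smiles in _TOY_SOLVED

@lru_cache(maxsize=None)
def oracle_score(smiles: str) -> float:
    """1.0 if the planner finds a route, else 0.0; cached per SMILES string."""
    return 1.0 if toy_casp_finds_route(smiles) else 0.0

scores = [oracle_score(s) for s in ["CCO", "C1=CC=C1", "CCO"]]
print(scores, casp_calls)
```

The repeated query for `"CCO"` is served from the cache, so the planner runs only twice for three requests.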
purchasable or accessible easily
[Figure: molecular space showing synthesizable molecules, non-synthesizable molecules, and purchasable compounds]
Advantages:
- Multi-step synthesizability can be considered
- Generated molecules are guaranteed to be more accessible
- Most purchasable compounds are public knowledge
Disadvantages:
- Search space becomes smaller
- Computationally more expensive
- More dependencies
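One way to picture the explicit constraint is to require that every leaf of a multi-step synthesis route is a purchasable compound. The `RouteNode` structure and the tiny `purchasable` catalogue are illustrative assumptions, not a format used by any specific tool.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class RouteNode:
    """A molecule plus the precursors used to make it (empty for a leaf)."""
    smiles: str
    precursors: List["RouteNode"] = field(default_factory=list)

def route_is_grounded(node: RouteNode, purchasable: Set[str]) -> bool:
    """True if every leaf of the route is in the purchasable set."""
    if not node.precursors:
        return node.smiles in purchasable
    return all(route_is_grounded(p, purchasable) for p in node.precursors)

# Hypothetical catalogue: acetic acid and ethylamine.
purchasable = {"CC(=O)O", "NCC"}

good_route = RouteNode("CC(=O)NCC", [RouteNode("CC(=O)O"), RouteNode("NCC")])
bad_route = RouteNode("CC(=O)NCC", [RouteNode("CC(=O)O"), RouteNode("NC1CC1")])

print(route_is_grounded(good_route, purchasable))
print(route_is_grounded(bad_route, purchasable))
```

The recursion handles routes of any depth, which is how multi-step synthesizability enters the constraint.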
search space, leading to fewer chances of discovering more novel molecules
Novelty (exploration) vs. synthesizability (exploitation)
The more novel a molecule gets, the more difficult it is to synthesize. These molecules tend to be unrealistic (no viable synthesis pathway, or too expensive to synthesize).
Novelty (exploration):
- Discover an unknown molecule from a large chemical search space
- Identify a retrosynthetic pathway for the novel compound
- Training for novelty requires an equally powerful retrosynthesis model
Synthesizability (exploitation):
- Predict the next molecule from a synthesizable compound
- Perform combinatorial matching for new compound discovery
- Training for synthesizability requires an equally powerful forward reaction prediction model
The Compromise? Exploitation vs. Exploration
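The compromise can be illustrated with a single weight that trades novelty against synthesizability. This linear form, the parameter `alpha`, and the candidate scores are assumptions made for illustration. The slides do not prescribe a particular objective.

```python
from typing import Dict, Tuple

def tradeoff_score(novelty: float, synthesizability: float, alpha: float = 0.5) -> float:
    """Weighted sum: alpha = 1 is pure exploration, alpha = 0 pure exploitation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * novelty + (1.0 - alpha) * synthesizability

# Hypothetical (novelty, synthesizability) scores, both in [0, 1].
candidates: Dict[str, Tuple[float, float]] = {
    "A": (0.9, 0.2),  # very novel, hard to make
    "B": (0.4, 0.8),  # familiar, easy to make
    "C": (0.6, 0.6),  # in between
}

def best_candidate(alpha: float) -> str:
    """Name of the candidate with the highest trade-off score."""
    return max(candidates, key=lambda name: tradeoff_score(*candidates[name], alpha))

print(best_candidate(0.2), best_candidate(0.8))
```

Shifting `alpha` moves the preferred candidate from the easy-to-make molecule to the novel one, which is the exploitation vs. exploration tension in miniature.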
an equally powerful retrosynthesis model
- Retrosynthesis prediction has room for error
- Training a retrosynthesis model is challenging
- Requires reaction prediction data
- Limited to datasets like USPTO or proprietary datasets like Reaxys
- These datasets are sometimes inaccurate
[Figure: Example of the AiZynthFinder GUI]
Framework
- In its simplest form, a template is applied to certain reactants to obtain the resulting product.
- More complex forward prediction models require large training data
- Datasets are similar to those used for retrosynthesis and share the same problems
- Ideal conditions and mixture ratios are often absent
- Some models like RexGen are publicly available, but model training is not reproducible
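The simplest template-based case can be illustrated with a toy, string-level amide coupling. This is only a sketch: it pattern-matches SMILES text rather than molecular graphs, and `apply_amide_template` is a hypothetical helper. A real implementation would more likely express the template as reaction SMARTS in a cheminformatics toolkit.

```python
from typing import Optional

def apply_amide_template(acid: str, amine: str) -> Optional[str]:
    """Toy template: R-C(=O)O + N-R' -> R-C(=O)N-R'.

    Works only when the acid SMILES ends with "C(=O)O" and the amine SMILES
    starts with "N". Returns None when the template does not match.
    """
    if not acid.endswith("C(=O)O") or not amine.startswith("N"):
        return None
    # Drop the hydroxyl oxygen and attach the amine nitrogen in its place.
    return acid[:-1] + amine

product = apply_amide_template("CC(=O)O", "NCC")  # acetic acid + ethylamine
no_match = apply_amide_template("CCO", "NCC")     # ethanol is not an acid

print(product, no_match)
```

Chaining several such template applications, each starting from the previous product, is what multi-step forward generation amounts to in this simplified picture.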
they come with several caveats:
- they restrict the search space considerably
- they rely on forward reaction prediction models for multi-step generation
The advantages are:
- more realistic
- starting compounds are more likely to be procurable
Generating novel de novo molecules is useful for discovering new molecules, but it comes with several caveats:
- they are usually one-step generations (reactants -> end product)
- they rely on retrosynthesis for realistic applications
The advantages are:
- more room for creativity
- molecules may have better properties
Alternative considerations:
- Design better heuristic methods for assessing synthesizability
• SA Score measures the complexity of compounds using a fragment-contribution approach; rarer fragments are taken as an indication of lower synthesizability
• SC Score is a learned synthetic complexity score computed by a neural network trained on reaction data (Reaxys)
• SMILES length is a naive heuristic that treats longer SMILES strings as more synthetically difficult
We should consider alternatives or design better heuristics, answering the questions:
• What specifically contributes to synthetic complexity?
• How does one classify the synthesizability of compounds that are easily obtainable (off-the-shelf) but difficult to synthesize?
Synthesizability of Molecules Proposed by Generative Models,” J. Chem. Inf. Model., Apr. 2020.
J. Bradshaw, B. Paige, M. J. Kusner, M. Segler, and J. M. Hernández-Lobato, “A Model to Search for Synthesizable Molecules,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 7937–7949.
J. Bradshaw, B. Paige, M. J. Kusner, M. Segler, and J. M. Hernández-Lobato, “Barking up the right tree: an approach to search over molecule synthesis DAGs,” Advances in Neural Information Processing Systems, vol. 33, 2020.
K. Korovina et al., “ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations,” in International Conference on Artificial Intelligence and Statistics, Jun. 2020, pp. 3393–3403.