Slide 1

Slide 1 text

Benchmarking Deployed Generative Models on Elix Discovery
Elix, Inc. | Vincent Richard, Jun Jin Choong, Ph.D.
Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 24th, 2023

Slide 2

Slide 2 text

2 Introduction

Slide 3

Slide 3 text

3 General drug discovery flow in machine learning
Datasets: ZINC, ChEMBL, MOSES, …
Representations: SMILES strings such as O=C(NCCCn1ccnc1)c1cccs1
Models: VAE, GAN, RNN, …
Evaluation: how do we assess the performance of a molecular generative model?
Existing trained models: weights shared by the authors, or an internal existing model.
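
To make the representation step concrete, here is a minimal sketch, assuming RDKit (the slide does not prescribe a specific toolkit), of parsing the SMILES string shown above before it is handed to any of the listed models.

```python
# Minimal sketch: parsing the SMILES representation shown on this slide with RDKit.
# RDKit is an assumption here; any cheminformatics toolkit would do.
from rdkit import Chem

smiles = "O=C(NCCCn1ccnc1)c1cccs1"
mol = Chem.MolFromSmiles(smiles)      # returns None if the string is not valid SMILES
if mol is not None:
    print(Chem.MolToSmiles(mol))      # canonical SMILES
    print(mol.GetNumHeavyAtoms())     # quick sanity check on the parsed molecular graph
```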

Slide 4

Slide 4 text

4 Production environment difficulties
Model 1 (dataset: ZINC)
Model 2 (dataset: ChEMBL with custom filters)
Model 3 (dataset: ZINC + ChEMBL)
How to evaluate models trained with various datasets? Is it possible to have a fair evaluation?

Slide 5

Slide 5 text

5 Current Benchmarking Solutions

Slide 6

Slide 6 text

6 The Current State of Evaluation Metrics for Generative Models

Distribution metrics (from MOSES [1]):
- FCD (Fréchet ChemNet Distance)
- SNN (Similarity to Nearest Neighbor)
- Scaffold similarity
- Fragment similarity
- Validity
- Uniqueness
- Novelty
- Filters (% of molecules passing the MOSES-defined SMARTS filters)
- IntDiv (internal diversity; can detect mode collapse of generative models)

Oracle-based metrics (from TDC [2] and GuacaMol [3]): molecule generation with a desired property in mind.
- Docking score
- ML-based scores (DRD2, JNK3, GSK3B)
- Similarity to another molecule
- Rediscovery
- Isomer identification
- Property optimization (LogP, QED, SA)
- Scaffold hops
- …
(Diagram: generation loop with an oracle, Generate -> Oracle -> Feedback, next to the learned training distribution.)

[1] Polykovskiy et al. "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models" https://arxiv.org/abs/1811.12823
[2] Huang et al. "Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development" https://arxiv.org/abs/2102.09548
[3] Brown et al. "GuacaMol: Benchmarking Models for De Novo Molecular Design" https://arxiv.org/abs/1811.09621
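
As an illustration of the distribution metrics listed above, the following is a hedged, simplified re-implementation of validity, uniqueness, novelty and IntDiv. It assumes RDKit; it is not the official MOSES code, whose exact definitions (e.g. of IntDiv) may differ in detail.

```python
# Hedged sketch of a few MOSES-style distribution metrics. All names are assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(mol):
    # ECFP4 corresponds to a Morgan fingerprint with radius 2
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def distribution_metrics(generated_smiles, training_smiles):
    mols = [Chem.MolFromSmiles(s) for s in generated_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    validity = len(valid) / len(generated_smiles)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    train = {Chem.MolToSmiles(m)
             for m in (Chem.MolFromSmiles(s) for s in training_smiles)
             if m is not None}
    novelty = sum(s not in train for s in unique) / len(unique) if unique else 0.0
    # Internal diversity: 1 - mean pairwise Tanimoto similarity (low values hint at mode collapse)
    fps = [ecfp4(Chem.MolFromSmiles(s)) for s in unique]
    sims = []
    for i in range(len(fps)):
        sims.extend(DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:]))
    int_div = (1.0 - sum(sims) / len(sims)) if sims else 0.0
    return {"validity": validity, "uniqueness": uniqueness,
            "novelty": novelty, "IntDiv": int_div}
```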

Slide 7

Slide 7 text

7 Internal results

benchmark metrics        | model 1        | model 2        | model 3        | model 4
albuterol_similarity     | 0.5 ± 0.0065   | 0.638 ± 0.0328 | 0.857 ± 0.0077 | 0.598 ± 0.0052
amlodipine_mpo           | 0.506 ± 0.0086 | 0.546 ± 0.0056 | 0.62 ± 0.0071  | 0.492 ± 0.0044
celecoxib_rediscovery    | 0.546 ± 0.0959 | 0.626 ± 0.0365 | 0.841 ± 0.0107 | 0.321 ± 0.0041
deco_hop                 | 0.632 ± 0.0553 | 0.631 ± 0.0093 | 0.729 ± 0.0612 | 0.586 ± 0.0023
drd2                     | 0.948 ± 0.0224 | 0.973 ± 0.0047 | 0.978 ± 0.005  | 0.847 ± 0.0321
fexofenadine_mpo         | 0.678 ± 0.0072 | 0.716 ± 0.0114 | 0.801 ± 0.0059 | 0.695 ± 0.0037
gsk3b                    | 0.779 ± 0.0693 | 0.784 ± 0.0142 | 0.877 ± 0.0197 | 0.586 ± 0.0332
isomers_c7h8n2o2         | 0.834 ± 0.0421 | 0.794 ± 0.022  | 0.898 ± 0.0112 | 0.481 ± 0.0741
isomers_c9h10n2o2pf2cl   | 0.591 ± 0.0489 | 0.644 ± 0.0102 | 0.745 ± 0.0069 | 0.524 ± 0.0149
jnk3                     | 0.387 ± 0.0315 | 0.439 ± 0.0298 | 0.573 ± 0.0455 | 0.361 ± 0.0262
median1                  | 0.183 ± 0.0073 | 0.268 ± 0.0064 | 0.338 ± 0.0038 | 0.198 ± 0.0045
median2                  | 0.219 ± 0.0148 | 0.221 ± 0.0035 | 0.28 ± 0.0093  | 0.165 ± 0.0027
mestranol_similarity     | 0.367 ± 0.0189 | 0.618 ± 0.0374 | 0.835 ± 0.0105 | 0.367 ± 0.004
osimertinib_mpo          | 0.796 ± 0.0111 | 0.795 ± 0.0027 | 0.825 ± 0.0023 | 0.756 ± 0.0075
perindopril_mpo          | 0.444 ± 0.0128 | 0.471 ± 0.0049 | 0.538 ± 0.0181 | 0.463 ± 0.0056
qed                      | 0.936 ± 0.0021 | 0.937 ± 0.0013 | 0.939 ± 0.0006 | 0.931 ± 0.0038
ranolazine_mpo           | 0.445 ± 0.0213 | 0.734 ± 0.0036 | 0.785 ± 0.0017 | 0.714 ± 0.0053
scaffold_hop             | 0.498 ± 0.0047 | 0.475 ± 0.0031 | 0.498 ± 0.0032 | 0.464 ± 0.0019
sitagliptin_mpo          | 0.281 ± 0.0198 | 0.297 ± 0.0169 | 0.371 ± 0.0295 | 0.217 ± 0.0185
thiothixene_rediscovery  | 0.368 ± 0.0115 | 0.388 ± 0.0094 | 0.605 ± 0.0471 | 0.29 ± 0.0068
troglitazone_rediscovery | 0.26 ± 0.007   | 0.301 ± 0.0083 | 0.5 ± 0.041    | 0.234 ± 0.0075
valsartan_smarts         | 0.033 ± 0.0653 | 0.0 ± 0.0      | 0.0 ± 0.0      | 0.0 ± 0.0002
zaleplon_mpo             | 0.484 ± 0.0125 | 0.444 ± 0.0054 | 0.49 ± 0.0038  | 0.412 ± 0.0055
Sum                      | 11.715         | 12.74          | 14.9           | 10.7

Mol opt benchmark: focuses on sample efficiency, restricted to 10,000 oracle calls. The core metric is the area under the curve (AUC) of the top-10 average score.
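
The note above refers to the AUC of the top-10 average score under a 10,000-call budget. The sketch below only illustrates that idea (rewarding models that find good molecules early); the exact normalization used by the published mol_opt benchmark may differ.

```python
# Hedged sketch: AUC of the top-10 average score under a fixed oracle-call budget.
def top_k_auc(scores_in_call_order, k=10, budget=10_000):
    """scores_in_call_order: oracle scores in the order the oracle calls were made."""
    best = []                      # running list of the k best scores seen so far
    running_top_k_means = []
    for score in scores_in_call_order[:budget]:
        best.append(score)
        best = sorted(best, reverse=True)[:k]
        running_top_k_means.append(sum(best) / len(best))
    # Pad with the final value if the model stopped before exhausting the budget
    if running_top_k_means:
        running_top_k_means += [running_top_k_means[-1]] * (budget - len(running_top_k_means))
    # Normalized area under the (call index, top-k mean) curve; stays in [0, 1] for scores in [0, 1]
    return sum(running_top_k_means) / budget
```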

Slide 8

Slide 8 text

8 Limitations of current benchmarks
Current limitations:
- Distribution metrics: easily beatable
They can be fooled easily and most models achieve high scores on them; they remain useful for debugging while training a model [1].
[1] Renz et al. "On failure modes in molecule generation and optimization" https://www.sciencedirect.com/science/article/pii/S1740674920300159

Slide 9

Slide 9 text

9 Limitations of current benchmarks
Current limitations:
- Distribution metrics: easily beatable
- Oracle tasks and training data
Certain tasks can be addressed by the training dataset alone: the model learns the dataset's distribution, but novelty is not always guaranteed. A molecule is considered novel if its nearest neighbor in the training set has a similarity below 0.4 (ECFP4-based) [1].
[1] Franco et al. "The use of 2D fingerprint methods to support the assessment of structural similarity in orphan drug legislation" https://jcheminf.biomedcentral.com/counter/pdf/10.1186/1758-2946-6-5.pdf
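
The 0.4 ECFP4 novelty rule quoted above can be checked as in the following minimal sketch; RDKit, the 2048-bit fingerprint size, and the helper names are assumptions.

```python
# Hedged sketch of the novelty rule: a molecule is treated as novel if its nearest
# training-set neighbor has Tanimoto similarity < 0.4 on ECFP4 fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def is_novel(candidate_smiles, training_fps, threshold=0.4):
    fp = ecfp4(candidate_smiles)
    if fp is None:
        return False                      # an invalid SMILES cannot count as novel
    nearest = max(DataStructs.BulkTanimotoSimilarity(fp, training_fps))
    return nearest < threshold

# Usage: precompute the training-set fingerprints once, then screen candidates, e.g.
# training_fps = [fp for fp in (ecfp4(s) for s in training_smiles) if fp is not None]
```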

Slide 10

Slide 10 text

10 Oracle task bias by the training data
(Figure: current oracles vs. real-case scenarios)

Slide 11

Slide 11 text

11 Limitations of current benchmarks
Current limitations:
- Distribution metrics: easily beatable
- Oracle tasks and training data
- Oracle tasks do not focus on synthesizable molecules
Focusing only on the objective is insufficient: a molecule has no value if it cannot be made. The model can be biased toward synthesizable molecules through an oracle [1].
[1] Gao et al. "The Synthesizability of Molecules Proposed by Generative Models" https://pubs.acs.org/doi/10.1021/acs.jcim.0c00174
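
One simple way to bias or filter toward synthesizable molecules, as suggested above, is a synthetic accessibility (SA) score cutoff. The sketch below assumes the SA score implementation shipped in RDKit's Contrib directory and uses the SA < 4 cutoff that appears later in this deck; it is illustrative, not the deck's actual oracle.

```python
# Hedged sketch: an SA-score synthesizability filter using RDKit's Contrib sascorer.
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # RDKit Contrib module (Ertl & Schuffenhauer SA score)

def passes_sa_filter(smiles, max_sa=4.0):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    # SA score ranges roughly from 1 (easy to make) to 10 (hard to make)
    return sascorer.calculateScore(mol) < max_sa
```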

Slide 12

Slide 12 text

12 Proposed Solution

Slide 13

Slide 13 text

13 Proposed solution
How to evaluate models trained with various datasets? Is it possible to have a fair evaluation?
One solution is to always evaluate on novelty.
(Figure: objective set of the first model vs. objective set of the second model)
Advantages: tracks the ability of the model to generate out-of-distribution molecules.
Disadvantages: tasks are not equally difficult for each model; there is a need for diversity.

Slide 14

Slide 14 text

14 Proposed solution: the task
Docking [1]:
- Difficult, and reflects a real-case scenario
- Each target has a unique objective chemical space
- Can be extended to fragment-based optimization
[1] Cieplinski et al. "We Should at Least Be Able to Design Molecules That Dock Well" https://arxiv.org/abs/2006.16955
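
For readers who want to see what a docking oracle looks like in practice, here is a heavily hedged sketch built on Open Babel (ligand preparation) and AutoDock Vina (docking). The receptor file, box parameters, tool choice, and the assumption that both command-line tools are installed are all placeholders, not the setup behind the results in this deck.

```python
# Hedged sketch of a docking oracle via external tools; all paths/parameters are placeholders.
import os
import subprocess
import tempfile

def docking_score(smiles, receptor_pdbqt, center, size=(20, 20, 20)):
    with tempfile.TemporaryDirectory() as tmp:
        ligand = os.path.join(tmp, "ligand.pdbqt")
        out = os.path.join(tmp, "docked.pdbqt")
        # 3D ligand preparation with Open Babel (protonation at pH 7.4, embedded 3D coords)
        subprocess.run(["obabel", f"-:{smiles}", "-O", ligand, "--gen3d", "-p", "7.4"],
                       check=True)
        # Docking with AutoDock Vina inside the given search box
        subprocess.run(["vina", "--receptor", receptor_pdbqt, "--ligand", ligand,
                        "--center_x", str(center[0]), "--center_y", str(center[1]),
                        "--center_z", str(center[2]),
                        "--size_x", str(size[0]), "--size_y", str(size[1]),
                        "--size_z", str(size[2]),
                        "--out", out], check=True)
        # Best pose affinity is reported in the output as a "REMARK VINA RESULT" line
        with open(out) as fh:
            for line in fh:
                if line.startswith("REMARK VINA RESULT"):
                    return float(line.split()[3])  # kcal/mol; more negative is better
    return None
```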

Slide 15

Slide 15 text

15 Proposed solution
Oracle task (5k and 10k oracle calls):
- Docking score
- Drug-like objective: MW < 600, passes a set of chemical rules, rotatable bonds < 10, …
- Synthesizability objective: add a Synthetic Accessibility (SA) score filter, SA < 4
Evaluation:
- Filter on the drug-like objective
- Filter on SA scores
- Filter out molecules similar to the training set
- Compute the evaluation metric: AUC of the top-5% docking scores of novel molecules
- Distribution metrics as a sanity check
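
Putting the filters on this slide together, a rough sketch of the evaluation chain could look like the following. It reuses the hypothetical is_novel and passes_sa_filter helpers from the earlier sketches, applies only the two drug-likeness rules stated on the slide, and aggregates the top 5% docking scores at the end rather than computing the full AUC over oracle calls (that part is shown in the earlier top-k sketch).

```python
# Hedged sketch of the evaluation described above: keep drug-like, synthesizable,
# novel molecules, then score the best 5% by docking. Input format and helper names
# (is_novel, passes_sa_filter from earlier sketches) are assumptions.
from rdkit import Chem
from rdkit.Chem import Descriptors

def is_drug_like(mol):
    # Only the two rules stated on the slide; the broader "set of chemical rules" is omitted here.
    return Descriptors.MolWt(mol) < 600 and Descriptors.NumRotatableBonds(mol) < 10

def evaluate(generated, training_fps, top_fraction=0.05):
    """generated: list of (smiles, docking_score) pairs collected during the oracle run."""
    kept = []
    for smiles, score in generated:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None or not is_drug_like(mol):
            continue
        if not passes_sa_filter(smiles):            # SA < 4
            continue
        if not is_novel(smiles, training_fps):      # nearest-neighbor similarity < 0.4
            continue
        kept.append(score)
    if not kept:
        return None
    # Lower (more negative) docking scores are better, so keep the smallest 5%.
    kept.sort()
    top = kept[:max(1, int(len(kept) * top_fraction))]
    return sum(top) / len(top)
```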

Slide 16

Slide 16 text

16 Conclusion
● We detailed the limitations of existing benchmarks in the literature.
● We shared a new benchmark direction that reflects real-case scenarios and addresses the issue of evaluating models in production.
● In particular, with the Elix benchmark, every high-performing model can be trusted in real-case scenarios, which was not the case with existing benchmarks.
Future directions:
● Extend the benchmark to other drug discovery tasks not covered in the current literature, especially lead optimization tasks.

Slide 17

Slide 17 text

17 Thank you for your attention. Q & A

Slide 18

Slide 18 text

www.elix-inc.com

Slide 19

Slide 19 text

19 Appendix

Slide 20

Slide 20 text

20 Oracle task bias by the training data
From our previous results with various models (figure: similarity distributions with the novelty threshold marked).
In real scenarios we are looking for novel molecules, yet we observed that no model produces novel (enough) molecules. All similarities are computed on ECFP4-based fingerprints with Tanimoto similarity.

Slide 21

Slide 21 text

21 Explanation of the benefit of novelty for evaluation
How to evaluate those models? Is it possible to have a fair evaluation with various training data?

Current Benchmark:
            | Same training data               | Various training data
Good result | Checking training data is needed | Can't be trusted
Poor result | Can be trusted                   | Can be trusted

Elix Benchmark:
            | Same training data               | Various training data
Good result | Can be trusted                   | Can be trusted
Poor result | Checking training data is needed | Can be trusted

Slide 22

Slide 22 text

22 Sample Efficiency Matters [1]
Generation process with an oracle: Generate -> Oracle -> Feedback.

Oracle type                                                     | Speed | Accuracy
Descriptors                                                     | +++   | ++
Machine learning models (activity, ADME, property, toxicity, …) | ++    | ~
Docking [2]                                                     | –     | +++
Human in the loop [3]                                           | – –   | ++++
Others                                                          | …     | …

[1] Gao et al. "Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization" https://arxiv.org/abs/2206.12411
[2] Cieplinski et al. "We Should at Least Be Able to Design Molecules That Dock Well" https://arxiv.org/abs/2006.16955
[3] Sundin et al. "Human-in-the-loop assisted de novo molecular design" https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00667-8