
Benchmarking Deployed Generative Models on Elix Discovery, Elix, CBI 2023

Elix
October 24, 2023

Transcript

  1. Benchmarking Deployed Generative Models on Elix Discovery
     Elix, Inc. Vincent Richard Jun Jin Choong, Ph.D.
     Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 24th, 2023
  2. General drug discovery flow in machine learning
     - Datasets: ZINC, CHEMBL, MOSES, …
     - Representations: e.g. the SMILES string O=C(NCCCn1ccnc1)c1cccs1
     - Models: VAE, GAN, RNN, … (existing trained models with weights shared by the authors, or internal existing models)
     - Evaluation: how do we assess the performance of a molecular generative model?
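The SMILES string above is the text representation that most of these models consume, and parsing it back is also what the validity and uniqueness checks used later rely on. A minimal sketch, assuming RDKit is available (RDKit is not named on the slide):

```python
# Minimal sketch (assumes RDKit is installed): parse and canonicalize the SMILES
# representation shown on the slide. MolFromSmiles returns None for invalid SMILES,
# which is what a Validity metric counts; canonical SMILES support uniqueness checks.
from rdkit import Chem

smiles = "O=C(NCCCn1ccnc1)c1cccs1"
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    print(Chem.MolToSmiles(mol))  # canonical form of the same molecule
```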
  3. Production environment difficulties
     - Model 1, dataset: ZINC
     - Model 2, dataset: CHEMBL with custom filters
     - Model 3, dataset: ZINC + CHEMBL
     How to evaluate models trained with various datasets? Is it possible to have a fair evaluation?
  4. The Current State of Evaluation Metrics for Generative Models
     Distribution metrics (from MOSES [1]): compare the learned distribution of generated molecules against a reference set.
     - FCD (Fréchet ChemNet Distance)
     - SNN (Similarity to Nearest Neighbor)
     - Scaffold similarity
     - Validity
     - Uniqueness
     - Filters (% of molecules passing the MOSES-defined SMARTS filters)
     - Novelty
     - IntDiv (can detect mode collapse of generative models)
     - Fragment similarity
     Oracle-based metrics (from TDC [2] and GuacaMol [3]): molecule generation with a desired property in mind, driven by a generate/feedback loop with an oracle.
     - Docking score
     - ML-based scores (DRD2, JNK3, GSK3B)
     - Similarity to other molecules
     - Rediscovery
     - Isomer identification
     - Property optimization (LogP, QED, SA)
     - Scaffold hops
     - …
     [1] Polykovskiy et al. "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models" https://arxiv.org/abs/1811.12823
     [2] Huang et al. "Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development" https://arxiv.org/abs/2102.09548
     [3] Brown et al. "GuacaMol: Benchmarking Models for De Novo Molecular Design" https://arxiv.org/abs/1811.09621
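To make the two families concrete, here is a hedged sketch of how they are commonly computed, assuming the MOSES (molsets) and PyTDC packages are installed; the slide does not prescribe these libraries, and exact signatures may vary by version:

```python
# Illustrative only: distribution metrics via MOSES and an oracle-based score via TDC.
import moses
from tdc import Oracle

generated = ["O=C(NCCCn1ccnc1)c1cccs1", "CCO", "c1ccccc1"]  # toy sample; use thousands in practice

# Distribution metrics (MOSES): validity, uniqueness, novelty, FCD, SNN, IntDiv, ...
distribution_metrics = moses.get_all_metrics(generated)

# Oracle-based metric (TDC): e.g. a machine-learned DRD2 activity score per molecule
drd2_oracle = Oracle(name="DRD2")
oracle_scores = drd2_oracle(generated)

print(distribution_metrics)
print(oracle_scores)
```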
  5. Internal benchmark results
     | Metric                    | model 1        | model 2        | model 3        | model 4        |
     |---------------------------|----------------|----------------|----------------|----------------|
     | albuterol_similarity      | 0.5 ± 0.0065   | 0.638 ± 0.0328 | 0.857 ± 0.0077 | 0.598 ± 0.0052 |
     | amlodipine_mpo            | 0.506 ± 0.0086 | 0.546 ± 0.0056 | 0.62 ± 0.0071  | 0.492 ± 0.0044 |
     | celecoxib_rediscovery     | 0.546 ± 0.0959 | 0.626 ± 0.0365 | 0.841 ± 0.0107 | 0.321 ± 0.0041 |
     | deco_hop                  | 0.632 ± 0.0553 | 0.631 ± 0.0093 | 0.729 ± 0.0612 | 0.586 ± 0.0023 |
     | drd2                      | 0.948 ± 0.0224 | 0.973 ± 0.0047 | 0.978 ± 0.005  | 0.847 ± 0.0321 |
     | fexofenadine_mpo          | 0.678 ± 0.0072 | 0.716 ± 0.0114 | 0.801 ± 0.0059 | 0.695 ± 0.0037 |
     | gsk3b                     | 0.779 ± 0.0693 | 0.784 ± 0.0142 | 0.877 ± 0.0197 | 0.586 ± 0.0332 |
     | isomers_c7h8n2o2          | 0.834 ± 0.0421 | 0.794 ± 0.022  | 0.898 ± 0.0112 | 0.481 ± 0.0741 |
     | isomers_c9h10n2o2pf2cl    | 0.591 ± 0.0489 | 0.644 ± 0.0102 | 0.745 ± 0.0069 | 0.524 ± 0.0149 |
     | jnk3                      | 0.387 ± 0.0315 | 0.439 ± 0.0298 | 0.573 ± 0.0455 | 0.361 ± 0.0262 |
     | median1                   | 0.183 ± 0.0073 | 0.268 ± 0.0064 | 0.338 ± 0.0038 | 0.198 ± 0.0045 |
     | median2                   | 0.219 ± 0.0148 | 0.221 ± 0.0035 | 0.28 ± 0.0093  | 0.165 ± 0.0027 |
     | mestranol_similarity      | 0.367 ± 0.0189 | 0.618 ± 0.0374 | 0.835 ± 0.0105 | 0.367 ± 0.004  |
     | osimertinib_mpo           | 0.796 ± 0.0111 | 0.795 ± 0.0027 | 0.825 ± 0.0023 | 0.756 ± 0.0075 |
     | perindopril_mpo           | 0.444 ± 0.0128 | 0.471 ± 0.0049 | 0.538 ± 0.0181 | 0.463 ± 0.0056 |
     | qed                       | 0.936 ± 0.0021 | 0.937 ± 0.0013 | 0.939 ± 0.0006 | 0.931 ± 0.0038 |
     | ranolazine_mpo            | 0.445 ± 0.0213 | 0.734 ± 0.0036 | 0.785 ± 0.0017 | 0.714 ± 0.0053 |
     | scaffold_hop              | 0.498 ± 0.0047 | 0.475 ± 0.0031 | 0.498 ± 0.0032 | 0.464 ± 0.0019 |
     | sitagliptin_mpo           | 0.281 ± 0.0198 | 0.297 ± 0.0169 | 0.371 ± 0.0295 | 0.217 ± 0.0185 |
     | thiothixene_rediscovery   | 0.368 ± 0.0115 | 0.388 ± 0.0094 | 0.605 ± 0.0471 | 0.29 ± 0.0068  |
     | troglitazone_rediscovery  | 0.26 ± 0.007   | 0.301 ± 0.0083 | 0.5 ± 0.041    | 0.234 ± 0.0075 |
     | valsartan_smarts          | 0.033 ± 0.0653 | 0.0 ± 0.0      | 0.0 ± 0.0      | 0.0 ± 0.0002   |
     | zaleplon_mpo              | 0.484 ± 0.0125 | 0.444 ± 0.0054 | 0.49 ± 0.0038  | 0.412 ± 0.0055 |
     | Sum                       | 11.715         | 12.74          | 14.9           | 10.7           |
     Mol opt benchmark: focuses on sample efficiency, restricted to 10,000 oracle calls. The core metric is the area under the curve (AUC) of the top-10 average score.
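The "AUC of the top-10 average score" can be read as: after every oracle call, take the mean of the 10 best scores seen so far, then average that running mean over the whole call budget. A minimal sketch of that computation in plain Python (not the benchmark's official implementation):

```python
# Minimal sketch of the "AUC of the top-10 average score" metric:
# after each oracle call, average the 10 best scores seen so far,
# then average that running value over the full call budget.
import heapq

def top10_auc(scores, budget=10_000):
    """scores: oracle values in the order the calls were made (at most `budget` of them)."""
    best10, running = [], []
    for s in scores[:budget]:
        heapq.heappush(best10, s)
        if len(best10) > 10:
            heapq.heappop(best10)              # keep only the 10 best so far
        running.append(sum(best10) / len(best10))
    running += [running[-1]] * (budget - len(running))  # unused budget keeps the last value
    return sum(running) / budget               # scores in [0, 1] give an AUC in [0, 1]

print(top10_auc([0.2, 0.5, 0.9, 0.7, 0.8] * 2000))
```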
  6. Limitations of the current benchmarks
     Distribution metrics are easily beatable: they can be fooled by very simple approaches, and most models score highly [1]. They remain useful for debugging while training a model.
     [1] Renz et al. "On failure modes in molecule generation and optimization" https://www.sciencedirect.com/science/article/pii/S1740674920300159
  7. Limitations of the current benchmarks
     Oracle tasks and training data: certain tasks can be solved by the training dataset alone. The model learns the dataset's distribution, but novelty is not always guaranteed. Here a molecule is considered novel if its nearest neighbor in the training set has a similarity below 0.4 (ECFP4-based) [1]; see the sketch below.
     [1] Franco et al. "The use of 2D fingerprint methods to support the assessment of structural similarity in orphan drug legislation" https://jcheminf.biomedcentral.com/counter/pdf/10.1186/1758-2946-6-5.pdf
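As a concrete reading of this criterion, here is a hedged RDKit sketch (RDKit is an assumption, not named on the slide); ECFP4 corresponds to a Morgan fingerprint with radius 2:

```python
# Hedged sketch of the novelty criterion above, assuming RDKit:
# a generated molecule is counted as novel if its nearest training-set
# neighbour has an ECFP4 Tanimoto similarity below 0.4.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def is_novel(generated_smiles, training_smiles, threshold=0.4):
    gen_fp = ecfp4(generated_smiles)
    train_fps = [ecfp4(s) for s in training_smiles]
    nearest = max(DataStructs.BulkTanimotoSimilarity(gen_fp, train_fps))
    return nearest < threshold

print(is_novel("O=C(NCCCn1ccnc1)c1cccs1", ["CCO", "c1ccncc1"]))
```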
  8. Limitations of the current benchmarks
     Oracle tasks do not focus on synthesizable molecules: optimizing the objective alone is insufficient, since a molecule has no value if it cannot be made. One remedy is to bias the model toward synthesizable molecules through an oracle [1].
     [1] Gao et al. "The Synthesizability of Molecules Proposed by Generative Models" https://pubs.acs.org/doi/10.1021/acs.jcim.0c00174
  9. Proposed solution
     How to evaluate models trained with various datasets? Is it possible to have a fair evaluation? One solution is to always evaluate on novelty, scoring each model against its own objective set relative to its own training data.
     - Advantages: tracks the ability of the model to generate out-of-distribution molecules.
     - Disadvantages: tasks are not equally difficult for each model, and diversity is needed.
  10. Proposed solution: the task is docking [1]
     - Difficult and reflects a real-case scenario
     - Each target has a unique objective chemical space
     - Can be extended to fragment-based optimization
     [1] Cieplinski et al. "We Should at Least Be Able to Design Molecules That Dock Well" https://arxiv.org/abs/2006.16955
  11. Proposed solution: the oracle task and its evaluation
     Oracle task (5k and 10k oracle calls): docking score under constraints.
     - Drug-like objective: MW < 600, rotatable bonds < 10, passes a set of chemical rules, …
     - Synthesizability objective: add a Synthetic Accessibility score filter, SA < 4
     Evaluation (a sketch of the filtering step follows this slide):
     - Filter on the drug-like objective
     - Filter on SA score
     - Filter out molecules similar to the training set
     - Compute the AUC of the top-5% docking scores of the novel molecules
     - Use distribution metrics as a sanity check
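A hedged sketch of the filtering step above, assuming RDKit and its contrib SA_Score module (its location can differ by installation); the thresholds mirror the slide (MW < 600, rotatable bonds < 10, SA < 4) and the helper name is illustrative, not part of the benchmark code:

```python
# Hedged sketch of the drug-likeness and synthesizability filters.
import os, sys
from rdkit import Chem, RDConfig
from rdkit.Chem import Descriptors

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # Ertl & Schuffenhauer synthetic accessibility score (RDKit contrib)

def passes_filters(smiles, mw_max=600, rot_max=10, sa_max=4):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) < mw_max
            and Descriptors.NumRotatableBonds(mol) < rot_max
            and sascorer.calculateScore(mol) < sa_max)

# Molecules surviving these filters (and the novelty filter from the earlier slide)
# are the ones whose docking scores feed the top-5% AUC.
print(passes_filters("O=C(NCCCn1ccnc1)c1cccs1"))
```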
  12. Conclusion
     • We detailed the limitations of existing benchmarks in the literature.
     • We shared a new benchmark direction that reflects real-case scenarios and addresses the issue of evaluating models in production.
     • In particular, with the Elix benchmark every high-performing model can be trusted in real-case scenarios, which was not the case with existing benchmarks.
     Future directions:
     • Extend the benchmark to other drug discovery tasks not present in the current literature, especially lead optimization tasks.
  13. Oracle task bias from the training data
     From our previous results with various models and the novelty threshold: in real scenarios we are looking for novel molecules, yet we observed that no model produces novel (enough) molecules. All similarities are computed as Tanimoto similarity on ECFP4-based fingerprints.
  14. Explanation of the benefit of novelty for evaluation
     How to evaluate those models? Is it possible to have a fair evaluation with various training data?

     Current benchmark:
     |             | Same training data               | Various training data            |
     | Good result | Can't be trusted                 | Checking training data is needed |
     | Poor result | Can be trusted                   | Can be trusted                   |

     Elix benchmark:
     |             | Same training data               | Various training data            |
     | Good result | Can be trusted                   | Can be trusted                   |
     | Poor result | Checking training data is needed | Can be trusted                   |
  15. Sample efficiency matters [1]
     The generation process runs in a generate/feedback loop with an oracle. Typical oracle families:

     | Oracle                                                           | Speed | Accuracy |
     | Descriptors                                                      | +++   | ++       |
     | Machine learning models (activity, ADME, property, toxicity, …)  | ++    | ~        |
     | Docking [2]                                                      | –     | +++      |
     | Human in the loop [3]                                            | – –   | ++++     |
     | Others                                                           | …     | …        |

     [1] Gao et al. "Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization" https://arxiv.org/abs/2206.12411
     [2] Cieplinski et al. "We Should at Least Be Able to Design Molecules That Dock Well" https://arxiv.org/abs/2006.16955
     [3] Sundin et al. "Human-in-the-loop assisted de novo molecular design" https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00667-8