
Active Learning via Incremental Revelation: Dip...

Elix
October 26, 2022

Active Learning via Incremental Revelation: Dipeptidyl Peptidase-4 Inhibitors Case Study, Elix, CBI 2022



Transcript

  1. Active Learning via Incremental Revelation: Dipeptidyl Peptidase-4 Inhibitors Case Study

    Elix, Inc. | David Jimenez Barrero, Nazim Medzhidov | Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo, Japan | October 26th, 2022
  2. Introduction

    Active Learning is a machine learning approach in which:
    - The model(s) are constantly improved
    - Better accuracy is achieved with fewer samples than regular methods, by allowing the model to select which samples to learn from
    - Search spaces that are too large and/or expensive to evaluate exhaustively become tractable

    Challenges:
    - Experimenting with Active Learning in a real campaign is expensive
    - Experimenting in silico from pre-existing data is challenging since:
      - Data from a single campaign is biased towards a single region of the space
      - The search space has boundaries which the models can see, while drug discovery is unbounded
      - The models could "see" the rediscovery target molecule from the beginning, which is unrealistic and can be considered a bias
  3. Contributions

    • Propose a framework, Incremental Revelation, that helps us simulate in silico the conditions of drug discovery campaigns using pre-existing data
    • Use Incremental Revelation to test Active Learning for the rediscovery of the molecule with the highest inhibitory activity against DPP4 in the dataset (a Linagliptin derivative)
    • Use Incremental Revelation to test Active Learning for the rediscovery of a known drug in the dataset (Sitagliptin)

    Figure labels: Sitagliptin (Known Drug); Linagliptin Derivative (Highest Inh. Act. in Dataset)
  4. Incremental Revelation Algorithm

    Incremental Revelation forces a progressive exploration of the space around the molecules that have already been selected.
    - Inspired by molecular chemists: progressive exploration around encountered molecules.
    - The model cannot see the entire space; it can only see molecules that are similar to the ones it has already chosen.

    Incremental Revelation data organization (diagram labels): Full Set: DPP4 Inhibitors from ChEMBL (4161 molecules); Hidden Set; Explorable Set; Training Set (init. 50 mols); Selected Samples; Target Molecule; Repeat Until Convergence
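The revelation step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy fingerprints (plain sets of bit indices standing in for real molecular fingerprints), and the rule "reveal a hidden molecule once its Tanimoto similarity to any selected molecule reaches a threshold" are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical stand-in for a molecular fingerprint: the set of "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two bit sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def reveal(hidden: Set[str], selected: Set[str],
           fps: Dict[str, Fingerprint], threshold: float) -> Set[str]:
    """Return the hidden molecules similar enough to any selected molecule."""
    return {
        h for h in hidden
        if any(tanimoto(fps[h], fps[s]) >= threshold for s in selected)
    }


# Toy data: ids and fingerprints are made up for illustration.
fps = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3, 5}),  # close to A
    "C": frozenset({1, 2, 5, 6}),  # close to B, but not close enough to A
    "D": frozenset({7, 8, 9}),     # dissimilar to everything else
}
hidden = {"B", "C", "D"}
explorable = {"A"}

# Round 1: the model has selected A, so molecules similar to A are revealed.
newly = reveal(hidden, {"A"}, fps, threshold=0.5)
hidden -= newly
explorable |= newly

# Round 2: suppose B is selected next; C becomes visible only now.
newly_round2 = reveal(hidden, {"B"}, fps, threshold=0.5)
```

In this toy run, C is reachable only through B, which mirrors the idea that the model sees new regions of the space only by stepping through molecules it has already chosen, while D stays hidden throughout.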
  5. Results: Find the Highest Activity Molecule

    Overall Goal: Find the target molecule (Linagliptin derivative) with few molecules selected

    Target: Linagliptin Derivative (Highest Inh. Act. in Dataset, IC50: 0.05 nM)

    Observations:
    - Few molecules were required to find the target
    - At most 15% of the total data was needed, beginning from a cold start
  6. Additional Results: Known Drug (Sitagliptin) Rediscovery

    Task: Rediscover a known drug (Sitagliptin) using Active Learning.

    Challenge: Sitagliptin is NOT among the highest inhibitory activity molecules. 1045 molecules in the dataset have higher activity than Sitagliptin.

    To this end, we compared two experimental conditions:
    Experiment 1: Rediscover Sitagliptin by using a single model targeting pIC50.
    Experiment 2: Rediscover Sitagliptin by using multiple models targeting pIC50 AND lipophilicity (logD).
    - Low lipophilicity is an important discriminator in drug discovery campaigns
    - The score for molecules becomes the Lipophilic Efficiency (LiPE)
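As a rough illustration of the combined score, the sketch below assumes the commonly used definition of lipophilic efficiency, LiPE = pIC50 - logD, with pIC50 = -log10(IC50 in mol/L). The slides do not spell out the formula, so both the definition and the function names here are assumptions.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_nm * 1e-9)


def lipe(pic50: float, logd: float) -> float:
    """Lipophilic efficiency under the assumed definition pIC50 - logD."""
    return pic50 - logd


# 0.05 nM is the IC50 quoted for the Linagliptin derivative in this deck;
# the logD value below is a made-up placeholder, not a value from the slides.
target_pic50 = pic50_from_ic50_nm(0.05)
example_score = lipe(target_pic50, logd=1.0)
```

Under this definition, two molecules with the same potency are separated by their lipophilicity: the less lipophilic one receives the higher score.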
  7. Additional Results: Known Drug, Sitagliptin

    Figure label: Sitagliptin (Known Drug)

    Observations:
    - Combining pIC50 and logD reduced the number of molecules required to reach Sitagliptin
  8. Conclusions

    - We designed a framework, Incremental Revelation, for testing Active Learning in silico in an unbiased scenario, resembling the search-space conditions in drug discovery.
    - Through this framework, we estimated the performance of Active Learning as the pure policy for molecule selection in drug discovery.
    - We observed that Active Learning can quickly find the molecule maximizing a property, such as inhibitory activity (pIC50), starting from a cold start.
    - In a complex search, optimizing multiple key properties (pIC50 and logD) performed better than optimizing a single property (pIC50).
    - In drug discovery, multiple properties are optimized in tandem. Active Learning can easily be adapted to optimize more than one property and guide the search.
    - We hope this work paves the way for testing Active Learning in real drug discovery campaigns through collaborations.
  9. Methodology

    Model: Graph Neural Network (Graph Convolutional Net)
    Statistical Significance: Repeat each campaign 10 times with a different random initial Training Set
    Molecules Per Round: 50
    Target: Find the rediscovery target molecule in the dataset, beginning from a "Cold Start"
    Overall Goal: Find the target molecule with few molecules selected

    Initial Round Data Organization
    Dataset: DPP4 Inhibitors from ChEMBL
    - Full Set: 4161 molecules
    - Hidden Set: 3877 molecules
    - Initial Explorable Set: 284 molecules corresponding to early chronological order in the dataset
    - Initial Training Set: 50 random molecules from the Explorable Set
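The initial data organization on this slide can be sketched as below. The molecule ids are placeholders, the list is simply assumed to be in chronological order, and `initial_split` is a hypothetical helper; only the set sizes (4161, 284, 50) and the 10 repeats come from the slide.

```python
import random
from typing import List, Tuple


def initial_split(molecule_ids: List[str], n_explorable: int, n_train: int,
                  seed: int) -> Tuple[List[str], List[str], List[str]]:
    """Split a chronologically ordered list into hidden, explorable and training sets."""
    explorable = molecule_ids[:n_explorable]    # earliest molecules are visible
    hidden = molecule_ids[n_explorable:]        # everything else starts hidden
    rng = random.Random(seed)
    training = rng.sample(explorable, n_train)  # random cold-start training set
    return hidden, explorable, training


# Placeholder ids standing in for the 4161 DPP4 inhibitors, assumed sorted by date.
ids = [f"mol_{i}" for i in range(4161)]

# Ten repeats, each with a different random initial Training Set.
splits = [initial_split(ids, n_explorable=284, n_train=50, seed=s) for s in range(10)]
hidden, explorable, training = splits[0]
```

Varying only the seed keeps the Hidden and Explorable Sets identical across repeats, so differences between runs come from the random initial Training Set alone.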
  10. 16

  11. 17

  12. 18

  13. Results: Campaign Analysis

    For the overall best performing model (lipo pIC50 tuned), we show the Spearman correlation coefficient of the model evaluated on the hidden set, and the target's ranking by the model. The model ranks molecules from 0 to S (where S is the size of the Explorable Set) and picks the N top-ranked molecules every round. Note that 0 is the highest-ranked molecule and S is the lowest.

    Figure panels: Threshold: 0.77; Threshold: 0.61; Threshold: 0.45
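The two quantities tracked on this slide can be illustrated with a small sketch. The scores are toy values, the helper names are hypothetical, and the Spearman function uses the textbook shortcut formula, which is valid only when there are no tied values; a library routine would normally be used on real data.

```python
from typing import Dict, List


def rank_molecules(scores: Dict[str, float]) -> List[str]:
    """Order molecule ids from best (rank 0) to worst by predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


def spearman_no_ties(x: List[float], y: List[float]) -> float:
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x)
    rx = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: x[i]))}
    ry = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: y[i]))}
    d2 = sum((rx[i] - ry[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy predicted scores for four molecules in the Explorable Set.
predicted = {"m1": 0.2, "m2": 0.9, "target": 0.7, "m3": 0.4}
ranking = rank_molecules(predicted)
target_rank = ranking.index("target")  # 0 would mean the target is ranked best
picked = ranking[:2]                   # the N = 2 top-ranked molecules this round

# Toy ground-truth vs predicted values on a held-out set.
true_vals = [1.0, 2.0, 3.0, 4.0]
pred_vals = [1.1, 1.9, 4.2, 3.8]
rho = spearman_no_ties(true_vals, pred_vals)
```

Once the target's rank drops below N, it falls inside the batch picked for that round, which is the event that ends a rediscovery campaign.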
  14. Results: Campaign Analysis

    For the overall best performing model (lipo pIC50 tuned), we show the average Tanimoto distance of every molecule added to the explorable set, and the number of molecules added to the explorable set every round.

    Figure panels: Threshold: 0.77; Threshold: 0.61; Threshold: 0.45

    Note that even for the lowest threshold of 0.45, most of the molecules added had around 0.55 Tanimoto similarity with the explorable set. This agrees with a study of the Donepezil drug discovery campaign, which found that humans made changes to molecules with, on average, 0.77 Tanimoto distance between them and a standard deviation of 0.16. This places our 0.55 well within 2 standard deviations of the mean.
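The "within 2 standard deviations" statement can be checked directly from the numbers on this slide, treating the 0.55 and 0.77 values as the same measure (the slide uses both "similarity" and "distance", so that reading is an assumption):

```latex
z = \frac{0.77 - 0.55}{0.16} = 1.375 < 2
```

The same arithmetic suggests, although the slides do not say so, that the three thresholds correspond to the reported mean and to one and two standard deviations below it:

```latex
0.77 - 1 \times 0.16 = 0.61, \qquad 0.77 - 2 \times 0.16 = 0.45
```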
  15. Results: Campaign Analysis

    For IC50-only models, we know the ground-truth activity, so we know which molecules have higher activity than Sitagliptin. We analyze what percentage of the molecules with higher (better) activity than Sitagliptin were selected by the model each round during the campaign. We also compare with the remaining percentage in the full dataset.

    Figure panels: Threshold: 0.77; Threshold: 0.61; Threshold: 0.45

    We can see that by the end of the campaign (when Sitagliptin was selected, around rounds 37-39), on average ~80% of the compounds with higher activity than Sitagliptin had already been selected. This means that the algorithm is able to select promising compounds, even more active than the target, from early stages.
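The percentage described above can be computed with a helper along these lines. It is a sketch on toy data: the function name, the molecule ids, and the pIC50 values are made up, and activity is compared on the pIC50 scale (higher is better).

```python
from typing import Dict, Set


def fraction_better_selected(pic50: Dict[str, float], reference: str,
                             selected: Set[str]) -> float:
    """Fraction of molecules more active than `reference` that are already selected."""
    better = {m for m, v in pic50.items() if v > pic50[reference]}
    if not better:
        return 0.0
    return len(better & selected) / len(better)


# Toy activities: five molecules are more active than the reference "ref".
toy_pic50 = {"ref": 7.0, "a": 9.1, "b": 8.5, "c": 8.0, "d": 7.6, "e": 7.2, "f": 6.0}

# Suppose four of the five better molecules have been selected so far.
selected_so_far = {"a", "b", "c", "d", "f"}
frac = fraction_better_selected(toy_pic50, "ref", selected_so_far)
```

Evaluating this after every round gives the kind of curve the slide describes, with the value at the final round corresponding to the reported ~80%.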
  16. Results: Full Table of Experiments

    * Beta refers to the upper confidence bound acquisition function, described as: a(score) = score + beta · std. Beta serves as a way to balance exploitation (the highest predicted score) and exploration (std serves as a confidence measure for the model).

    Legend (highlight styles were lost in extraction):
    • One style marks the best performance within the 50 molecules vs 20 molecules per round experiments (first 6 rows in the table, described in the previous slides).
    • Another style marks the best global performance achieved overall.

    Model       Samples per Round  Beta *  Batch Size  Learning Rate  Epochs  Threshold: 0.77  Threshold: 0.61  Threshold: 0.45

    50 molecules vs 20 molecules per round experiments
    IC50        50                 0.5     2048        1.00E-04       200     1620             1580             1580
    Lipo IC50   50                 0.5     2048        1.00E-04       200     1530             1360             1275
    Random      50                 NA      2048        1.00E-04       200     2240             2415             2035
    IC50        20                 0.5     2048        1.00E-04       200     1496             1342             1336
    Lipo IC50   20                 0.5     2048        1.00E-04       200     1470             1396             1440
    Random      20                 NA      2048        1.00E-04       200     2722             2444             1780

    Hyperparameter experiments
    Lipo IC50   50                 0       2048        1.00E-04       200     1280             1405             1355
    Lipo pIC50  50                 0       20          1.00E-03       50      1225             1265             1325
    Lipo pIC50  50                 0.5     20          1.00E-03       50      1240             1330             1425
    Lipo pIC50  50                 0.5     128         1.00E-03       80      1565             1505             1560
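The acquisition function a(score) = score + beta · std can be sketched as below. The predicted scores and standard deviations are toy values, and the slides do not say how std is obtained, so the inputs and the helper names are assumptions for illustration.

```python
from typing import List, Sequence


def ucb(score: float, std: float, beta: float) -> float:
    """Upper confidence bound: predicted score plus beta times the uncertainty."""
    return score + beta * std


def select_top_n(ids: Sequence[str], scores: Sequence[float],
                 stds: Sequence[float], beta: float, n: int) -> List[str]:
    """Pick the n molecules with the highest acquisition value."""
    acq = {i: ucb(s, sd, beta) for i, s, sd in zip(ids, scores, stds)}
    return sorted(acq, key=acq.get, reverse=True)[:n]


# Toy predictions: "b" scores slightly lower than "a" but is far more uncertain.
ids = ["a", "b", "c"]
scores = [7.0, 6.8, 6.0]
stds = [0.1, 0.8, 0.2]

greedy_pick = select_top_n(ids, scores, stds, beta=0.0, n=1)       # pure exploitation
exploratory_pick = select_top_n(ids, scores, stds, beta=0.5, n=1)  # rewards uncertainty
```

With beta = 0 the selection reduces to picking the highest predicted score, while beta = 0.5 lets a less certain candidate overtake it, matching the beta values 0 and 0.5 listed in the table.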