Slide 1

Slide 1 text

Active Learning via Incremental Revelation: Dipeptidyl Peptidase-4 Inhibitors Case Study
Elix, Inc.
David Jimenez Barrero, Nazim Medzhidov
Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo, Japan | October 26th, 2022

Slide 2

Slide 2 text

Contents
1. Introduction
2. Contributions
3. Incremental Revelation Framework
4. Results
5. Conclusions

Slide 3

Slide 3 text

Introduction

Active Learning is a machine learning approach that:
- Constantly improves its model(s)
- Achieves better accuracy with fewer samples than standard methods, by allowing the model to select which samples to learn from
- Is useful in search spaces that are too large and/or too expensive to evaluate exhaustively

Challenges:
- Experimenting with Active Learning in a real campaign is expensive
- Experimenting in silico on pre-existing data is challenging because:
  - Data from a single campaign is biased toward a single region of the space
  - The search space has boundaries that the models can see, while drug discovery is unbounded
  - The models could "see" the rediscovery target molecule from the beginning, which is unrealistic and can be considered a bias
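As a concrete illustration of the selection loop just described, a minimal pool-based active-learning sketch is shown below; the model interface, oracle, and stopping rule are placeholders for illustration only, not the implementation used in this work.

```python
# Minimal pool-based active-learning loop (illustrative placeholders only).
def active_learning_loop(model, labeled, pool, oracle, n_per_round=50, n_rounds=40):
    for _ in range(n_rounds):
        model.fit(labeled)                                    # retrain on all labeled data so far
        ranked = sorted(pool, key=model.predict, reverse=True)
        picked, pool = ranked[:n_per_round], ranked[n_per_round:]
        labeled = labeled + [(x, oracle(x)) for x in picked]  # "assay" the selected samples
    return model, labeled
```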

Slide 4

Slide 4 text

Contributions
● Propose a framework, Incremental Revelation, that lets us simulate in silico the conditions of drug discovery campaigns using pre-existing data
● Use Incremental Revelation to test Active Learning for the rediscovery of the molecule with the highest inhibitory activity against DPP4 in the dataset (a Linagliptin derivative)
● Use Incremental Revelation to test Active Learning for the rediscovery of a known drug in the dataset (Sitagliptin)

[Figures: Sitagliptin (Known Drug); Linagliptin Derivative (Highest Inh. Act. in Dataset)]

Slide 5

Slide 5 text

Incremental Revelation

Incremental Revelation forces a progressive exploration of the space around the molecules that have already been selected:
- Inspired by how chemists work: progressive exploration around the molecules encountered so far.
- The model cannot see the entire space; it can only see molecules that are similar to the ones it has already chosen.

[Diagram, Incremental Revelation data organization and algorithm: Full Set (DPP4 Inhibitors from ChEMBL, 4161 molecules) split into Hidden Set, Explorable Set, and Training Set (init. 50 mols); selected samples reveal similar molecules, and the loop repeats until convergence on the target molecule.]
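A minimal sketch of the revelation step implied by this description, assuming molecules are compared with RDKit Morgan fingerprints and a Tanimoto-similarity threshold; the function and parameter names are illustrative, not the authors' code.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """2048-bit Morgan fingerprint used to compare molecules."""
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

def reveal_similar(selected_smiles, hidden_smiles, threshold=0.61):
    """Move hidden molecules that are similar enough to any newly selected
    molecule into the explorable set; everything else stays hidden."""
    selected_fps = [fingerprint(s) for s in selected_smiles]
    revealed, still_hidden = [], []
    for smiles in hidden_smiles:
        fp = fingerprint(smiles)
        if max(DataStructs.TanimotoSimilarity(fp, sel) for sel in selected_fps) >= threshold:
            revealed.append(smiles)
        else:
            still_hidden.append(smiles)
    return revealed, still_hidden
```

If the thresholds reported later in the slides (0.77, 0.61, 0.45) are used in this role, they would correspond to stricter or looser revelation of the hidden space.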

Slide 6

Slide 6 text

Results: Find the Highest-Activity Molecule

Overall goal: find the target molecule (the Linagliptin derivative) with as few selected molecules as possible.

Observations:
- Few molecules were required to find the target
- At most 15% of the total data was needed, starting from a cold start

[Figure: Linagliptin Derivative (Highest Inh. Act. in Dataset, IC50: 0.05 nM)]

Slide 7

Slide 7 text

Additional Results: Known Drug (Sitagliptin) Rediscovery

Task: rediscover a known drug (Sitagliptin) using Active Learning.
Challenge: Sitagliptin is NOT among the highest inhibitory activity molecules; 1045 molecules in the dataset have higher activity than Sitagliptin.

To this end, we compared two experimental conditions:
- Experiment 1: rediscover Sitagliptin using a single model targeting pIC50.
- Experiment 2: rediscover Sitagliptin using multiple models targeting pIC50 AND lipophilicity (logD).
  - Low lipophilicity is an important discriminator in drug discovery campaigns
  - The score for molecules becomes the Lipophilic Efficiency (LiPE), as sketched below
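The slide does not spell out the LiPE formula; a common definition is pIC50 minus a lipophilicity term (logP, or logD as used here), so the scoring below is a hedged sketch rather than the authors' exact code.

```python
def lipe_score(pred_pic50: float, pred_logd: float) -> float:
    """Lipophilic Efficiency as commonly defined: LiPE = pIC50 - logD.
    Both inputs would come from the two separate property models."""
    return pred_pic50 - pred_logd

# Example: pIC50 = 8.0 and logD = 2.5 give LiPE = 5.5; ranking by this score
# favors potent molecules that are not overly lipophilic.
```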

Slide 8

Slide 8 text

Additional Results: Known Drug (Sitagliptin)

Observations:
- Combining pIC50 and logD reduced the number of molecules required to reach Sitagliptin

[Figure: Sitagliptin (Known Drug)]

Slide 9

Slide 9 text

Conclusions
- We designed a framework, Incremental Revelation, for testing Active Learning in silico in an unbiased scenario that resembles the search-space conditions of drug discovery.
- Through this framework, we estimated the performance of Active Learning as the sole policy for molecule selection in drug discovery.
- We observed that Active Learning can quickly find the molecule maximizing a property, such as inhibitory activity (pIC50), starting from a cold start.
- In a complex search, optimizing multiple key properties (pIC50 and logD) performed better than using a single property (pIC50).
- In drug discovery, multiple properties are optimized in tandem; Active Learning can easily be adapted to include more than one property to optimize and guide the search.
- We hope this work paves the way for testing Active Learning in real drug discovery campaigns through collaborations.

Slide 10

Slide 10 text

Q & A

Slide 11

Slide 11 text

Elix, Inc. (株式会社Elix) http://ja.elix-inc.com/

Slide 12

Slide 12 text

APPENDIX

Slide 13

Slide 13 text

Methodology

- Model: Graph Neural Network (Graph Convolutional Network)
- Statistical significance: each campaign is repeated 10 times with a different random initial Training Set
- Molecules per round: 50
- Target: find the rediscovery target molecule in the dataset, beginning from a "cold start"
- Overall goal: find the target molecule with as few selected molecules as possible

Initial round data organization (DPP4 Inhibitors from ChEMBL):
- Full Set: 4161 molecules
- Initial Explorable Set: 284 molecules, corresponding to the chronologically earliest entries in the dataset
- Initial Training Set: 50 random molecules from the Explorable Set
- Hidden Set: the remaining 3877 molecules
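A minimal sketch of the initial data organization described above, assuming the dataset is already sorted chronologically; the set sizes come from this slide, while the function name and RNG handling are illustrative.

```python
import random

def initial_split(all_smiles, n_explorable=284, n_train=50, seed=0):
    """Earliest 284 molecules form the Explorable Set, 50 random molecules from it
    form the Training Set, and the remaining molecules stay in the Hidden Set.
    Assumes `all_smiles` is ordered chronologically."""
    explorable = list(all_smiles[:n_explorable])
    hidden = list(all_smiles[n_explorable:])
    train = random.Random(seed).sample(explorable, n_train)
    return train, explorable, hidden
```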

Slide 14

Slide 14 text

[Figure: Linagliptin Derivative (Highest Inh. Act. in Dataset)]

Slide 15

Slide 15 text

[Figure: Sitagliptin (Known Drug)]

Slide 16

Slide 16 text


Slide 17

Slide 17 text


Slide 18

Slide 18 text


Slide 19

Slide 19 text

Results: Campaign Analysis

For the overall best-performing model (lipo pIC50, tuned), we show the Spearman correlation coefficient of the model evaluated on the Hidden Set, and the target molecule's ranking by the model. The model ranks molecules from 0 to S (where S is the size of the Explorable Set) and picks the top-N-ranked molecules every round; note that rank 0 is the highest-ranked molecule and S is the lowest.

[Panels for thresholds 0.77, 0.61, and 0.45]
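As a hedged illustration of the two quantities plotted (the evaluation code is not shown on the slide), the Hidden Set correlation and the target's rank could be computed roughly as follows; SciPy's `spearmanr` and the function name are assumptions.

```python
from scipy.stats import spearmanr

def campaign_metrics(hidden_pred, hidden_true, explorable_pred, target_idx):
    """Spearman correlation of predictions vs. ground truth on the Hidden Set,
    and the target's rank (0 = highest) among the Explorable Set predictions."""
    rho, _ = spearmanr(hidden_pred, hidden_true)
    ranking = sorted(range(len(explorable_pred)),
                     key=lambda i: explorable_pred[i], reverse=True)
    return rho, ranking.index(target_idx)
```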

Slide 20

Slide 20 text

Results: Campaign Analysis

For the overall best-performing model (lipo pIC50, tuned), we show the average Tanimoto similarity of every molecule added to the Explorable Set, and the number of molecules added to the Explorable Set every round.

Note that even for the lowest threshold of 0.45, most of the molecules added had around 0.55 Tanimoto similarity to the Explorable Set. This agrees with a study of the Donepezil drug discovery campaign, which found that the changes humans made to molecules had, on average, 0.77 Tanimoto similarity between them with a standard deviation of 0.16; our 0.55 is therefore well within two standard deviations of that mean.

[Panels for thresholds 0.77, 0.61, and 0.45]
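A sketch of the similarity metric behind these plots, assuming "average Tanimoto" means the mean similarity of each newly added molecule to the molecules already in the Explorable Set (the slide leaves this implicit) and that RDKit Morgan fingerprints are used.

```python
from statistics import mean
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def mean_tanimoto_to_set(new_smiles, explorable_smiles):
    """Mean Tanimoto similarity of one newly revealed molecule to the
    current Explorable Set, using 2048-bit Morgan fingerprints."""
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(new_smiles), 2, nBits=2048)
    set_fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
               for s in explorable_smiles]
    return mean(DataStructs.BulkTanimotoSimilarity(fp, set_fps))
```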

Slide 21

Slide 21 text

Results: Campaign Analysis

For the IC50-only models we know the ground-truth activity, so we know which molecules have higher activity than Sitagliptin. We therefore analyze, during the campaign, what percentage of the molecules with higher (better) activity than Sitagliptin had been selected by the model each round, and compare it with the percentage remaining in the full dataset.

By the end of the campaign (Sitagliptin was selected around round 37-39), on average ~80% of the compounds with higher activity than Sitagliptin had already been selected. This means the algorithm is able to select promising compounds, even more promising than the target, from the early stages.

[Panels for thresholds 0.77, 0.61, and 0.45]

Slide 22

Slide 22 text

Results: Full Table of Experiments

* Beta refers to the upper-confidence-bound acquisition function a(score) = score + beta · std. Beta balances exploitation (the highest predicted score) against exploration (std serves as a confidence measure for the model).
• a marks the best performance within the 50-molecules-vs-20-molecules-per-round experiments (first 6 rows of the table, described in the previous slides).
• a marks the best global performance achieved overall.

Model      | Samples per Round | Beta* | Batch Size | Learning Rate | Epochs | Threshold: 0.77 | Threshold: 0.61 | Threshold: 0.45

50 molecules vs 20 molecules per round experiments:
IC50       | 50 | 0.5 | 2048 | 1.00E-04 | 200 | 1620 | 1580 | 1580
Lipo IC50  | 50 | 0.5 | 2048 | 1.00E-04 | 200 | 1530 | 1360 | 1275
Random     | 50 | NA  | 2048 | 1.00E-04 | 200 | 2240 | 2415 | 2035
IC50       | 20 | 0.5 | 2048 | 1.00E-04 | 200 | 1496 | 1342 | 1336
Lipo IC50  | 20 | 0.5 | 2048 | 1.00E-04 | 200 | 1470 | 1396 | 1440
Random     | 20 | NA  | 2048 | 1.00E-04 | 200 | 2722 | 2444 | 1780

Hyperparameter experiments:
Lipo IC50  | 50 | 0   | 2048 | 1.00E-04 | 200 | 1280 | 1405 | 1355
Lipo pIC50 | 50 | 0   | 20   | 1.00E-03 | 50  | 1225 | 1265 | 1325
Lipo pIC50 | 50 | 0.5 | 20   | 1.00E-03 | 50  | 1240 | 1330 | 1425
Lipo pIC50 | 50 | 0.5 | 128  | 1.00E-03 | 80  | 1565 | 1505 | 1560
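As a small worked example of the beta footnote above, the acquisition value could be computed as follows; the function name and NumPy usage are illustrative, not the authors' code.

```python
import numpy as np

def ucb_acquisition(pred_scores, pred_stds, beta=0.5):
    """a(score) = score + beta * std: beta trades off exploitation (high predicted
    score) against exploration (high model uncertainty, expressed as std)."""
    return np.asarray(pred_scores) + beta * np.asarray(pred_stds)

# Molecules are ranked by this value and the top N per round are selected;
# beta = 0 reduces to pure exploitation of the predicted score.
```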