Constantly improving model(s)
- Achieves better accuracy with fewer samples than regular methods by letting the model select which samples to learn from
- Useful in search spaces that are too large and/or too expensive to evaluate exhaustively

Challenges:
Experimenting with Active Learning in a real campaign is expensive.
Experimenting in-silico with pre-existing data is challenging because:
- Data from a single campaign is biased towards a single region of the space
- The search space has boundaries that the models can see, whereas drug discovery is unbounded
- The models could “see” the rediscovery target molecule from the beginning, which is unrealistic and biases the evaluation
that helps us to simulate in-silico the conditions of drug discovery campaigns using pre-existing data
• Use Incremental Revelation to test Active Learning for the rediscovery of the molecule with the highest inhibitory activity against DPP4 in the dataset (a Linagliptin derivative)
• Use Incremental Revelation to test Active Learning for the rediscovery of a known drug in the dataset (Sitagliptin)

[Figure: Linagliptin derivative (highest inhibitory activity in the dataset)]
forces a progressive exploration of the space around the molecules that have already been selected.
- Inspired by molecular chemists: progressive exploration around encountered molecules.
- The model cannot see the entire space; it can only see molecules that are similar to the ones it has already chosen.

[Figure: DPP4 inhibitors from ChEMBL (4161 molecules). Labels: Full Set, Hidden Set, Explorable Set, Training Set (init. 50 mols), Selected Samples, Target Molecule, Repeat Until Convergence]
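The revelation step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes fingerprints are given as sets of on-bits, that a hidden molecule becomes explorable when its Tanimoto similarity to any selected molecule reaches a threshold, and the names `tanimoto` and `reveal` as well as the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Set


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reveal(
    fingerprints: Dict[str, FrozenSet[int]],
    hidden: Set[str],
    selected: Set[str],
    threshold: float,
) -> Set[str]:
    """Return the hidden molecules that are similar enough to any selected one.

    Assumption: "similar enough" means Tanimoto similarity >= threshold.
    """
    return {
        h
        for h in hidden
        if any(
            tanimoto(fingerprints[h], fingerprints[s]) >= threshold
            for s in selected
        )
    }


# Toy fingerprints (hypothetical, not taken from the DPP4 dataset).
fps = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3, 5}),  # similarity to A: 3 / 5 = 0.6
    "C": frozenset({7, 8, 9}),     # similarity to A: 0.0
}
hidden = {"B", "C"}

# After selecting A, only B crosses the 0.45 threshold and becomes explorable.
newly_visible = reveal(fps, hidden, selected={"A"}, threshold=0.45)
hidden -= newly_visible
```

In a full campaign this step would run once per round, after the model has picked its molecules, so that the Explorable Set grows only around what has already been chosen.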
target
- At most 15% of the total data, beginning from a cold start

Results: Find the Highest Activity Molecule
Linagliptin derivative (highest inhibitory activity in the dataset, IC50: 0.05 nM)
Overall Goal: Find the target molecule (Linagliptin derivative) with few molecules selected
drug (Sitagliptin) using Active Learning.
Challenge: Sitagliptin is NOT among the molecules with the highest inhibitory activity; 1045 molecules in the dataset have higher activity than Sitagliptin.
To this end, we compared two experimental conditions:
Experiment 1: Rediscover Sitagliptin using a single model targeting pIC50.
Experiment 2: Rediscover Sitagliptin using multiple models targeting pIC50 AND lipophilicity (logD).
- Low lipophilicity is an important discriminator in drug discovery campaigns
- The score for molecules becomes the Lipophilic Efficiency (LiPE)
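The difference between the two scoring schemes can be illustrated with a short sketch. It assumes the common definition of lipophilic efficiency as pIC50 minus logD, which the slides do not spell out; the function name and the toy predictions are hypothetical.

```python
from typing import Dict, List, Tuple


def lipophilic_efficiency(pic50: float, logd: float) -> float:
    """Lipophilic efficiency, assuming the common definition pIC50 - logD."""
    return pic50 - logd


# Hypothetical model predictions per molecule: (pIC50, logD).
predictions: Dict[str, Tuple[float, float]] = {
    "m1": (9.0, 4.0),  # very potent but lipophilic: efficiency 5.0
    "m2": (8.0, 1.5),  # less potent but much less lipophilic: efficiency 6.5
}

# Experiment 1 style: rank by predicted pIC50 only.
by_pic50: List[str] = sorted(
    predictions, key=lambda m: predictions[m][0], reverse=True
)

# Experiment 2 style: rank by the combined lipophilic efficiency score.
by_lipe: List[str] = sorted(
    predictions, key=lambda m: lipophilic_efficiency(*predictions[m]), reverse=True
)
```

In this toy case the two rankings disagree, which is the kind of situation in which a target that is not the most potent molecule in the dataset can still rise to the top of the combined score.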
testing Active Learning in-silico in an unbiased scenario that resembles the search-space conditions in drug discovery.
- Through this framework, we estimated the performance of Active Learning as the sole policy for molecule selection in drug discovery.
- We observed that Active Learning can quickly find the molecule that maximizes a property, such as inhibitory activity (pIC50), starting from a Cold Start.
- In a complex search, optimizing multiple key properties (pIC50 and logD) performed better than optimizing a single property (pIC50).
- In drug discovery, multiple properties are optimized in tandem. Active Learning can easily be adapted to optimize more than one property and guide the search.
- We hope this work paves the way for testing Active Learning in real drug discovery campaigns through collaborations.
each campaign 10 times with a different random initial Training Set
Molecules Per Round: 50
Target: Find the rediscovery target molecule in the dataset, beginning from a “Cold Start”
Overall Goal: Find the target molecule with few molecules selected

Initial Round Data Organization
Dataset:
- Initial Explorable Set: 284 molecules, the chronologically earliest in the dataset
- Initial Training Set: 50 random molecules from the Explorable Set

Methodology
[Figure: DPP4 inhibitors from ChEMBL. Labels: Full Set (4161 mols), Hidden Set (3877 mols), Explorable Set (284 mols), Training Set (50 mols)]
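The initial data organization can be sketched as below. This is an illustrative reading of the slide rather than the authors' code: it assumes the molecule list is already sorted chronologically, and the function name `cold_start_split`, the seeds and the placeholder identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def cold_start_split(
    molecules: Sequence[str],
    n_explorable: int = 284,
    n_train: int = 50,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Split a chronologically sorted list into hidden, explorable and training sets.

    The earliest `n_explorable` molecules form the Explorable Set, the rest
    stay hidden, and the Training Set is a random sample of the Explorable Set.
    """
    explorable = list(molecules[:n_explorable])
    hidden = list(molecules[n_explorable:])
    rng = random.Random(seed)
    training = rng.sample(explorable, n_train)
    return hidden, explorable, training


# Placeholder identifiers standing in for the 4161 DPP4 inhibitors.
mols = [f"mol_{i}" for i in range(4161)]

# Ten repetitions, each with a different random initial Training Set.
splits = [cold_start_split(mols, seed=s) for s in range(10)]
hidden, explorable, training = splits[0]
```

With these sizes the Hidden Set holds 4161 - 284 = 3877 molecules, matching the figure labels.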
(lipo pIC50 tuned), we show the Spearman correlation coefficient of the model evaluated on the hidden set, and the target’s rank according to the model. The model ranks molecules from 0 to S (where S is the size of the Explorable Set) and picks the N top-ranked molecules every round. Note that 0 is the highest-ranked molecule and S is the lowest.

[Figure panels: Threshold: 0.77, Threshold: 0.61, Threshold: 0.45]
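The two quantities shown in these plots can be computed along the following lines. This is a dependency-free sketch on toy numbers, not the evaluation code behind the slides; the helper names and scores are hypothetical, and a statistics library would normally be used for the Spearman coefficient.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def target_rank(scores: Dict[str, float], target: str) -> int:
    """0-based position of the target when molecules are sorted best-first."""
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    return ordered.index(target)


# Toy predicted scores versus ground truth for four hidden molecules.
predicted = [0.1, 0.4, 0.35, 0.8]
truth = [1.0, 2.0, 3.0, 4.0]
rho = spearman(predicted, truth)

# Toy scores over the explorable set; rank 0 is the model's top pick.
scores = {"a": 0.1, "b": 0.4, "target": 0.35, "d": 0.8}
rank = target_rank(scores, "target")
```

A target rank below N in a given round means the target would be among the molecules picked in that round.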
(lipo pIC50 tuned), we show the average Tanimoto distance of every molecule added to the explorable set, and the number of molecules added to the explorable set every round.

[Figure panels: Threshold: 0.77, Threshold: 0.61, Threshold: 0.45]

Note that even for the lowest threshold of 0.45, most of the molecules added had around 0.55 Tanimoto similarity with the explorable set. This agrees with the study of the Donepezil drug discovery campaign, which found that humans made changes to molecules having on average 0.77 Tanimoto distance between them, with a standard deviation of 0.16. Our 0.55 is therefore well within 2 standard deviations of the mean.
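The arithmetic behind the "within 2 standard deviations" statement can be checked with a few lines. The numbers are the ones quoted on the slide; the variable names are only illustrative.

```python
# Values quoted on the slide for the Donepezil campaign and for this study.
mean_value = 0.77   # average reported for the Donepezil campaign
std_value = 0.16    # its standard deviation
observed = 0.55     # typical value observed at the lowest threshold

# Number of standard deviations between the observed value and the mean.
z = (observed - mean_value) / std_value

# Mean minus 0, 1 and 2 standard deviations, rounded to two decimals.
levels = [round(mean_value - k * std_value, 2) for k in range(3)]
```

The observed value sits about 1.4 standard deviations below the mean, and the three levels computed here coincide with the thresholds 0.77, 0.61 and 0.45 used in the panels.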
the ground-truth activity, so we know which molecules have higher activity than Sitagliptin. We analyze what percentage of the molecules with higher (better) activity than Sitagliptin were selected by the model each round during the campaign. We also compare this with the percentage remaining in the full dataset.

[Figure panels: Threshold: 0.77, Threshold: 0.61, Threshold: 0.45]

By the end of the campaign (Sitagliptin was selected around rounds 37-39), on average ~80% of the compounds with higher activity than Sitagliptin had already been selected. This means that the algorithm is able to select promising compounds, some even more active than the target, from the early stages.
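The analysis above can be sketched as a cumulative fraction per round. This is a toy illustration under the assumption that the percentage is cumulative over rounds; the function name, activities and selections are hypothetical and unrelated to the real campaign data.

```python
from typing import Dict, List, Set


def fraction_better_selected(
    activity: Dict[str, float], selected: Set[str], target: str
) -> float:
    """Fraction of molecules more active than the target that were selected."""
    better = {m for m, a in activity.items() if a > activity[target]}
    if not better:
        return 0.0
    return len(better & selected) / len(better)


# Toy ground-truth activities (pIC50-like values); "t" plays the target.
activity = {"t": 7.0, "a": 8.0, "b": 9.0, "c": 6.0, "d": 7.5}

# Molecules picked by the model in each round of a toy campaign.
rounds: List[Set[str]] = [{"a", "c"}, {"b"}, {"t"}]

# Accumulate the selections and track the fraction after every round.
selected: Set[str] = set()
fractions: List[float] = []
for picks in rounds:
    selected |= picks
    fractions.append(fraction_better_selected(activity, selected, "t"))
```

Here three molecules ("a", "b", "d") are more active than the target, and two of them have been selected by the time the target itself is picked.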