Active Learning via Incremental Revelation: Dipeptidyl Peptidase-4 Inhibitors Case Study, Elix, CBI 2022

Elix
October 27, 2022

Transcript

  1. Active Learning via
    Incremental Revelation:
    Dipeptidyl Peptidase-4 Inhibitors
    Case Study
    Elix, Inc.
    David Jimenez Barrero
    Nazim Medzhidov
    Chem-Bio Informatics Society (CBI) Annual Meeting 2022, Tokyo, Japan | October 26th, 2022

  2. Contents
    1. Introduction
    2. Contributions
    3. Incremental Revelation Framework
    4. Results
    5. Conclusions

  3. Introduction
    Active Learning is a machine learning approach in which the model selects which samples to learn from.
    - The model(s) are improved continuously
    - Achieves better accuracy with fewer samples than conventional methods
    - Useful in search spaces that are too large and/or expensive to evaluate exhaustively
    Challenges:
    Experimenting with Active Learning in a real campaign is expensive.
    Experimenting in silico with pre-existing data is challenging because:
    - Data from a single campaign is biased towards a single region of the space
    - The search space has boundaries that the models can see, whereas drug discovery is unbounded
    - The models could “see” the rediscovery target molecule from the beginning, which is unrealistic and
    biases the result

  4. Contributions
    ● Propose a framework, Incremental Revelation, that simulates
    in silico the conditions of drug discovery campaigns using
    pre-existing data
    ● Use Incremental Revelation to test Active Learning for the
    rediscovery of the molecule with the highest inhibitory activity
    against DPP4 in the dataset (a Linagliptin derivative)
    ● Use Incremental Revelation to test Active Learning for the
    rediscovery of a known drug in the dataset (Sitagliptin)
    [Figures: Sitagliptin (Known Drug); Linagliptin Derivative (Highest Inh. Act. in Dataset)]

  5. Incremental Revelation data organization
    Incremental Revelation Algorithm
    Incremental Revelation forces a progressive exploration of the
    space around the molecules that have already been selected.
    - Inspired by how chemists work: progressive exploration around
    molecules already encountered.
    - The model cannot see the entire space; it can only see molecules
    that are similar to the ones it has chosen.
    [Diagram labels: Full Set (DPP4 Inhibitors from ChEMBL, 4161 molecules), Hidden Set, Explorable Set, Training Set (init. 50 mols), Selected Samples, Target Molecule; the loop repeats until convergence]
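
The loop described on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: fingerprints are plain sets of on-bits, `score` is a hypothetical stand-in for the trained model's prediction, a hidden molecule is revealed once its Tanimoto similarity to any training molecule reaches `threshold`, and "convergence" is taken to mean that the target has been selected.

```python
import random
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def run_campaign(
    fps: Dict[str, Fingerprint],
    initial_explorable: Iterable[str],
    target: str,
    score: Callable[[str], float],  # hypothetical stand-in for the model
    threshold: float = 0.61,
    per_round: int = 50,
    n_init: int = 50,
    seed: int = 0,
) -> Tuple[int, bool]:
    """Return (size of the training set at the end, whether the target was found)."""
    rng = random.Random(seed)
    explorable: Set[str] = set(initial_explorable)
    hidden: Set[str] = set(fps) - explorable
    # Cold start: a random initial training set drawn from the explorable set.
    training: Set[str] = set(
        rng.sample(sorted(explorable), min(n_init, len(explorable)))
    )

    while target not in training:
        # Rank only what the model is allowed to see and select the top molecules.
        candidates = sorted(explorable - training, key=score, reverse=True)
        if not candidates:
            break  # nothing left to select: the target was never revealed
        training.update(candidates[:per_round])
        # (A real campaign would retrain the model on `training` here.)
        # Assumed reveal rule: similarity to any training molecule >= threshold.
        revealed = {
            h for h in hidden
            if any(tanimoto(fps[h], fps[t]) >= threshold for t in training)
        }
        hidden -= revealed
        explorable |= revealed
    return len(training), target in training
```

With a high threshold the explorable set grows slowly and the target may never be revealed, which is why the sketch stops when no candidates remain.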

  6. Results: Find the Highest Activity Molecule
    Overall Goal: Find the target molecule (Linagliptin derivative) with
    few molecules selected
    Target: Linagliptin Derivative (Highest Inh. Act. in Dataset, IC50: 0.05 nM)
    Observations:
    - Few molecules were required to find the target
    - At most 15% of the total data, beginning from a cold start
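
As a quick check of the numbers on this slide, the snippet below converts the target's IC50 to pIC50 using the standard definition (pIC50 = -log10 of IC50 in mol/L) and works out how many molecules 15% of the 4161-molecule dataset is. The function name is illustrative only.

```python
import math


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_nm * 1e-9)


# Target on this slide: IC50 = 0.05 nM, i.e. a pIC50 of about 10.3.
target_pic50 = pic50_from_nanomolar(0.05)

# "At most 15% of the total data" of 4161 molecules is about 624 molecules.
max_selected = math.floor(0.15 * 4161)
```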

  7. Additional Results: Known Drug (Sitagliptin) Rediscovery
    Task: Rediscover a known drug (Sitagliptin) using Active Learning.
    Challenge: Sitagliptin is NOT among the molecules with the highest inhibitory activity; 1045 molecules in the dataset
    have higher activity than Sitagliptin.
    To this end, we compared two experimental conditions:
    Experiment 1: Rediscover Sitagliptin by using a single model targeting pIC50.
    Experiment 2: Rediscover Sitagliptin by using multiple models targeting pIC50 AND lipophilicity (logD).
    - Low lipophilicity is an important discriminator in drug discovery campaigns
    - The score for molecules becomes the Lipophilic Efficiency (LipE)
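
The slide does not spell out the formula, so the sketch below assumes the standard definition LipE = pIC50 - logD. The molecule names and values are hypothetical and only illustrate how a less potent but less lipophilic molecule can outrank a more potent one.

```python
def lipe(pic50: float, logd: float) -> float:
    """Lipophilic efficiency under the assumed standard definition pIC50 - logD."""
    return pic50 - logd


# Hypothetical (pIC50, logD) pairs, e.g. predictions from two separate models.
candidates = {
    "mol_A": (9.5, 4.0),  # very potent but lipophilic -> LipE 5.5
    "mol_B": (8.0, 1.0),  # less potent, low lipophilicity -> LipE 7.0
}

# Rank by LipE instead of by pIC50 alone.
ranked = sorted(candidates, key=lambda name: lipe(*candidates[name]), reverse=True)
```

Ranking by pIC50 alone would put `mol_A` first; ranking by LipE puts `mol_B` first.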

  8. Additional Results: Known Drug, Sitagliptin
    Target: Sitagliptin (Known Drug)
    Observations:
    - Combining pIC50 and logD reduced the number of
    molecules required to reach Sitagliptin

  9. Conclusions
    - We designed a framework, Incremental Revelation, for testing Active Learning in silico in an unbiased scenario
    that resembles the search-space conditions in drug discovery.
    - Through this framework, we estimated the performance of Active Learning as the sole policy for molecule
    selection in drug discovery.
    - We observed that Active Learning can quickly find the molecule that maximizes a property, such as inhibitory activity
    (pIC50), starting from a cold start.
    - In a complex search, optimizing multiple key properties (pIC50 and logD) performed better than optimizing a
    single property (pIC50).
    - In drug discovery, multiple properties are optimized in tandem. Active Learning can easily be adapted to
    optimize more than one property and guide the search.
    - We hope this work paves the way for testing Active Learning in real drug discovery campaigns through
    collaborations.

  10. Q & A

  11. Elix, Inc.
    http://ja.elix-inc.com/

  12. APPENDIX

  13. Methodology
    Model: Graph Neural Network (Graph Convolutional Network)
    Statistical Significance: Each campaign is repeated 10 times with
    a different random initial Training Set
    Molecules Per Round: 50
    Target: Find the rediscovery target molecule in the dataset,
    beginning from a “Cold Start”
    Overall Goal: Find the target molecule with few molecules
    selected
    Initial Round Data Organization
    Dataset: DPP4 Inhibitors from ChEMBL (Full Set: 4161 mols)
    - Hidden Set: 3877 molecules
    - Initial Explorable Set: 284 molecules, the earliest in the
    dataset's chronological order
    - Initial Training Set: 50 random molecules from the Explorable
    Set
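
The initial data organization above can be sketched as below. This assumes the molecule identifiers are already sorted chronologically (how that order is obtained is not stated on the slide); the function name and the use of integer identifiers are illustrative only.

```python
import random
from typing import List, Sequence, Tuple


def initial_split(
    ordered_ids: Sequence[int],
    n_explorable: int = 284,
    n_training: int = 50,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split chronologically ordered ids into hidden, explorable and training sets."""
    explorable = list(ordered_ids[:n_explorable])  # earliest molecules
    hidden = list(ordered_ids[n_explorable:])      # everything else starts hidden
    # Cold start: the training set is a random draw from the explorable set.
    training = random.Random(seed).sample(explorable, n_training)
    return hidden, explorable, training


# Placeholder ids 0..4160 standing in for the 4161 ChEMBL molecules.
hidden, explorable, training = initial_split(list(range(4161)))

# Ten repetitions with different random initial training sets, as on the slide.
repeats = [initial_split(list(range(4161)), seed=s)[2] for s in range(10)]
```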

  14. Linagliptin Derivative (Highest Inh. Act. in Dataset)

  15. Sitagliptin (Known Drug)

  19. Results: Campaign Analysis
    For the overall best performing model (lipo pIC50 tuned), we show the Spearman correlation coefficient of the model evaluated on the
    hidden set, and the target’s ranking by the model. The model ranks the molecules from 0 to S (where S is the size of the Explorable Set) and picks the top N
    ranked molecules every round. Note that 0 is the highest rank and S is the lowest.
    [Plots: one panel per threshold: 0.77, 0.61, 0.45]
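
The ranking convention described on this slide (rank 0 is best, the top N are picked each round) can be illustrated as follows. The scores are hypothetical placeholders for model predictions, and the helper names are not from the slides.

```python
from typing import Dict, List


def rank_of(target: str, scores: Dict[str, float]) -> int:
    """Zero-based rank of the target; 0 means the highest predicted score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target)


def select_top(scores: Dict[str, float], n: int) -> List[str]:
    """The N highest-ranked molecules, i.e. the batch picked in one round."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical predicted scores for three explorable molecules.
scores = {"mol_A": 0.2, "mol_B": 0.9, "mol_C": 0.5}
target_rank = rank_of("mol_A", scores)
batch = select_top(scores, 2)
```

The target is selected in the first round in which its rank falls below N.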

  20. Results: Campaign Analysis
    For the overall best performing model (lipo pIC50 tuned), we show the average Tanimoto similarity of every molecule added to the explorable set, and the
    number of molecules added to the explorable set every round.
    [Plots: one panel per threshold: 0.77, 0.61, 0.45]
    Note that even for the lowest threshold of 0.45, most of the molecules added had around 0.55 Tanimoto similarity with the explorable set. This agrees with the
    study of the Donepezil drug discovery campaign, which found that the changes humans made to molecules had an average Tanimoto similarity of 0.77 between
    them, with a standard deviation of 0.16. Our 0.55 is therefore well within 2 standard deviations of the mean.
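
The arithmetic behind the "within 2 standard deviations" remark can be checked directly. The snippet also shows that the three thresholds used in the experiments coincide numerically with the mean, the mean minus one, and the mean minus two standard deviations; whether they were chosen that way is an inference, not something the slides state.

```python
mean_sim = 0.77  # average reported for the Donepezil campaign
std_sim = 0.16   # its standard deviation
observed = 0.55  # typical similarity of molecules added at threshold 0.45

# Distance of the observed value from the mean, in standard deviations.
z = (observed - mean_sim) / std_sim  # about -1.4, so inside the 2-sigma band

# Mean minus 0, 1 and 2 standard deviations, rounded to two decimals.
levels = [round(mean_sim - k * std_sim, 2) for k in range(3)]
```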

  21. Results: Campaign Analysis
    For the IC50-only models, we know the ground-truth activity, so we know which molecules have higher activity than Sitagliptin. We
    analyze what percentage of the molecules with higher (better) activity than Sitagliptin were selected by the
    model in each round of the campaign. We also compare with the percentage remaining in the full dataset.
    [Plots: one panel per threshold: 0.77, 0.61, 0.45]
    By the end of the campaign (Sitagliptin was selected around rounds 37-39), on average ~80% of the compounds with
    higher activity than Sitagliptin had already been selected. This means that the algorithm is able to select promising compounds, even
    more active than the target, from the early stages.
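
The percentage tracked on this slide can be computed as in the sketch below, using hypothetical activities and selections. With the 1045 more-active molecules reported earlier, ~80% corresponds to roughly 836 molecules.

```python
from typing import Dict, Iterable


def fraction_better_selected(
    activity: Dict[str, float], selected: Iterable[str], target: str
) -> float:
    """Fraction of molecules more active than the target that were already selected."""
    better = {m for m, a in activity.items() if a > activity[target]}
    if not better:
        return 0.0
    return len(better & set(selected)) / len(better)


# Hypothetical ground-truth pIC50 values; "target" plays the role of Sitagliptin.
activity = {"target": 6.0, "mol_A": 7.0, "mol_B": 8.0, "mol_C": 5.0, "mol_D": 9.0}
selected_so_far = ["mol_A", "mol_D", "mol_C"]
fraction = fraction_better_selected(activity, selected_so_far, "target")

# About 80% of the 1045 more-active molecules in the dataset.
approx_count = round(0.8 * 1045)
```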

  22. Results: Full Table of Experiments
    Group                | Model      | Samples per Round | Beta* | Batch Size | Learning Rate | Epochs | Threshold: 0.77 | Threshold: 0.61 | Threshold: 0.45
    50 vs 20 per round   | IC50       | 50                | 0.5   | 2048       | 1.00E-04      | 200    | 1620            | 1580            | 1580
    50 vs 20 per round   | Lipo IC50  | 50                | 0.5   | 2048       | 1.00E-04      | 200    | 1530            | 1360            | 1275
    50 vs 20 per round   | Random     | 50                | NA    | 2048       | 1.00E-04      | 200    | 2240            | 2415            | 2035
    50 vs 20 per round   | IC50       | 20                | 0.5   | 2048       | 1.00E-04      | 200    | 1496            | 1342            | 1336
    50 vs 20 per round   | Lipo IC50  | 20                | 0.5   | 2048       | 1.00E-04      | 200    | 1470            | 1396            | 1440
    50 vs 20 per round   | Random     | 20                | NA    | 2048       | 1.00E-04      | 200    | 2722            | 2444            | 1780
    Hyperparameter       | Lipo IC50  | 50                | 0     | 2048       | 1.00E-04      | 200    | 1280            | 1405            | 1355
    Hyperparameter       | Lipo pIC50 | 50                | 0     | 20         | 1.00E-03      | 50     | 1225            | 1265            | 1325
    Hyperparameter       | Lipo pIC50 | 50                | 0.5   | 20         | 1.00E-03      | 50     | 1240            | 1330            | 1425
    Hyperparameter       | Lipo pIC50 | 50                | 0.5   | 128        | 1.00E-03      | 80     | 1565            | 1505            | 1560
    * Beta is the coefficient in the upper confidence bound acquisition function a(score) = score + beta · std. It balances exploitation (the highest predicted score) and
    exploration (std serves as a confidence measure for the model).
    • Highlighting in the original table (not preserved in this transcript) marks the best results within the
    50 molecules vs 20 molecules per round experiments (first 6 rows, described in the previous slides).
    • A second highlight marks the best global performance achieved overall.
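
The acquisition function from the table footnote, a(score) = score + beta · std, can be sketched as below. The slides do not say how std is obtained; this example assumes it is the spread of several predictions per molecule (for instance from an ensemble), and all names and values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def ucb(predictions: List[float], beta: float) -> float:
    """a(score) = score + beta * std, with score taken as the mean prediction."""
    return mean(predictions) + beta * pstdev(predictions)


def select_batch(preds: Dict[str, List[float]], beta: float, n: int) -> List[str]:
    """Pick the n molecules with the highest acquisition value."""
    ranked = sorted(preds, key=lambda m: ucb(preds[m], beta), reverse=True)
    return ranked[:n]


# Hypothetical per-molecule predictions from three models.
preds = {
    "mol_A": [7.2, 7.2, 7.2],  # confident prediction
    "mol_B": [5.0, 7.0, 9.0],  # lower mean but high uncertainty
}

greedy_pick = select_batch(preds, beta=0.0, n=1)       # pure exploitation
exploratory_pick = select_batch(preds, beta=0.5, n=1)  # rewards uncertainty
```

With beta = 0 the confident `mol_A` is picked; with beta = 0.5 the uncertainty bonus moves `mol_B` to the top, which is the trade-off the Beta column in the table varies.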
