An Elix Discovery™ Case Study: Rediscovering Donepezil with an In-house Generative Model

Elix
June 29, 2022

Transcript

  1. 1
    Elix Discovery™: a rediscovery
    case study of Donepezil

  2. 2
    Goals
    ● Provide a use case of the Elix Discovery™ Platform
    ● Focus on the generative models
    ● Focus on novel scaffolds:
    ○ Rediscover the scaffold of a known drug that is novel with respect to the training data.

  3. 3
    Target Molecule Selection Criteria
    Problem design:
    ● Identify a known drug (“target molecule”) with a detailed description of the discovery process
    ● Collect publicly available data on the protein target of the target molecule
    ● Filter the training set to exclude molecules similar to the target molecule
    ● Train predictive and generative models
    ● Observe whether scaffold rediscovery is successful
    Target Molecule Selection Criteria:
    ● NOT a kinase inhibitor
    ● Well-documented drug design process
    ● Diverse dataset (not focused on derivatives of a single moiety)

  4. 4
    Study Workflow
    Dataset Filtering
    ● Exclusion of donepezil scaffold-containing compounds from the pre-training set
    ● Exclusion of donepezil scaffold and relevant molecules from the training set
    Dataset Curation
    ● Pre-training set:
    ○ ChEMBL data
    ○ Objective: learn SMILES vocabulary
    ● Training set:
    ○ AChE inhibitors from the ChEMBL database
    Predictive Model Training
    ● Single model for activity prediction in the generation step
    ● 10-model ensemble for activity prediction in post-processing
    ● BBB permeability model ensemble of 5 models:
    ○ Trained on a curated dataset of 9059 samples
    Generative Model Training
    ● In-house developed SmilesFormer model
    ● Pre-trained on the cleaned ChEMBL data
    ● Fine-tuned on the cleaned activity data
    Generation and Data Analysis
    ● Generate 30K molecules/run
    ● Phys-chem filters
    ● Medicinal chemistry filters (MCFs)
    ● Novelty filters
    ● BBB permeability prediction confidence filter
    ● Activity prediction confidence filter
    ● Scaffold grouping and rankings
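
The workflow mentions a 10-model ensemble and a prediction confidence filter, but the slides do not say how confidence is derived. One common approach, shown here purely as an assumption, is to take the ensemble mean as the prediction and the spread across members as an (inverse) confidence signal. The models below are hypothetical stand-in callables, not the platform's predictors.

```python
from statistics import mean, pstdev


def ensemble_predict(models, features):
    """Return (mean prediction, spread) over an ensemble of callables.

    Assumption: a smaller spread across ensemble members is read as
    higher confidence. The slides do not specify the actual measure.
    """
    predictions = [model(features) for model in models]
    return mean(predictions), pstdev(predictions)


# Hypothetical toy "models": each adds a different bias to a shared score.
toy_models = [lambda x, b=b: sum(x) + b for b in (-0.2, 0.0, 0.2)]
predicted_pic50, spread = ensemble_predict(toy_models, [3.0, 4.0])
```

In a real setting each callable would wrap a trained regressor, and molecules would be ranked by ascending spread before applying a top-n-percent cut.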

  5. 5
    Donepezil (Aricept)
    ● Used for Alzheimer’s disease treatment
    ● Centrally acting reversible acetylcholinesterase (AChE) inhibitor
    [Figure: structures of physostigmine, galantamine, tacrine, donepezil, and rivastigmine]
    [Figure: structures labeled Compound 1 (seed), Compound 8 (backbone), and donepezil; labeled moieties: N-benzylpiperazine, 1-indanone, N-benzylpiperidine]

  6. 6
    Training Set Filtering
    [Figure: filtered pattern and filtered substructures (Group 1 AND Group 2)]

  7. 7
    Training set distribution
    [Plot: training set molecules by Tanimoto similarity to donepezil and pIC50; n = 3950]

  8. 8
    Training set: Most abundant scaffolds
    Procedure:
    ● Extract unique scaffolds from the training set
    ● Search the training set for substructure matches to each scaffold
    [Figure: scaffold structures. Legend: number of molecules containing the structure]
    3572 807 787 615 524
    120 120 120 117 112
    265 220 207 156 150

  9. 9
    Training set: 10 most similar molecules to donepezil
    Legend: Tanimoto similarity to Donepezil
    0.443 0.443 0.441 0.435 0.433
    0.432 0.431 0.429 0.425 0.424
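
For reference, the Tanimoto similarity used in the legend is the ratio of shared to total fingerprint bits. The sketch below assumes fingerprints are available as sets of on-bit indices; the fingerprint type used in the study is not stated, and the molecule names and bit sets are made-up placeholders.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, library, k=10):
    """Return the k library entries most similar to the query fingerprint.

    library maps a molecule name to its set of on-bits.
    """
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical fingerprints, for illustration only.
query = {1, 2, 3, 4}
library = {"mol_a": {1, 2, 3, 9}, "mol_b": {7, 8}, "mol_c": {1, 2, 3, 4, 5}}
ranking = most_similar(query, library, k=2)
```

With a cheminformatics toolkit, the sets would be replaced by that toolkit's fingerprint objects and similarity function.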

  10. 10
    Generation Procedure
    Multiobjective Optimization Problem:
    ● SA score
    ● QED score
    ● Favorable physical-chemical properties
    ● Novelty (distance from the training set)
    ● Activity
    Generative Score:
    ● The average of the normalized single scores (SA, QED, phys-chem, novelty, activity) was computed for each generated molecule
    ● Molecules with the highest “generative score” were prioritized during the generation process
    ● Up to 30K molecules with highest scores were generated in each sampling run
    ● 10 sampling runs were performed in total
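
The generative score described above can be sketched as an unweighted mean of per-objective scores scaled to [0, 1]. The normalization ranges, the inversion of the SA score (lower SA means easier synthesis), and the dictionary keys are assumptions made for illustration; the slides do not give the actual normalization.

```python
from statistics import mean

# Assumed normalization ranges (not from the slides).
RANGES = {
    "sa": (1.0, 10.0),       # SA score: 1 (easy) to 10 (hard to synthesize)
    "qed": (0.0, 1.0),
    "phys_chem": (0.0, 1.0),
    "novelty": (0.0, 1.0),
    "activity": (4.0, 10.0),  # predicted pIC50, hypothetical range
}


def normalize(value, low, high):
    """Min-max scale value to [0, 1], clipping values outside the range."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def generative_score(raw_scores):
    """Average the normalized single scores for one molecule."""
    parts = []
    for name, (low, high) in RANGES.items():
        score = normalize(raw_scores[name], low, high)
        if name == "sa":
            score = 1.0 - score  # invert so that higher is better
        parts.append(score)
    return mean(parts)


example = {"sa": 5.5, "qed": 0.5, "phys_chem": 0.5, "novelty": 0.5, "activity": 7.0}
score = generative_score(example)
```

An unweighted mean treats all five objectives as equally important; weighting would be a natural variation if some objectives matter more.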

  11. 11
    Post-Processing Analysis Summary
    [Diagram: each of the 10 sampling runs (30K molecules each) passes through the filtering steps to give its top 20 scaffolds; the scaffolds from all runs are then combined (step 6), and the 20 most frequent scaffolds are reported (step 7)]

  12. 12
    Post-Processing Analysis
    1) Phys-Chem & MCFs
    ● Lipinski’s RO5
    ● Allowed common atoms
    ● Ring size (up to 8)
    ● Medicinal chemistry filters (189 filters)
    2) Novelty
    ● Avoid building upon known scaffolds (tacrine and physostigmine)
    ● Remove molecules with an exact scaffold match to the training set
    ● Remove molecules with a Tanimoto similarity score > 0.5
    3) BBB Permeability
    ● Choose molecules based on a BBB permeability prediction probability threshold
    ● Value used: 0.99
    4) Activity Prediction Confidence
    ● Choose the top n percent of the molecules based on pIC50 prediction confidence
    ● Values used: 50%, 40%, 30%, 20%
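
Filtering steps 1 to 4 can be pictured as a simple cascade over precomputed properties. This is a minimal sketch: the `Candidate` fields are hypothetical, the Rule of Five is applied with zero violations allowed, the thresholds are treated as inclusive, and the atom, ring-size, and medicinal chemistry filters are omitted. None of these details are confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record of precomputed properties for one molecule.
    smiles: str
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int
    scaffold: str
    max_train_similarity: float  # highest Tanimoto similarity to the training set
    bbb_probability: float
    activity_confidence: float


def passes_ro5(c):
    """Lipinski's Rule of Five, assuming no violations are tolerated."""
    return (c.mol_weight <= 500 and c.logp <= 5
            and c.h_donors <= 5 and c.h_acceptors <= 10)


def post_process(candidates, train_scaffolds, top_fraction=0.5,
                 bbb_threshold=0.99, max_similarity=0.5):
    """Apply phys-chem, novelty, BBB, and confidence filters in order."""
    kept = [c for c in candidates if passes_ro5(c)]
    kept = [c for c in kept
            if c.scaffold not in train_scaffolds
            and c.max_train_similarity <= max_similarity]
    kept = [c for c in kept if c.bbb_probability >= bbb_threshold]
    kept.sort(key=lambda c: c.activity_confidence, reverse=True)
    return kept[:int(len(kept) * top_fraction)]
```

Calling `post_process` with `top_fraction` set to 0.5, 0.4, 0.3, and 0.2 would mirror the four confidence cutoffs listed above.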

  13. 13
    Grouping and Ranking Analysis
    5) Scaffold Grouping & Ranking
    ● Group molecules sharing the same scaffold
    ● Rank scaffolds by a “desirability score”: (QED + pIC50)/2
    6) Combine Multiple Runs
    ● Combine the top 20 scaffolds from each of the 10 sampling runs
    7) Most Consistent Suggestions
    ● Rank the final list by number of occurrences
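
Steps 5 to 7 can be sketched as grouping by scaffold, ranking by the desirability score, and counting how often each scaffold reaches the top list across runs. The slides do not say how molecule-level scores are aggregated per scaffold; the mean over the group is assumed here, and the tuple layout is hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def top_scaffolds(molecules, k=20):
    """Rank scaffolds in one run by mean desirability, (QED + pIC50) / 2.

    molecules: iterable of (scaffold, qed, pic50) tuples.
    """
    groups = defaultdict(list)
    for scaffold, qed, pic50 in molecules:
        groups[scaffold].append((qed + pic50) / 2)
    ranked = sorted(groups, key=lambda s: mean(groups[s]), reverse=True)
    return ranked[:k]


def most_consistent(runs, k=20):
    """Count how many runs place each scaffold in their top-k list."""
    counts = Counter()
    for molecules in runs:
        counts.update(top_scaffolds(molecules, k))
    return counts.most_common()


# Made-up data for two runs, for illustration only.
run_1 = [("A", 0.8, 8.0), ("A", 0.6, 7.0), ("B", 0.5, 6.0)]
run_2 = [("A", 0.7, 7.5), ("C", 0.9, 9.0)]
consistency = most_consistent([run_1, run_2], k=2)
```

Ranking by occurrence count favors scaffolds that the generator proposes repeatedly, which is the sense of "most consistent suggestions" above.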

  14. 14
    Results

  15. 15
    Top 50% by activity prediction confidence
    Legend: frequency of generation among 10 runs

  16. 16
    Top 40% by activity prediction confidence
    Legend: frequency of generation among 10 runs

  17. 17
    Top 30% by activity prediction confidence
    Legend: frequency of generation among 10 runs

  18. 18
    Top 20% by activity prediction confidence
    Legend: frequency of generation among 10 runs

  19. 19
    Generated Results with Donepezil Scaffold
    [Figure: donepezil scaffold, generated molecule (Compound 14 from the original donepezil paper [1]), and donepezil]
    [1] Sugimoto H. et al. Jpn. J. Pharmacol. 89, 7–20 (2002)

  20. 20
    Summary & Discussion [1]
    ● The Elix Discovery™ Platform was used to discover novel scaffolds (distant from the training set)
    ● 10 runs were performed, generating ~30K molecules in each run
    ● Molecules in each run were filtered to a short list of 20 scaffolds
    ● The donepezil scaffold consistently ranked among the top 20 scaffolds
    ● The donepezil scaffold was represented by a molecule originally described as one of the intermediate molecules (Compound 14) that led to the discovery of donepezil [1]
    [1] Sugimoto H. et al. Jpn. J. Pharmacol. 89, 7–20 (2002)

  21. 21
    Summary & Discussion [2]
    ● Observations:
    ○ Diversity in scaffolds: many scaffolds were represented by very few molecules.
    ○ Generated molecules were mostly predicted to be BBB permeable, without explicit optimization for this parameter.
    ○ Activity prediction models struggled when predicting in chemical space too distant from the training set.
    ○ Filtering by prediction confidence helped to focus on molecules with more confident predicted IC50 values.

  22. Elix, Inc.
    http://ja.elix-inc.com/