known drug (“target molecule”) with detailed description of the discovery process
• Collect publicly available data on the protein target of the target molecule
• Filter the training set to exclude molecules similar to the target molecule
• Train predictive and generative models
• Observe whether scaffold rediscovery is successful

Target Molecule Selection Criteria:
• NOT a kinase inhibitor
• Well-documented drug design process
• Diverse dataset (not focused on derivatives of a single moiety)
containing compounds from pre-training set
• Exclusion of donepezil scaffold and relevant molecules from training set

Dataset Curation
• Pre-training set:
  ◦ ChEMBL data
  ◦ Objective: Learn SMILES vocabulary
• Training set:
  ◦ AChE inhibitors from ChEMBL database

Predictive Model Training
• Single model for activity prediction in generation step
• 10-model ensemble for activity prediction in post-processing
• BBB permeability model ensemble of 5 models:
  ◦ Trained on a curated dataset of 9059 samples

Generative Model Training
• In-house developed SmilesFormer model
• Pre-trained on the cleaned ChEMBL data
• Fine-tuned on the cleaned activity data

Generation and Data Analysis
• Generate 30K molecules/run
• Phys-chem filters
• MCF filters
• Novelty filters
• BBB permeability prediction confidence filter
• Activity prediction confidence filter
• Scaffold grouping and rankings
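The 10-model ensemble listed under Predictive Model Training could be combined roughly as sketched below. This is an illustration only: the slides do not say how the ensemble outputs are aggregated or how prediction confidence is derived, so the use of the mean as the prediction and the standard deviation as an uncertainty proxy are assumptions, as are the function name `ensemble_predict` and the representation of models as plain callables.

```python
from statistics import mean, pstdev


def ensemble_predict(models, features):
    """Return (mean prediction, spread) over an ensemble of models.

    `models` is a list of callables mapping features to a predicted pIC50.
    Using the population standard deviation as an uncertainty proxy is an
    assumption; the slides do not define how confidence is computed.
    """
    predictions = [model(features) for model in models]
    return mean(predictions), pstdev(predictions)


# Toy stand-ins for trained models (hypothetical, for illustration only).
toy_models = [lambda x: 6.0, lambda x: 7.0, lambda x: 8.0]
prediction, spread = ensemble_predict(toy_models, features=None)
```

A small spread means the models agree, which is one plausible reading of "high prediction confidence" in the later filtering steps.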
[Figure: training-set scaffolds, each labeled with the number of molecules containing the structure]
Extract unique scaffolds from training set
Search training set for substructure matches to each scaffold
QED score
• Favorable physical-chemical properties
• Novelty (distance from the training set)
• Activity

Generative Score:
• Average of the normalized single scores (SA, QED, phys-chem, novelty, activity) was computed for each generated molecule
• Molecules with the highest “generative score” were prioritized during the generation process
• Up to 30K molecules with the highest scores were generated in each sampling run
• 10 sampling runs were performed in total
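The generative score described above (an average of normalized single scores) can be sketched as follows. The normalization scheme is not given on the slide, so min-max scaling with clipping is an assumption, as are the value ranges in `SCORE_RANGES`, the key names, and the premise that every component is oriented so that higher is better.

```python
from statistics import mean

# Assumed value ranges per component (hypothetical; not from the slides).
SCORE_RANGES = {
    "sa": (0.0, 1.0),
    "qed": (0.0, 1.0),
    "phys_chem": (0.0, 1.0),
    "novelty": (0.0, 1.0),
    "activity": (4.0, 10.0),  # assumed pIC50 range
}


def min_max_normalize(value, low, high):
    """Scale value into [0, 1], clipping anything outside [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def generative_score(raw_scores):
    """Average the normalized single scores for one molecule."""
    return mean(
        min_max_normalize(raw_scores[name], low, high)
        for name, (low, high) in SCORE_RANGES.items()
    )


def top_k(molecules, k):
    """Keep the k molecules with the highest generative score."""
    ranked = sorted(molecules, key=lambda m: generative_score(m["scores"]), reverse=True)
    return ranked[:k]
```

With equal weights, no single property dominates; a weighted mean would be a natural variation if some properties matter more.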
[Diagram: 10 sampling runs (Run 1 to Run 10), 30K molecules each; filtering steps reduce each run to its top 20 scaffolds; the combined scaffolds from all runs yield the 20 most frequent scaffolds]
• Allowed common atoms
• Ring size (up to 8)
• Medicinal chemistry filters (189 filters)

2) Novelty
• Avoid building upon known scaffolds (tacrine and physostigmine)
• Remove molecules with exact scaffold match to the training set
• Remove molecules with > 0.5 Tanimoto similarity score

3) BBB Permeability
• Choose molecules based on BBB permeability prediction probability threshold
• Value used:
  ◦ 0.99

4) Activity Prediction Confidence
• Choose top n percent of the molecules based on pIC50 prediction confidence
• Values used:
  ◦ 50%
  ◦ 40%
  ◦ 30%
  ◦ 20%
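Steps 2 to 4 can be sketched as a small filter cascade. This is a simplified illustration, not the authors' implementation: fingerprints are represented as plain sets of on-bit indices (the fingerprint type is not stated on the slide), the dictionary keys `fp`, `bbb_prob`, and `confidence` are hypothetical, and the exact-scaffold-match and known-scaffold checks are omitted.

```python
from math import ceil


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def is_novel(fp, training_fps, max_similarity=0.5):
    """True if no training molecule exceeds the similarity threshold."""
    return all(tanimoto(fp, t) <= max_similarity for t in training_fps)


def keep_top_percent(molecules, percent):
    """Keep the top `percent` of molecules by prediction confidence."""
    ranked = sorted(molecules, key=lambda m: m["confidence"], reverse=True)
    return ranked[: ceil(len(ranked) * percent / 100)]


def apply_filters(molecules, training_fps, bbb_threshold=0.99, top_percent=50):
    """Novelty filter, then BBB probability threshold, then confidence cut."""
    novel = [m for m in molecules if is_novel(m["fp"], training_fps)]
    permeable = [m for m in novel if m["bbb_prob"] >= bbb_threshold]
    return keep_top_percent(permeable, top_percent)
```

Lowering `top_percent` from 50 to 20 trades the number of surviving molecules for higher confidence in the predicted activity of those that remain.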
• Group molecules sharing the same scaffold
• Rank scaffolds by a “desirability score”:
  ◦ (QED + pIC50)/2

6) Combine Multiple Runs
• Combine top 20 scaffolds from each of 10 sampling runs

7) Most Consistent Suggestions
• Rank final list by number of occurrences
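The grouping, ranking, and combination steps above can be sketched as follows. The slide gives the desirability score as (QED + pIC50)/2 but does not say how molecule-level values are aggregated per scaffold, so taking the mean over each scaffold group is an assumption; the keys `scaffold`, `qed`, and `pic50` are hypothetical names.

```python
from collections import Counter, defaultdict
from statistics import mean


def rank_scaffolds(molecules, top_n=20):
    """Group molecules by scaffold and rank scaffolds by desirability.

    Desirability per molecule is (QED + pIC50) / 2, as on the slide;
    averaging it over the scaffold group is an assumption.
    """
    groups = defaultdict(list)
    for mol in molecules:
        groups[mol["scaffold"]].append((mol["qed"] + mol["pic50"]) / 2)
    scored = {scaffold: mean(values) for scaffold, values in groups.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


def most_consistent(runs, top_n=20):
    """Combine the top scaffolds of every run and rank by occurrence count."""
    counts = Counter()
    for molecules in runs:
        counts.update(rank_scaffolds(molecules, top_n))
    return counts.most_common()
```

A scaffold that reaches the top list in many independent runs is less likely to be a sampling artifact than one that appears only once, which is the rationale for ranking by occurrences.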
used to discover novel scaffolds (distant from the training set)
• In each of the 10 runs, ~30K molecules were generated
• Molecules in each run were filtered to a short list of 20 scaffolds
• Donepezil scaffold consistently ranked among the top 20 scaffolds
• Donepezil scaffold was represented by a molecule originally described as one of the intermediate molecules (Compound 14) that led to the discovery of donepezil [1]

[1] Sugimoto H. et al. Jpn. J. Pharmacol. 89, 7–20 (2002)
scaffolds: many scaffolds were represented by very few molecules.
  ◦ Generated molecules were mostly predicted to be BBB permeable, without explicit optimization for this parameter.
  ◦ Activity prediction models struggled when predicting on chemical space too distant from the training set.
  ◦ Filtering by prediction confidence helped focus on molecules with higher confidence in their predicted IC50 values.