Elix
October 30, 2024

Open Molecule Generator: A Multipurpose Molecule LLM, Elix, CBI2024


Transcript

  1. Slide 2: Introduction

    Goal: Train a Foundation Model on contextualized molecule information that can be fine-tuned for diverse drug-discovery downstream tasks. A Foundation Model is a type of large-scale neural network trained on vast and diverse datasets, designed to be general-purpose and adaptable to a wide range of tasks through fine-tuning.

    Steps:
    1. Train an LLM with a competitive architecture (the Foundation Model).
    2. Increase functionality through LoRA fine-tuning (Foundation Model → Downstream Tasks).
  2. Slide 3: Dataset: MolPajama V2

    MolPajama V2 is built mainly from ChEMBL, with:
    • molecules and a compilation of their respective physico-chemical properties
    • semantic structure description
    • known activities description
    • known protein targets

    Each sample is then formatted in a human-readable form. To anticipate downstream tasks that might require a more complex understanding of the data, 3 more types of samples were added to the corpus, carrying ADMET, Matched Molecular Pairs, and Retrosynthetic information.
  3. Slide 4: Model Training

    • 4 Billion Parameter
    • LLaMA Architecture
    • > 65B Tokens

    [Figure: order of LLM generation. A window slides over the tokenized sample ("molecular weight 328, logD ..., smiles COC1..."): the input tokens form the context and the next token is the output.]
  4. Slide 5: Pretraining Results: Constrained Molecule Generation Task

    Evaluating the model:
    - Cross-entropy loss (not too informative).

    However, the pretrained model can be used for a Constrained Molecule Generation task:
    1. Supply a specific set of physico-chemical properties to the model (from the test set).
    2. Generate a molecule that should satisfy the conditions.
    3. Evaluate how well the conditions are met.

    [Figure: order of LLM generation. Input context: set of physico-chemical properties. Generation: molecule.]
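The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the authors' evaluation code: `build_prompt`, `absolute_errors`, the prompt layout, and the `measured` values are all hypothetical, and the model call in step 2 is omitted. In practice the measured properties would be computed from the generated molecule with a cheminformatics toolkit.

```python
def build_prompt(properties):
    """Serialize target properties as 'name: value' pairs, ending where the SMILES should start.

    The exact prompt layout is an assumption modeled on the corpus sample in this deck.
    """
    context = ", ".join(f"{name}: {value}" for name, value in properties.items())
    return f"{context}, SMILES: <smiles>"


def absolute_errors(target, measured):
    """Per-property absolute error between requested and measured values."""
    return {name: abs(target[name] - measured[name]) for name in target}


# Step 1: a set of target properties (values borrowed from the corpus sample in this deck).
target = {"logP": 3.06, "num of rings": 2, "num rotatable bonds": 8}
prompt = build_prompt(target)

# Step 2 would pass `prompt` to the model and read back a SMILES string (not shown).

# Step 3: compare the request with properties measured on the generated molecule.
# These measured values are invented purely for illustration.
measured = {"logP": 3.20, "num of rings": 2, "num rotatable bonds": 7}
errors = absolute_errors(target, measured)
mean_error = sum(errors.values()) / len(errors)

print(prompt)
print(errors)
```

Averaging such errors over many test-set property sets would give one number per constraint, which is one plausible way to read "how well conditions are met".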
  5. Slide 9: Substructure-based Generation

    Overall Task: Generate a molecule with desired (fixed) substructures and a variable substructure based on a description.

    • 1 ether oxygens (including phenoxy)
    • 1 benzene rings

    [Figure: ground truth example.]
  6. Slide 10: Substructure-based Generation Samples

    Overall Task: Generate a molecule with desired (fixed) substructures and a variable substructure based on a description.

    • 1 Tertiary amines
    • 1 halogens, Bicyclic
    • 1 aromatic nitrogens
    • 1 benzene rings
    • 1 pyridine rings

    [Figure: Generated, Ground Truth, and Reference molecules.]
  7. Slide 11: Substructure-based Generation Samples

    Task: Create a molecule with a single provided substructure.

    • 1 ether oxygens (including phenoxy)
    • 1 halogens
    • 1 methoxy groups -OCH3
    • 1 benzene rings

    [Figure: Generated, Ground Truth, and Reference molecules.]
  8. Slide 12: Substructure-based Generation Samples

    Task: Find a linker from a description to join 2 substructures. Possible use contexts: PROTAC.

    • 2 amides
    • 1 halogens
    • 1 Primary amines
    • 1 Secondary amines
    • 1 primary amides
    • 2 Tertiary amines
    • 1 para-hydroxylation sites
    • 2 carbonyl O
    • 2 aromatic nitrogens
    • 1 benzene rings

    [Figure: Generated, Ground Truth, and Reference molecules.]
  9. Slide 13: Substructure-based Generation Samples

    Task: Generate a molecule with a desired substructure based on a verbal description of the substructure.

    • 2 ether oxygens (including phenoxy)
    • 2 methoxy groups -OCH3
    • 1 benzene rings

    [Figure: Generated, Ground Truth, and Reference molecules.]
  10. Slide 15: Conclusions

    • OMGeE was tested without fine-tuning for constrained molecule generation, achieving high performance with small error across 26 constraints.
    • OMGeE was successfully fine-tuned for substructure-based generation, showing adaptability to downstream tasks not originally present in the pre-training.
    • This study opens the door to further leveraging OMGeE for other downstream applications, such as:
      ◦ One-step Retrosynthesis
      ◦ Pharmacophore-based generation
      ◦ Matched-Molecular-Pairs-based optimization
      ◦ etc.
    • OMGeE weights and architecture are open-sourced for researchers to further extend the functionality.
    • OMGeE and downstream functionality are coming soon to Elix Discovery™.
  11. Slide 17: Downstream Tasks

    Once the Foundation Model is trained, it can be leveraged for downstream tasks using LoRA. It can be fine-tuned in a small fraction of the time, using a fraction of the resources.
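To make the resource argument concrete, here is a minimal pure-Python sketch of the LoRA idea: the pretrained weight matrix W stays frozen, and only a low-rank update B·A (scaled by alpha / rank) is trained. The layer size, rank, and alpha below are illustrative assumptions, not the settings used for OMGeE, and the helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, leaving the frozen W untouched."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes only (assumptions, not the model's real dimensions).
d_out, d_in, rank, alpha = 64, 64, 2, 4

random.seed(0)
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen weights
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                                 # trainable, zero init

# With B initialized to zero, the adapted layer starts out identical to the pretrained one.
w_eff = lora_effective_weight(w, a, b, alpha, rank)

full_params = d_out * d_in           # parameters touched by full fine-tuning of this layer
lora_params = rank * (d_in + d_out)  # parameters trained by LoRA for this layer
print(full_params, lora_params)
```

For this toy layer, LoRA trains 256 parameters instead of 4096; the gap widens as the layer grows while the rank stays small.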
  12. Slide 18: ChEMBL Corpus

    A ChEMBL corpus sample can be divided into 5 parts:
    1. Physico-chemical properties: logP, molecular weight, QED, number of aromatic rings, number of valence electrons, Labute approximate surface area, etc.
    2. Summary of substructures: number of epoxide rings, number of esters, number of ether oxygens, etc.
    3. Similar molecules: Tanimoto similarity > 0.8 (if available)
    4. Activities and targets: activity type, pChEMBLvalue, value, target, protein sequence, description of the target (if available)
    5. Molecule: depicted with multiple canonical and non-canonical SMILES representations

    Actual Sample:
    hydrogen bond donnors: 0, polar surface area: 61.83, num of radical electrons: 0, most Acid dissociation constants (pKa): 13.58, num of aliphatic heterocycles: 0, num of N or O (Nitrogens and Oxygens): 5, hydrogen bond acceptors: 5, num of aliphatic rings: 0, labute approximate surface area (LabuteASA): 140.71, logD: 3.12, num saturated rings: 0, num of aliphatic carbocycles: 0, molecule type: Small molecule, rule of 3: fail, num heavy atoms: 24, num of rings: 2, num heteroatoms: 5, molecular formula: C19H20O5, natural product likeness score: -0.54, full molecular weight: 328.36, molecular species: NEUTRAL, logP: 3.06, num aromatic heterocycles: 0, molecular weight monoisotopic: 328.1311, standard international chemical identifier (InChI): InChI=1S/C19H20O5/c1-22-16-9-6-14(7-10-16)8-11-19(21)24-13-18(20)15-4-3-5-17(12-15)23-2/h3-7,9-10,12H,8,11,13H2,1-2H3, num aromatic rings: 2, quantitative estimate of drug-likeness (qed): 0.55, full molecular formula: C19H20O5, num lipinski rule of 5 (ro5) violations: 0, num saturated carbocycles: 0, fraction of SP3 hybridized C atoms: 0.26, num aromatic carbocycles: 2, molecular weight freebase: 328.36, num rotatable bonds: 8, num of NHs or OH: 0, Balaban’s J value (BalabanJ): 1.78, fragments: 2 benzene rings, 2 carbonyl O, excluding COOH, 1 ketones excluding diaryl, a,b-unsat. dienones, heteroatom on Calpha, 2 carbonyl O, 1 ketones, 1 esters, 1 aryl methyl sites for hydroxylation, 3 ether oxygens (including phenoxy), 2 methoxy groups -OCH3, num of valence electrons: 126, activities: activity type: Potency, pChEMBLvalue: 4.9, Potency=12589.3nM, target: <sequence>MSQEGDYGRWTISSSDESEEEKPKPDKPSTSSLLCARQGAANEPRYTCSEAQKAAHKRKISPVKFSNTDSVLPPKRQKSGSQEDLGWCLSSSDDELQPEMPQKQAEKVVIKKEKDISAPNDGTAQRTENHGAPACHRLKEEEDEYETSGEGQDIWDMLDKGNPFQFYLTRVSGVKPKYNSGALHIKDILSPLFGTLVSSAQFNYCFDVDWLVKQYPPEFRKKPILLVHGDKREAKAHLHAQAKPYENISLCQAKLDIAFGTHHTKMMLLLYEEGLRVVIHTSNLIHADWHQKTQGIWLSPLYPRIADGTHKSGESPTHFKADLISYLMAYNAPSLKEWIDVIHKHDLSETNVYLIGSTPGRFQGSQKDNWGHFRLKKLLKDHASSMPNAESWPVVGQFSSVGSLGADESKWLCSEFKESMLTLGKESKTPGKSSVPLYLIYPSVENVRTSLEGYPAGGSLPYSIQTAEKQNWLHSYFHKWSAETSGRSNAMPHIKTYMRPSPDFSKIAWFLVTSANLSKAAWGALEKNGTQLMIRSYELGVLFLPSAFGLDSFKVKQKFFAGSQEPMATFPVPYDLPPELYGSKDRPWIWNIPYVKAPDTHGNMWVPS</sequence>, description: Tyrosyl-DNA phosphodiesterase 1, SMILES: <smiles>COC1=CC=C(CCC(=O)OCC(=O)C2=CC=CC(OC)=C2)C=C1</smiles>, <smiles>COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1</smiles>, <smiles>COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1</smiles>
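The human-readable layout of the sample above (comma-separated `name: value` fields followed by tagged SMILES variants) could be produced by a small formatter along these lines. This is a sketch under assumptions: `format_sample` is a hypothetical helper, only three of the many properties are shown, and the real corpus builder may order or group fields differently.

```python
def format_sample(properties, smiles_variants):
    """Join 'name: value' fields, then append every SMILES variant wrapped in <smiles> tags."""
    fields = [f"{name}: {value}" for name, value in properties.items()]
    tagged = ", ".join(f"<smiles>{smiles}</smiles>" for smiles in smiles_variants)
    return ", ".join(fields) + f", SMILES: {tagged}"


# A few fields copied from the sample in this deck.
properties = {
    "logD": 3.12,
    "num of rings": 2,
    "molecular formula": "C19H20O5",
}
# Two spellings of the same molecule: Kekulé form and aromatic form.
smiles_variants = [
    "COC1=CC=C(CCC(=O)OCC(=O)C2=CC=CC(OC)=C2)C=C1",
    "COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1",
]

sample = format_sample(properties, smiles_variants)
print(sample)
```

Placing the SMILES last means that, during left-to-right generation, every property already sits in the context when the molecule is written, which matches the constrained-generation setup described earlier in the deck.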
  13. Slide 19: ADMET Matched Molecular Pairs (MMP) Corpus

    • A dataset of molecule pairs and their associated ADMET properties
    • Constructed using the Therapeutics Data Commons (TDC) library *
    • Contains 22 ADMET metrics:
      • Caco-2 (Cell Effective Permeability)
      • Absorption PAMPA Permeability
      • Distribution BBB (Blood-Brain Barrier)
      • Metabolism CYP P450 2C19 Inhibition
      • Excretion Clearance Hepatocyte, etc.
    • ~4.2M samples

    [Figure: molecules are matched into pairs by Murcko scaffold (source: Distribution BBB: 1, Caco-2: -4.84; target: Distribution BBB: 1, Caco-2: -5.79), then a transformation path is created with the change in ADMET properties (Distribution BBB: no_change, Caco-2: high→low).]

    Actual Sample:
    MMP transformation: <smiles>CCCCNc1ccc(NCCCC)c2c1C(=O)c1ccccc1C2=O</smiles> -> <smiles>Cc1ccc2c(c1)C(=O)c1ccccc1C2=O</smiles>, Delta absorption Solubility: high->low

    * Velez-Arce A, Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. Therapeutics Data Commons - PyTDC. https://pypi.org/project/PyTDC/. Zitnik Lab, Harvard University.
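The categorical change labels in this corpus (`no_change`, `high->low`) can be illustrated with a short sketch. The deck does not say how "high" and "low" are defined, so the thresholds below are hypothetical placeholders, as are the helper names; only the input values and the resulting labels come from the slide.

```python
def level(value, threshold):
    """Classify a property value as 'high' or 'low' relative to a threshold."""
    return "high" if value > threshold else "low"


def category_change(source, target, threshold):
    """Describe how a property's level changes from the source to the target molecule."""
    src, tgt = level(source, threshold), level(target, threshold)
    return "no_change" if src == tgt else f"{src}->{tgt}"


# Hypothetical thresholds, chosen only so the example reproduces the labels on the slide.
CACO2_THRESHOLD = -5.0
BBB_THRESHOLD = 0.5

caco2_change = category_change(-4.84, -5.79, CACO2_THRESHOLD)
bbb_change = category_change(1, 1, BBB_THRESHOLD)
print(f"Distribution BBB: {bbb_change}, Caco-2: {caco2_change}")
```

Reducing each property to a coarse label keeps the sample text short and lets a prompt request a direction of change without specifying an exact target value.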
  14. Slide 20: LogD, Solubility and CLint Matched Molecular Pairs (MMP) Corpus

    • A dataset of molecule pairs and their associated LogD, Solubility and CLint properties
    • Based on the dataset in the paper "Transformer-based molecular optimization beyond matched molecular pairs" *
    • Couples molecules based on:
      • single transformation differences
      • Tanimoto similarity
      • Murcko scaffolds
    • ~7.1M samples

    [Figure: molecules are matched into pairs by Murcko scaffold (source: logD: 2.23, Sol: 2.06, CLint: 1.11; target: logD: 2.39, Sol: 1.65, CLint: 1.18), then a transformation path is created with the change in ADMET properties (logD: (0.0, 0.2], Sol: high->low, CLint: no_change).]

    Actual Sample:
    MMP transformation: <smiles>Cc1nc2ccccc2c(=O)n1-c1ccc(OC2CCN(C3CCCC3)CC2)cc1</smiles> -> <smiles>Cc1nc2ccncc2c(=O)n1-c1ccc(OC2CCN(C3CCC3)CC2)cc1</smiles>, Delta logD: (-1.1, -0.9], Delta solubility: no_change, Delta Clint: no_change, source molecule logD: 2.17, target molecule logD: 1.11, source molecule solubility: 2.58, target molecule solubility: 2.91, source molecule Clint: 0.92, target molecule Clint: 0.86

    * He J, Nittinger E, Tyrchan C, Czechtizky W, Patronov A, Bjerrum EJ, Engkvist O. Transformer-based molecular optimization beyond matched molecular pairs. J Cheminform. 2022 Mar 28;14(1):18. doi: 10.1186/s13321-022-00599-3. PMID: 35346368; PMCID: PMC8962145.
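The interval labels for the logD change, such as `(-1.1, -0.9]`, look like fixed-width half-open bins. The sketch below shows one way to compute them. The bin width of 0.2 is read off the slide. The `offset` parameter is an assumption, introduced because the figure uses edges at multiples of 0.2 while the actual sample uses edges shifted by 0.1. `delta_interval` is a hypothetical helper.

```python
import math


def delta_interval(delta, width=0.2, offset=0.1):
    """Return the half-open bin (low, high] of the given width that contains delta.

    Bin edges sit at offset + k * width for integer k.
    """
    k = math.ceil((delta - offset) / width) - 1
    low = offset + k * width
    high = offset + (k + 1) * width
    return f"({low:.1f}, {high:.1f}]"


# Actual sample: source logD 2.17, target logD 1.11.
sample_label = delta_interval(1.11 - 2.17)

# Figure example: source logD 2.23, target logD 2.39, with edges at multiples of 0.2.
figure_label = delta_interval(2.39 - 2.23, offset=0.0)

print(sample_label)
print(figure_label)
```

Binning turns a continuous property change into a small vocabulary of labels, which a language model can condition on more easily than on raw floating-point differences.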
  15. Slide 21: Retrosynthesis Corpus

    • The retrosynthesis task is decomposed into 1-step retrosynthesis subtasks
    • Based on the USPTO-50K reactions data
    • Two types of samples are generated:
      a. relating the product to the reactants, with a human-readable reaction-class name
      b. relating the product to the template; has retrosynthesis template, forward reaction, retrosynthesis transformation
    • ~0.85M samples

    Actual Sample 1 (product -> reactants):
    Retrosynthesis: class: Heterocycle Formation, retrosynthesis transformation: <smiles>O=C1CN(c2ccc(/C=C/c3cccnc3)cc2OCc2ccccc2)S(=O)(=O)N1</smiles> -> <smiles>O=C1CN(c2ccc(I)cc2OCc2ccccc2)S(=O)(=O)N1.C=Cc1cccnc1</smiles>

    Actual Sample 2:
    Retrosynthesis USPTO: main product: <smiles>COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)Nc1nc2ccc(OC(F)(F)F)cc2s1</smiles>, retrosynthesis template: <smarts>[#16;a:4]:[c:5](:[#7;a:6])-[NH;D2;+0:7]-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3]>>O-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3].[#16;a:4]:[c:5](:[#7;a:6])-[NH2;D1;+0:7]</smarts>, forward reaction: <smarts>O-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3].[#16;a:4]:[c:5](:[#7;a:6])-[NH2;D1;+0:7]>>[#16;a:4]:[c:5](:[#7;a:6])-[NH;D2;+0:7]-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3]</smarts>, retrosynthesis transformation: <smiles>COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)Nc1nc2ccc(OC(F)(F)F)cc2s1</smiles> -> <smiles>COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)O.Nc1nc2ccc(OC(F)(F)F)cc2s1</smiles>
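Because the samples use explicit `<smiles>` tags, a generated retrosynthesis step can be read back with a few lines of standard-library Python. The sketch below is a hypothetical parser (`parse_transformation` is not part of any released tooling) applied to Actual Sample 1. It relies on the SMILES convention that `.` separates disconnected molecules, so splitting on it yields the individual reactants.

```python
import re

SMILES_TAG = re.compile(r"<smiles>(.*?)</smiles>")


def parse_transformation(sample):
    """Extract the product and the list of reactants from a retrosynthesis sample."""
    # Only look at the text after the transformation label, so that any earlier
    # tagged SMILES (for example a 'main product' field) is ignored.
    _, _, transformation = sample.partition("retrosynthesis transformation:")
    product, reactants = SMILES_TAG.findall(transformation)
    return product, reactants.split(".")


sample = (
    "Retrosynthesis: class: Heterocycle Formation, retrosynthesis transformation: "
    "<smiles>O=C1CN(c2ccc(/C=C/c3cccnc3)cc2OCc2ccccc2)S(=O)(=O)N1</smiles> -> "
    "<smiles>O=C1CN(c2ccc(I)cc2OCc2ccccc2)S(=O)(=O)N1.C=Cc1cccnc1</smiles>"
)

product, reactants = parse_transformation(sample)
print(product)
print(reactants)
```

A multi-step route could then be explored by feeding each reactant back in as the next product, which is what decomposing the task into 1-step subtasks makes possible.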
  16. Slide 22: Results: Tanimoto Similarity

    • Additionally, different sampling temperatures were tried: 0.1, 0.3, 0.5, and 0.8 (smaller → more peaked distribution, larger → flatter distribution).
    • The slide plots the Tanimoto similarity between the property template molecule and the generated molecule.

    The generation success rate, i.e. the rate of valid molecules generated, was 0.92, 0.93, 0.92, and 0.88 for temperatures of 0.1, 0.3, 0.5, and 0.8 respectively.
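Both quantities on this slide have compact definitions, sketched below with the standard library only. Tanimoto similarity is computed here on sets of "on" fingerprint bits (intersection size over union size), and temperature is applied by dividing the logits before the softmax. The bit sets and logits are made-up toy values, and a real pipeline would derive fingerprints from molecules with a cheminformatics toolkit.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / len(union) if union else 0.0


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy fingerprints: 3 shared bits out of 6 distinct bits.
similarity = tanimoto({1, 4, 7, 9}, {1, 4, 8, 9, 12})
print(similarity)

# Toy next-token logits at the temperatures tried on the slide.
logits = [2.0, 1.0, 0.0]
top_probability = {t: softmax_with_temperature(logits, t)[0] for t in (0.1, 0.3, 0.5, 0.8)}
print(top_probability)
```

The probability of the most likely token shrinks as the temperature rises, which is consistent with the slightly lower validity rate reported at 0.8: sampling strays more often from the highest-probability SMILES tokens.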
  17. Slide 23: Preparing a Dataset: Splitting the Molecule

    Starting from a molecule, cuts (✄) split it into fragments. Each fragment has a minimum of 4 heavy atoms, and a maximum of 3 fragments are generated.
  18. Slide 24: Preparing a Dataset: Sticking Together the Pieces

    Sample:
    <s>Fragment decomposition: Fragment 1: <smarts>[1*]c1cccc(OC)c1OC</smarts>, Fragment 2: <smarts>[1*]C(=O)O[2*]</smarts>, Fragment 3: 1 ketones, 1 halogens, 1 carbonyl O, 1 ketones excluding diaryl, a,b-unsat. dienones, heteroatom on Calpha, 1 thiophene rings; w connections: [2]; SMILES: <smiles>COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC</smiles>, <smiles>COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC</smiles>, <smiles>COC1=CC=CC(C(=O)OCC(=O)C2=CC=C(Br)S2)=C1OC</smiles>

    [Figure labels: Input, Output.]
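The input half of such a fine-tuning sample can be assembled from its pieces as sketched below. This is an illustration under assumptions: `build_fragment_prompt` is a hypothetical helper, and it assumes that the fixed fragments come first as tagged SMARTS, that the variable fragment comes last as a plain-text description, and that the model's output begins right after `SMILES: `. The field order is copied from the sample, but the exact split between input and output is not spelled out in the transcript.

```python
def build_fragment_prompt(fixed_fragments, variable_description, connections):
    """Assemble the prompt: tagged SMARTS for fixed fragments, text for the variable one."""
    parts = [
        f"Fragment {index}: <smarts>{smarts}</smarts>"
        for index, smarts in enumerate(fixed_fragments, start=1)
    ]
    parts.append(f"Fragment {len(fixed_fragments) + 1}: {variable_description}")
    return (
        "<s>Fragment decomposition: "
        + ", ".join(parts)
        + f"; w connections: {connections}; SMILES: "
    )


# Pieces copied from the sample on this slide.
fixed_fragments = ["[1*]c1cccc(OC)c1OC", "[1*]C(=O)O[2*]"]
variable_description = (
    "1 ketones, 1 halogens, 1 carbonyl O, 1 ketones excluding diaryl, "
    "a,b-unsat. dienones, heteroatom on Calpha, 1 thiophene rings"
)

prompt = build_fragment_prompt(fixed_fragments, variable_description, [2])
print(prompt)
```

The numbered attachment points (`[1*]`, `[2*]`) in the SMARTS tell the model where fragments join, so the generated SMILES only has to realize the described fragment and connect it at the listed position.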