Slide 1

Slide 1 text

Open Molecule Generator by Elix (OMGeE): A Multipurpose Molecule LLM
Elix, Inc.
David Jimenez Barrero

Slide 2

Slide 2 text

Introduction
Goal: Train a foundation model on contextualized molecule information that can be fine-tuned for diverse downstream drug-discovery tasks.
A foundation model is a large-scale neural network trained on vast and diverse datasets, designed to be general-purpose and adaptable to a wide range of tasks through fine-tuning.
Steps:
1. Train an LLM with a competitive architecture (the foundation model).
2. Extend its functionality through LoRA fine-tuning (foundation model → downstream tasks).

Slide 3

Slide 3 text

Dataset: MolPajama V2
MolPajama V2 is built mainly from ChEMBL and contains:
• molecules and a compilation of their respective physico-chemical properties
• semantic structure descriptions
• descriptions of known activities
• known protein targets
Each sample is then formatted in a human-readable form.
To anticipate downstream tasks that may require a more complex understanding of the data, 3 more types of samples were added to the corpus: ADMET, Matched Molecular Pairs, and Retrosynthetic information.

Slide 4

Slide 4 text

Model Training
• 4-billion-parameter LLaMA architecture
• > 65B tokens
[Figure: order of LLM generation. The input tokens (e.g. "... molecular weight 3 2 8 , logD 3 . 1 smiles COC1 ...") form a sliding context, and at each step the model outputs the next token.]
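The generation order in the figure can be mimicked in a few lines. This is a minimal sketch, assuming a hypothetical token list and context size (the slide does not state the actual OMGeE tokenizer or context length); `next_token_pairs` is an illustrative helper, not part of any released code.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs for next-token prediction.

    For every position i > 0, the context is the (at most) `context_size`
    tokens that precede token i, and the target is token i itself.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Hypothetical tokenization of a fragment of a corpus sample.
tokens = ["molecular", "weight", "3", "2", "8", ",",
          "logD", "3", ".", "1", "smiles", "COC1"]

# Show the last four steps, mirroring the sliding window on the slide.
for context, target in next_token_pairs(tokens, context_size=7)[-4:]:
    print(" ".join(context), "=>", target)
```

Because the properties precede the SMILES in each sample, the same next-token objective lets the model condition the molecule on the properties that come before it.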

Slide 5

Slide 5 text

Pretraining Results: Constrained Molecule Generation Task
Evaluating the model by its cross-entropy loss alone is not very informative. However, the pretrained model can be used directly for a constrained molecule generation task:
1. Supply a specific set of physico-chemical properties to the model (taken from the test set).
2. Generate a molecule that should satisfy those conditions.
3. Evaluate how well the conditions are met.
[Figure: the set of physico-chemical properties is the input context and the molecule is the generation, following the order of LLM generation.]
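Step 3 can be sketched as a per-property error between the requested values and the values measured on the generated molecule. This is only an illustration under stated assumptions: the function names are hypothetical, the "measured" numbers are made up for the example (they are not results), and in practice the measured properties would come from a cheminformatics toolkit rather than a hand-written dictionary.

```python
def constraint_errors(requested, measured):
    """Absolute error for every property present in both dictionaries."""
    return {name: abs(measured[name] - target)
            for name, target in requested.items()
            if name in measured}


def mean_absolute_errors(samples):
    """Average the per-property error over (requested, measured) pairs."""
    totals, counts = {}, {}
    for requested, measured in samples:
        for name, err in constraint_errors(requested, measured).items():
            totals[name] = totals.get(name, 0.0) + err
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


# Requested constraints (property names as used in the corpus);
# the measured values below are illustrative only.
requested = {"logP": 3.06, "num of rings": 2}
measured = {"logP": 3.31, "num of rings": 2}

errors = constraint_errors(requested, measured)
summary = mean_absolute_errors([(requested, measured)])
print(errors)
```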

Slide 6

Slide 6 text

Pretraining Results: Constrained Molecule Generation Task

Slide 7

Slide 7 text

Pretraining Results: Constrained Molecule Generation Task

Slide 8

Slide 8 text

Downstream Tasks
Substructure-based Generation

Slide 9

Slide 9 text

Substructure-based Generation
Overall task: Generate a molecule with desired (fixed) substructures and a variable substructure based on a description.
Example:
● 1 ether oxygens (including phenoxy)
● 1 benzene rings
[Figure: ground-truth molecule.]

Slide 10

Slide 10 text

Substructure-based Generation Samples
Overall task: Generate a molecule with desired (fixed) substructures and a variable substructure based on a description.
● 1 Tertiary amines
● 1 halogens, Bicyclic
● 1 aromatic nitrogens
● 1 benzene rings
● 1 pyridine rings
[Figure panels: Generated, Ground Truth, Reference.]

Slide 11

Slide 11 text

Substructure-based Generation Samples
Task: Create a molecule with a single provided substructure.
● 1 ether oxygens (including phenoxy)
● 1 halogens
● 1 methoxy groups -OCH3
● 1 benzene rings
[Figure panels: Generated, Ground Truth, Reference.]

Slide 12

Slide 12 text

Substructure-based Generation Samples
Task: Find a linker from a description to join 2 substructures.
Possible use contexts: PROTAC
● 2 amides
● 1 halogens
● 1 Primary amines
● 1 Secondary amines
● 1 primary amides
● 2 Tertiary amines
● 1 para-hydroxylation sites
● 2 carbonyl O
● 2 aromatic nitrogens
● 1 benzene rings
[Figure panels: Generated, Ground Truth, Reference.]

Slide 13

Slide 13 text

Substructure-based Generation Samples
Task: Generate a molecule with a desired substructure based on a verbal description of the substructure.
● 2 ether oxygens (including phenoxy)
● 2 methoxy groups -OCH3
● 1 benzene rings
[Figure panels: Generated, Ground Truth, Reference.]

Slide 14

Slide 14 text

Single Shot Substructure Generation

Slide 15

Slide 15 text

Conclusions
● OMGeE was tested without fine-tuning on constrained molecule generation, achieving high performance with small error across 26 constraints.
● OMGeE was successfully fine-tuned for substructure-based generation, showing adaptability to downstream tasks not originally present in the pre-training.
● This study opens the door to further leveraging OMGeE for other downstream applications, such as:
  ○ One-step retrosynthesis
  ○ Pharmacophore-based generation
  ○ Matched-Molecular-Pairs-based optimization
  ○ etc.
● The OMGeE weights and architecture are open-sourced for researchers to further extend the functionality.
● OMGeE and its downstream functionality are coming soon to Elix Discovery™.

Slide 16

Slide 16 text

Elix, Inc. http://ja.elix-inc.com/

Slide 17

Slide 17 text

Downstream Tasks
Once the foundation model is trained, it can be adapted to downstream tasks using LoRA.
LoRA fine-tuning takes a small fraction of the time and a fraction of the resources.
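Why LoRA is cheap can be seen from a parameter count. In LoRA the pretrained weight matrix W (d_out × d_in) stays frozen and only two small matrices B (d_out × r) and A (r × d_in) are trained, giving an effective weight W + (alpha / r)·B·A. The sketch below uses plain Python lists as matrices; the layer size and rank are hypothetical, since the slides do not state the LoRA settings used for OMGeE.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters divided by the parameters of the full matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return lora / full


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(A)
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 projection adapted with rank 8.
print(lora_trainable_fraction(4096, 4096, 8))

# With B initialised to zeros the adapted layer reproduces the frozen one.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [2, 3]
print(lora_forward(x, W, A, [[0], [0]], alpha=2))
```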

Slide 18

Slide 18 text

ChEMBL Corpus
A ChEMBL corpus sample can be divided into 5 parts:
1. Physico-chemical properties: logP, molecular weight, QED, number of aromatic rings, number of valence electrons, Labute approximate surface area, etc.
2. Summary of substructures: number of epoxide rings, number of esters, number of ether oxygens, etc.
3. Similar molecules: Tanimoto similarity > 0.8 (if available)
4. Activities and targets: activity type, pChEMBL value, value, target, protein sequence, description of the target (if available)
5. Molecule: depicted with multiple canonical and non-canonical SMILES representations
Actual sample:
hydrogen bond donnors: 0, polar surface area: 61.83, num of radical electrons: 0, most Acid dissociation constants (pKa): 13.58, num of aliphatic heterocycles: 0, num of N or O (Nitrogens and Oxygens): 5, hydrogen bond acceptors: 5, num of aliphatic rings: 0, labute approximate surface area (LabuteASA): 140.71, logD: 3.12, num saturated rings: 0, num of aliphatic carbocycles: 0, molecule type: Small molecule, rule of 3: fail, num heavy atoms: 24, num of rings: 2, num heteroatoms: 5, molecular formula: C19H20O5, natural product likeness score: -0.54, full molecular weight: 328.36, molecular species: NEUTRAL, logP: 3.06, num aromatic heterocycles: 0, molecular weight monoisotopic: 328.1311, standard international chemical identifier (InChI): InChI=1S/C19H20O5/c1-22-16-9-6-14(7-10-16)8-11-19(21)24-13-18(20)15-4-3-5-17(12-15)23-2/h3-7,9-10,12H,8,11,13H2,1-2H3, num aromatic rings: 2, quantitative estimate of drug-likeness (qed): 0.55, full molecular formula: C19H20O5, num lipinski rule of 5 (ro5) violations: 0, num saturated carbocycles: 0, fraction of SP3 hybridized C atoms: 0.26, num aromatic carbocycles: 2, molecular weight freebase: 328.36, num rotatable bonds: 8, num of NHs or OH: 0, Balaban’s J value (BalabanJ): 1.78, fragments: 2 benzene rings, 2 carbonyl O, excluding COOH, 1 ketones excluding diaryl, a,b-unsat. dienones, heteroatom on Calpha, 2 carbonyl O, 1 ketones, 1 esters, 1 aryl methyl sites for hydroxylation, 3 ether oxygens (including phenoxy), 2 methoxy groups -OCH3, num of valence electrons: 126, activities: activity type: Potency, pChEMBLvalue: 4.9, Potency=12589.3nM, target: MSQEGDYGRWTISSSDESEEEKPKPDKPSTSSLLCARQGAANEPRYTCSEAQKAAHKRKISPVKFSNTDSVLPPKRQKSGSQEDLGWCLSSSDDELQPEMPQKQAEKVVIKKEKDISAPNDGTAQRTENHGAPACHRLKEEEDEYETSGEGQDIWDMLDKGNPFQFYLTRVSGVKPKYNSGALHIKDILSPLFGTLVSSAQFNYCFDVDWLVKQYPPEFRKKPILLVHGDKREAKAHLHAQAKPYENISLCQAKLDIAFGTHHTKMMLLLYEEGLRVVIHTSNLIHADWHQKTQGIWLSPLYPRIADGTHKSGESPTHFKADLISYLMAYNAPSLKEWIDVIHKHDLSETNVYLIGSTPGRFQGSQKDNWGHFRLKKLLKDHASSMPNAESWPVVGQFSSVGSLGADESKWLCSEFKESMLTLGKESKTPGKSSVPLYLIYPSVENVRTSLEGYPAGGSLPYSIQTAEKQNWLHSYFHKWSAETSGRSNAMPHIKTYMRPSPDFSKIAWFLVTSANLSKAAWGALEKNGTQLMIRSYELGVLFLPSAFGLDSFKVKQKFFAGSQEPMATFPVPYDLPPELYGSKDRPWIWNIPYVKAPDTHGNMWVPS, description: Tyrosyl-DNA phosphodiesterase 1, SMILES: COC1=CC=C(CCC(=O)OCC(=O)C2=CC=CC(OC)=C2)C=C1, COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1, COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1
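The "key: value" layout of the sample can be produced by a small formatter. The following is a sketch under assumptions: `format_sample` and its arguments are hypothetical names, only three of the five parts are covered, and the real pipeline (field order, similar molecules, activities) is not described on the slide.

```python
def format_sample(properties, fragments, smiles_variants):
    """Render one molecule as a human-readable, comma-separated text sample.

    properties      -- mapping of property name to value
    fragments       -- mapping of substructure name to its count
    smiles_variants -- list of SMILES strings for the same molecule
    """
    parts = [f"{name}: {value}" for name, value in properties.items()]
    fragment_text = ", ".join(f"{count} {name}" for name, count in fragments.items())
    parts.append(f"fragments: {fragment_text}")
    parts.append("SMILES: " + ", ".join(smiles_variants))
    return ", ".join(parts)


# A few fields taken from the sample on this slide.
text = format_sample(
    properties={"logD": 3.12, "num of rings": 2},
    fragments={"benzene rings": 2, "esters": 1},
    smiles_variants=[
        "COC1=CC=C(CCC(=O)OCC(=O)C2=CC=CC(OC)=C2)C=C1",
        "COc1ccc(CCC(=O)OCC(=O)c2cccc(OC)c2)cc1",
    ],
)
print(text)
```

Placing the SMILES last matches the sample above, where the molecule follows all of its properties.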

Slide 19

Slide 19 text

ADMET Matched Molecular Pairs (MMP) Corpus
• A dataset of molecule pairs and their associated ADMET properties
• Constructed using the Therapeutics Data Commons (TDC) library *
• Contains 22 ADMET metrics:
  • Caco-2 (Cell Effective Permeability),
  • Absorption PAMPA Permeability,
  • Distribution BBB (Blood-Brain Barrier),
  • Metabolism CYP P450 2C19 Inhibition,
  • Excretion Clearance Hepatocyte, etc.
• ~ 4.2M samples
[Figure: matched molecular pairs by Murcko scaffold. Molecule 1: Distribution BBB: 1, Caco-2: -4.84. Molecule 2: Distribution BBB: 1, Caco-2: -5.79. Create a transformation path with the change in ADMET properties: Distribution BBB: no_change, Caco-2: high->low.]
Actual sample:
MMP transformation: CCCCNc1ccc(NCCCC)c2c1C(=O)c1ccccc1C2=O -> Cc1ccc2c(c1)C(=O)c1ccccc1C2=O, Delta absorption Solubility: high->low
* Velez-Arce A, Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. Therapeutics Data Commons - PyTDC. https://pypi.org/project/PyTDC/. Zitnik Lab, Harvard University.
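One way to obtain labels such as "no_change" and "high->low" is to classify each value as high or low against a cutoff and compare the two classes. This is a sketch of that idea only: the slide does not state the rule actually used, and the cutoffs below (-5.15 for Caco-2, 0.5 for the binary BBB label) are hypothetical.

```python
def change_label(source, target, threshold):
    """Label the change of one property between the two molecules of a pair.

    A value counts as "high" when it is above `threshold`; the label is
    "no_change" when both molecules fall in the same class.
    """
    source_high = source > threshold
    target_high = target > threshold
    if source_high == target_high:
        return "no_change"
    return "high->low" if source_high else "low->high"


# Values from the figure on this slide; the thresholds are assumptions.
print("Caco-2:", change_label(-4.84, -5.79, threshold=-5.15))
print("Distribution BBB:", change_label(1, 1, threshold=0.5))
```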

Slide 20

Slide 20 text

LogD, Solubility and CLint Matched Molecular Pairs (MMP) Corpus
• A dataset of molecule pairs and their associated LogD, solubility and CLint properties
• Based on the dataset from the paper "Transformer-based molecular optimization beyond matched molecular pairs" *
• Couples molecules based on:
  • single transformation differences
  • Tanimoto similarity
  • Murcko scaffolds
• ~ 7.1M samples
[Figure: matched molecular pairs by Murcko scaffold. Molecule 1: logD: 2.23, Sol: 2.06, CLint: 1.11. Molecule 2: logD: 2.39, Sol: 1.65, CLint: 1.18. Create a transformation path with the change in ADMET properties: logD: (0.0, 0.2], Sol: high->low, CLint: no_change.]
Actual sample:
MMP transformation: Cc1nc2ccccc2c(=O)n1-c1ccc(OC2CCN(C3CCCC3)CC2)cc1 -> Cc1nc2ccncc2c(=O)n1-c1ccc(OC2CCN(C3CCC3)CC2)cc1, Delta logD: (-1.1, -0.9], Delta solubility: no_change, Delta Clint: no_change, source molecule logD: 2.17, target molecule logD: 1.11, source molecule solubility: 2.58, target molecule solubility: 2.91, source molecule Clint: 0.92, target molecule Clint: 0.86
* He J, Nittinger E, Tyrchan C, Czechtizky W, Patronov A, Bjerrum EJ, Engkvist O. Transformer-based molecular optimization beyond matched molecular pairs. J Cheminform. 2022 Mar 28;14(1):18. doi: 10.1186/s13321-022-00599-3. PMID: 35346368; PMCID: PMC8962145.
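The "Delta logD" field is an interval label rather than a number. A minimal sketch of such binning follows, assuming right-closed intervals of width 0.2 with edges at odd tenths, which reproduces the (-1.1, -0.9] label of the actual sample (2.17 to 1.11). The figure's illustrative label (0.0, 0.2] lies on a different grid, and the slide does not say which grid the corpus uses, so the grid here is an assumption and `delta_interval` is a hypothetical helper.

```python
import math


def delta_interval(source, target):
    """Return the right-closed 0.2-wide interval containing target - source.

    Works in integer hundredths to avoid floating-point edge effects.
    Interval edges sit at ..., -0.3, -0.1, 0.1, 0.3, ... (an assumption).
    """
    delta = round((target - source) * 100)           # difference in hundredths
    upper = 10 + 20 * math.ceil((delta - 10) / 20)   # smallest edge >= delta
    lower = upper - 20
    return f"({lower / 100:.1f}, {upper / 100:.1f}]"


# logD values of the actual sample on this slide.
print("Delta logD:", delta_interval(2.17, 1.11))
```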

Slide 21

Slide 21 text

Retrosynthesis Corpus
• The retrosynthesis task is decomposed into 1-step retrosynthesis subtasks
• Based on the USPTO-50K reaction data
• Two types of samples are generated:
  a. relating the product to the reactants with a human-readable reaction-class name
  b. relating the product to the template; contains the retrosynthesis template, forward reaction, and retrosynthesis transformation
• ~ 0.85M samples
Actual sample 1 (product -> reactants):
Retrosynthesis: class: Heterocycle Formation, retrosynthesis transformation: O=C1CN(c2ccc(/C=C/c3cccnc3)cc2OCc2ccccc2)S(=O)(=O)N1 -> O=C1CN(c2ccc(I)cc2OCc2ccccc2)S(=O)(=O)N1.C=Cc1cccnc1
Actual sample 2:
Retrosynthesis USPTO: main product: COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)Nc1nc2ccc(OC(F)(F)F)cc2s1, retrosynthesis template: [#16;a:4]:[c:5](:[#7;a:6])-[NH;D2;+0:7]-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3]>>O-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3].[#16;a:4]:[c:5](:[#7;a:6])-[NH2;D1;+0:7], forward reaction: O-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3].[#16;a:4]:[c:5](:[#7;a:6])-[NH2;D1;+0:7]>>[#16;a:4]:[c:5](:[#7;a:6])-[NH;D2;+0:7]-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3], retrosynthesis transformation: COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)Nc1nc2ccc(OC(F)(F)F)cc2s1 -> COc1ccc(C2CC(c3cc(OC)c(OC)c(OC)c3)=NN2C(C)=O)cc1OCC(=O)O.Nc1nc2ccc(OC(F)(F)F)cc2s1
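A "retrosynthesis transformation" string can be read back into a product and its reactants with plain string handling. This is a minimal sketch with a hypothetical helper name; it relies on "->" separating the two sides and on "." separating molecules in SMILES, so a reactant that is itself written with a "." (for example a salt) would be split into two entries.

```python
def parse_transformation(text):
    """Split 'product -> reactant1.reactant2' into (product, [reactants])."""
    product, reactants = (side.strip() for side in text.split("->"))
    return product, reactants.split(".")


# Transformation taken from actual sample 1 on this slide.
transformation = (
    "O=C1CN(c2ccc(/C=C/c3cccnc3)cc2OCc2ccccc2)S(=O)(=O)N1 -> "
    "O=C1CN(c2ccc(I)cc2OCc2ccccc2)S(=O)(=O)N1.C=Cc1cccnc1"
)
product, reactants = parse_transformation(transformation)
print("product:  ", product)
print("reactants:", reactants)
```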

Slide 22

Slide 22 text

Results: Tanimoto Similarity
• Additionally, different sampling temperatures were tried: 0.1, 0.3, 0.5, and 0.8 (smaller -> more peaked distribution, larger -> flatter distribution).
• The Tanimoto similarity between the property template molecule and the generated molecule was measured. [Figure: Tanimoto similarity per temperature.]
The generation success rate, i.e. the rate of valid molecules generated, was 0.92, 0.93, 0.92, and 0.88 for temperatures of 0.1, 0.3, 0.5, and 0.8, respectively.
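Both quantities on this slide have compact definitions. Temperature divides the logits before the softmax, so a small temperature sharpens the next-token distribution; Tanimoto similarity is the size of the intersection of two fingerprint bit sets divided by the size of their union. The sketch below uses made-up logits and toy bit sets, since the slide states neither the fingerprint type nor the model outputs.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| between two sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


logits = [2.0, 1.0, 0.0]  # illustrative values only
for temperature in (0.1, 0.3, 0.5, 0.8):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(tanimoto({1, 2, 3}, {2, 3, 4}))
```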

Slide 23

Slide 23 text

Preparing a Dataset: Splitting the Molecule
Starting from a molecule, it is cut into fragments.
[Figure: a molecule with two cut points (✄).]
Each fragment has at least 4 heavy atoms, and at most 3 fragments are generated.

Slide 24

Slide 24 text

Preparing a Dataset: Sticking Together the Pieces
Sample:
Fragment decomposition: Fragment 1: [1*]c1cccc(OC)c1OC, Fragment 2: [1*]C(=O)O[2*], Fragment 3: 1 ketones, 1 halogens, 1 carbonyl O, 1 ketones excluding diaryl, a,b-unsat. dienones, heteroatom on Calpha, 1 thiophene rings; w connections: [2]; SMILES: COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC, COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC, COC1=CC=CC(C(=O)OCC(=O)C2=CC=C(Br)S2)=C1OC
[Figure labels: Input (fragment decomposition), Output (SMILES).]
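The sample string can be assembled from its pieces as follows. This is a sketch under assumptions: `build_sample` is a hypothetical helper, the described fragment is assumed to come last, the "w connections" value is passed through as an opaque list because the slide does not define it, and the split into prompt (up to "SMILES:") and completion (the SMILES strings) is inferred from the Input/Output labels.

```python
def build_sample(fixed_fragments, description, connections, smiles_variants):
    """Return (prompt, completion) for one substructure-generation sample.

    fixed_fragments -- SMILES of fixed fragments with [n*] attachment points
    description     -- text description of the variable fragment
    connections     -- list copied verbatim into the 'w connections' field
    smiles_variants -- SMILES strings of the full molecule
    """
    parts = [f"Fragment {i}: {frag}"
             for i, frag in enumerate(fixed_fragments, start=1)]
    parts.append(f"Fragment {len(fixed_fragments) + 1}: {description}")
    prompt = ("Fragment decomposition: " + ", ".join(parts)
              + f"; w connections: {connections}; SMILES:")
    completion = " " + ", ".join(smiles_variants)
    return prompt, completion


# Pieces taken from the sample on this slide.
prompt, completion = build_sample(
    fixed_fragments=["[1*]c1cccc(OC)c1OC", "[1*]C(=O)O[2*]"],
    description=("1 ketones, 1 halogens, 1 carbonyl O, 1 ketones excluding "
                 "diaryl, a,b-unsat. dienones, heteroatom on Calpha, "
                 "1 thiophene rings"),
    connections=[2],
    smiles_variants=[
        "COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC",
        "COc1cccc(C(=O)OCC(=O)c2ccc(Br)s2)c1OC",
        "COC1=CC=CC(C(=O)OCC(=O)C2=CC=C(Br)S2)=C1OC",
    ],
)
print(prompt + completion)
```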

Slide 25

Slide 25 text

Single Shot Substructure Generation
Partially Altered Scaffold
[Figure: example of a partially altered scaffold.]

Slide 26

Slide 26 text

Elix, Inc. http://ja.elix-inc.com/