kMoL: An Open-source Machine and Federated Learning Library for Drug Discovery

Elix

July 18, 2025

Transcript

  1. kMoL: An Open-source Machine and Federated Learning Library for Drug Discovery

    Haris Hasić, Research Engineer @ Elix, Inc.
    Federated Learning Virtual Seminar | July 18th, 2025 | Copyright © Elix, Inc. All rights reserved. Any unauthorized reproduction or redistribution of this material is strictly prohibited.
  2. Agenda

    ■ Section I: kMoL
        • Federated Learning
        • kMoL Introduction
        • Machine Learning with kMoL
        • Federated Learning with kMoL
        • Additional Information
    ■ Section II: Federated Learning in Drug Discovery
        • Initiatives and Libraries
        • kMoL in the Context of Federated Learning in Drug Discovery
    ■ Section III: Conclusion
        • Conclusion
    ■ Section IV: Appendix
        • Benchmarking Experiment Results
        • Federated Learning Experiment Results
  3. Section I: kMoL
  4. Federated Learning

    ■ Data is considered a critical asset for any company and its protection is paramount.
    ■ On the other hand, collaboration between companies and academic or research organizations continues to be the driving force of innovation and progress.
    ■ Consequently, data privacy and security concerns are constantly at odds with collaboration efforts that aim to obtain more performant and robust models.
    ■ Federated learning represents a potential solution to this conflict of interests.

    In the case of centralized learning, the private data is shared and aggregated on a single machine, which is then utilized to train and distribute the global model. In the case of federated learning, the private data is utilized to train the local models, which are then shared and aggregated on a single machine to establish the global model.
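The difference between the two settings can be illustrated with a toy sketch. It assumes a one-parameter linear model and two hypothetical clients; the function names `train_locally` and `federated_round` are illustrative and are not part of kMoL or Mila.

```python
# Hypothetical toy example: none of these names come from kMoL or Mila.

def train_locally(weight, samples, learning_rate=0.01, epochs=20):
    """Fit y = weight * x on one client's private samples with gradient descent."""
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
    return weight


def federated_round(global_weight, client_datasets):
    """One round: each client trains locally, the server averages the weights."""
    local_weights = [train_locally(global_weight, data) for data in client_datasets]
    return sum(local_weights) / len(local_weights)


# Private datasets stay with their owners; only trained weights are shared.
client_a = [(1.0, 2.0), (2.0, 4.0)]
client_b = [(3.0, 6.0), (4.0, 8.0)]

global_weight = 0.0
for _ in range(10):
    global_weight = federated_round(global_weight, [client_a, client_b])
```

In this sketch the server only ever sees the values returned by `train_locally`, never the `(x, y)` pairs, which is the property the federated setting relies on.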
  5. kMoL Introduction

    ■ kMoL is an open-source machine learning library with integrated federated learning capabilities developed primarily for drug discovery pipelines, and it consists of:
        • a machine learning package called kMoL and
        • a federated learning package called Mila.
    ■ It offers extensive customization options and advanced security features, allowing users to configure, optimize, and deploy models while securely sharing the model parameters.
  6. Machine Learning with kMoL: Data Pre-processing

    ■ In general, the users interact with kMoL by running pipelines which, by default, consist of:
        • a data pre-processing workflow and
        • a data analysis and execution workflow.
    ■ The primary objective of the data pre-processing workflow is to prepare the data samples for the data analysis and execution workflow.
    ■ The components of the data pre-processing workflow are:
        • loaders, which load the data from the disk into memory,
        • featurizers, which convert the data to input features,
        • transformers, which convert the data to output features,
        • splitters, which divide the data into distinct groups based on a criterion, and
        • streamers, which integrate all components into a cohesive data pre-processing workflow.
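The division of labor between the five component types can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the CSV layout, and the toy featurization (counting atom symbols in a SMILES string) are assumptions made for the example and do not reflect the actual kMoL classes or configuration.

```python
import csv
import io
import math
import random


def load(handle):
    """Loader: read rows (SMILES string, raw target) into memory."""
    return [(row["smiles"], float(row["target"])) for row in csv.DictReader(handle)]


def featurize(smiles):
    """Featurizer: toy input features (symbol counts), for illustration only."""
    return [smiles.count(symbol) for symbol in ("C", "N", "O")]


def transform(target):
    """Transformer: convert the raw target into the output feature (log10 scale)."""
    return math.log10(target)


def split(samples, train_ratio=0.75, seed=0):
    """Splitter: divide the samples into train and test groups at random."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def stream(handle):
    """Streamer: chain the components into one pre-processing workflow."""
    samples = [(featurize(s), transform(t)) for s, t in load(handle)]
    return split(samples)


# An in-memory file stands in for data on disk.
raw = io.StringIO("smiles,target\nCCO,10\nCCN,100\nCC(=O)O,1000\nc1ccccc1,1\n")
train_set, test_set = stream(raw)
```

Keeping each stage as a separate callable is what makes it possible to swap, for example, a random splitter for a different splitting criterion without touching the other stages.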
  7. Machine Learning with kMoL: Data Analysis and Execution

    ■ The machine learning operations of the data analysis and execution workflow are:
        • training using fixed parameters,
        • hyperparameter tuning using Bayesian optimization,
        • standard validation,
        • cross-validation,
        • inference, and
        • explanation of inference results.
    ■ The federated learning operations of the data analysis and execution workflow are:
        • launching a server and
        • connecting to a server as a client.
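Of the operations listed above, cross-validation is the easiest to illustrate in isolation. The sketch below shows generic k-fold index generation; `k_fold_indices` is a hypothetical helper written for this example and is not how kMoL exposes the operation.

```python
def k_fold_indices(sample_count, fold_count):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(sample_count))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        sample_count // fold_count + (1 if i < sample_count % fold_count else 0)
        for i in range(fold_count)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(sample_count=10, fold_count=3))
```

Each sample appears in exactly one validation fold, so averaging the per-fold metric gives an estimate that uses every sample for validation once.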
  8. Federated Learning with kMoL: Communication and Security

    ■ Similar to NVIDIA Clara, [1, 2] the communication between the clients and the servers is established utilizing the gRPC framework and protocol buffers. [3]
    ■ The communication is conducted over SSL/TLS with client and server certificate chains, with the possibility of whitelisting and blacklisting IP addresses on the server side.
    ■ Differential privacy is supported utilizing the Opacus library. [4]
    ■ This implementation is not dependent on a specific framework or hardware.

    [1] https://docs.nvidia.com/clara/index.html. Accessed on July 18th, 2025.
    [2] https://developer.nvidia.com/blog/federated-learning-clara. Accessed on July 18th, 2025.
    [3] https://grpc.io. Accessed on July 18th, 2025.
    [4] https://opacus.ai. Accessed on July 18th, 2025.
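To make the differential privacy bullet more concrete, the sketch below shows the general clip-and-noise step of DP-SGD, the technique that Opacus implements for PyTorch. It is written in plain Python and does not use the Opacus API; the function and parameter names are assumptions chosen for the example.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def private_gradient(per_sample_gradients, max_norm, noise_multiplier, rng):
    """Clip every per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_gradients]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_gradients) for n in noisy]


gradients = [[3.0, 4.0], [0.3, 0.4]]

# With the noise switched off, only the clipping effect is visible.
noise_free = private_gradient(
    gradients, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0)
)

# With noise enabled, the result is randomized around the clipped average.
noisy = private_gradient(
    gradients, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0)
)
```

Clipping bounds how much any single sample can move the model, and the noise then masks the remaining contribution; together they are what yields a privacy guarantee for the shared parameters.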
  9. Federated Learning with kMoL: Model Training and Aggregation

    In the case of plain averaging, the model checkpoint weights from each client are utilized as they are.

    In the case of weighted averaging, the model checkpoint weights from each client are typically adjusted by the ratio of the utilized dataset samples, though custom ratios are also possible.

    In the case of benchmarked averaging, a benchmark dataset is utilized to evaluate the performance of the model checkpoints from each client in each round. The model checkpoint weights are then adjusted by the ratio of the performance of each client calculated utilizing a user-specified metric.
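The three aggregation strategies differ only in how the per-client ratios are chosen, which the following sketch makes explicit. It represents each checkpoint as a flat list of floats and assumes a benchmark metric where higher is better; all function names are hypothetical and are not the Mila aggregator classes.

```python
def aggregate(client_weights, ratios):
    """Combine per-client weight vectors using ratios normalized to sum to one."""
    total = sum(ratios)
    return [
        sum(ratio / total * weight for ratio, weight in zip(ratios, column))
        for column in zip(*client_weights)
    ]


def plain_average(client_weights):
    """Plain averaging: every client contributes equally."""
    return aggregate(client_weights, [1.0] * len(client_weights))


def weighted_average(client_weights, sample_counts):
    """Weighted averaging: ratios follow the number of samples each client used."""
    return aggregate(client_weights, sample_counts)


def benchmarked_average(client_weights, benchmark_scores):
    """Benchmarked averaging: ratios follow each client's benchmark score."""
    return aggregate(client_weights, benchmark_scores)


checkpoints = [[1.0, 2.0], [3.0, 6.0]]  # two clients, two parameters each

plain = plain_average(checkpoints)
weighted = weighted_average(checkpoints, [100, 300])
benchmarked = benchmarked_average(checkpoints, [0.9, 0.6])
```

A metric where lower is better (for example, an error) would need to be inverted or otherwise rescaled before being used as a ratio in this sketch.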
  10. Additional Information

    ■ For additional information on kMoL, please refer to the official publication in the BMC Journal of Cheminformatics [5] and the official GitHub repository. [6]
    ■ The benchmarking experiments in the official publication were conducted utilizing the MoleculeNet [7] datasets.
    ■ The federated learning experiments in the official publication were conducted utilizing various Toxicity and ADME datasets in the:
        • client,
        • epoch, and
        • balance settings.

    [5] Cozac, R. et al. J. Cheminform. 17, 22, 2025.
    [6] https://github.com/elix-tech/kmol. Accessed on July 18th, 2025.
    [7] Wu, Z. et al. Chem. Sci., 2018, 9, 513-530.
  11. Section II: Federated Learning in Drug Discovery
  12. Initiatives and Libraries

    ■ The prominent federated learning in drug discovery initiatives include:
        • MELLODDY [8] (i.e., Amgen, Astellas, AstraZeneca, Bayer, Boehringer Ingelheim, GSK, Janssen, Merck, Novartis, and Servier),
        • K-MELLODDY [9] (i.e., Daewoong, Dong Wha, Samjin, Yuhan, JW, Jeil, Hanmi, and Huons), and
        • FLuID [10] (i.e., Sanofi, Merck, Bayer, Roche, Novartis, UCB, Takeda, GSK, and AstraZeneca).
    ■ The prominent federated learning in drug discovery library is NVIDIA Clara. [1, 2]
    ■ The general purpose federated learning libraries include:
        • PySyft, [11]
        • FATE, [12]
        • Flower, [13]
        • TensorFlow Federated, [14]
        • OpenFL, [15]
        • NVIDIA FLARE, [16] and
        • Substra. [17]

    [1] https://docs.nvidia.com/clara/index.html. Accessed on July 18th, 2025.
    [2] https://developer.nvidia.com/blog/federated-learning-clara. Accessed on July 18th, 2025.
    [8] Heyndrickx, W. et al. J. Chem. Inf. Model., 2024, 64, 7, 2331-2344.
    [9] https://kmelloddy.org. Accessed on July 18th, 2025.
    [10] Hanser, T. et al. Nat. Mach. Intell., 7, 423-436, 2025.
    [11] https://docs.openmined.org/en/latest. Accessed on July 18th, 2025.
    [12] Liu, Y. et al. Journal of Machine Learning Research, 2021, 22, 226, 1-6.
    [13] Beutel, D.J. et al. arXiv, 2007.14390 [cs.LG].
    [14] https://github.com/google-parfait/tensorflow-federated. Accessed on July 18th, 2025.
    [15] Foley, P. et al. Phys. Med. Biol., 67, 214001, 2022.
    [16] https://developer.nvidia.com/flare. Accessed on July 18th, 2025.
    [17] https://github.com/substra. Accessed on July 18th, 2025.
  13. Initiatives and Libraries: MELLODDY and K-MELLODDY

    ■ As a part of the MELLODDY [8] initiative, 10 pharmaceutical companies realized aggregated improvements in machine learning model performance utilizing federated learning.
    ■ The experiments were conducted utilizing a combined dataset of ~2.6 billion confidential experimental activity data points, ~21 million small molecules, and ~40 thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics.
    ■ The K-MELLODDY [9] initiative is a national initiative in South Korea directly inspired by the MELLODDY initiative.

    [8] Heyndrickx, W. et al. J. Chem. Inf. Model., 2024, 64, 7, 2331-2344.
    [9] https://kmelloddy.org. Accessed on July 18th, 2025.
  14. Initiatives and Libraries: FLuID

    ■ As a part of the FLuID [10] initiative, 8 pharmaceutical companies realized aggregated improvements in machine learning model performance utilizing federated learning with knowledge distillation.
    ■ The participants trained the teacher models utilizing an average of ~10 thousand confidential instances, with individual contributions ranging from ~3 thousand to ~70 thousand instances.
    ■ The experiments demonstrated the successful extraction, transfer, and federation of knowledge in the context of hERG activity data classification.

    [10] Hanser, T. et al. Nat. Mach. Intell., 7, 423-436, 2025.
  15. kMoL in the Context of Federated Learning in Drug Discovery

    ■ What is the status of kMoL [5, 6] in the context of federated learning in drug discovery?
        1. kMoL is an open-source library.
        2. kMoL is developed primarily for drug discovery pipelines, but it is also highly customizable in that context.
        3. In comparison to the MELLODDY, [8] K-MELLODDY, [9] and FLuID [10] initiatives, kMoL does not require a consortium membership to be utilized.
        4. In comparison to the NVIDIA Clara [1, 2] library, kMoL is lightweight and not dependent on specific hardware or software.
        5. kMoL is experimentally verified utilizing public data, while the verification utilizing private data is currently in progress.
        6. kMoL has a wide range of application possibilities, from small research projects to large-scale industrial collaborations.

    [1] https://docs.nvidia.com/clara/index.html. Accessed on July 18th, 2025.
    [2] https://developer.nvidia.com/blog/federated-learning-clara. Accessed on July 18th, 2025.
    [5] Cozac, R. et al. J. Cheminform. 17, 22, 2025.
    [6] https://github.com/elix-tech/kmol. Accessed on July 18th, 2025.
    [8] Heyndrickx, W. et al. J. Chem. Inf. Model., 2024, 64, 7, 2331-2344.
    [9] https://kmelloddy.org. Accessed on July 18th, 2025.
    [10] Hanser, T. et al. Nat. Mach. Intell., 7, 423-436, 2025.
  16. Section III: Conclusion
  17. Conclusion

    ■ kMoL [5, 6] is an open-source machine learning library with integrated federated learning capabilities developed primarily for drug discovery pipelines.
    ■ It offers extensive customization options and advanced security features, allowing users to configure, optimize, and deploy models while securely sharing the model parameters.
    ■ kMoL ranks highly in the context of federated learning in drug discovery considering its open-source nature, balance between specificity and customizability, and lack of barriers to entry like consortium memberships or hardware and software requirements.

    [5] Cozac, R. et al. J. Cheminform. 17, 22, 2025.
    [6] https://github.com/elix-tech/kmol. Accessed on July 18th, 2025.
  18. Thank You for Your Attention! Q&A
  19. References

    1. NVIDIA Clara. https://docs.nvidia.com/clara/index.html. Accessed on July 18th, 2025.
    2. Federated Learning Powered by NVIDIA Clara. https://developer.nvidia.com/blog/federated-learning-clara. Accessed on July 18th, 2025.
    3. gRPC. https://grpc.io. Accessed on July 18th, 2025.
    4. Opacus. https://opacus.ai. Accessed on July 18th, 2025.
    5. Cozac, R., Hasic, H., Choong, J.J., Richard, V., Beheshti, L., Froehlich, C., Koyama, T., Matsumoto, S., Kojima, R., Iwata, H., Hasegawa, A., Otsuka, T., and Okuno, Y. kMoL: An Open-source Machine and Federated Learning Library for Drug Discovery. J. Cheminform., 17, 22, 2025.
    6. kMoL GitHub Repository. https://github.com/elix-tech/kmol. Accessed on July 18th, 2025.
    7. Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., and Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci., 2018, 9, 513-530.
    8. Heyndrickx, W., Mervin, L., Morawietz, T., Sturm, N., Friedrich, L., Zalewski, A., Pentina, A., Humbeck, L., Oldenhof, M., Niwayama, R., Schmidtke, P., Fechner, N., Simm, J., Arany, A., Drizard, N., Jabal, R., Afanasyeva, A., Loeb, R., Verma, S., Harnqvist, S., Holmes, M., Pejo, B., Telenczuk, M., Holway, N., Dieckmann, A., Rieke, N., Zumsande, F., Clevert, D., Krug, M., Luscombe, C., Green, D., Ertl, P., Antal, P., Marcus, D., Do Huu, N., Fuji, H., Pickett, S., Acs, G., Boniface, E., Beck, B., Sun, Y., Gohier, A., Rippmann, F., Engkvist, O., Göller, A.H., Moreau, Y., Galtier, M.N., Schuffenhauer, A., and Ceulemans, H. MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information. J. Chem. Inf. Model., 2024, 64, 7, 2331-2344.
    9. Korea Machine Learning Ledger Orchestration For Drug Discovery. https://kmelloddy.org. Accessed on July 18th, 2025.
    10. Hanser, T., Ahlberg, E., Amberg, A., Anger, L.T., Barber, C., Brennan, R.J., Brigo, A., Delaunois, A., Glowienke, S., Greene, N., Johnston, L., Kuhn, D., Kuhnke, L., Marchaland, J., Muster, W., Plante, J., Rippmann, F., Sabnis, Y., Schmidt, F., van Deursen, R., Werner, S., White, A., Wichard, J., and Yukawa, T. Data-driven Federated Learning in Drug Discovery with Knowledge Distillation. Nat. Mach. Intell., 7, 423-436, 2025.
    11. PySyft: Data Science on Data You are Not Allowed to See. https://docs.openmined.org/en/latest. Accessed on July 18th, 2025.
    12. Liu, Y., Fan, T., Chen, T., Xu, Q., and Yang, Q. FATE: An Industrial Grade Platform for Collaborative Learning With Data Protection. Journal of Machine Learning Research, 2021, 22, 226, 1-6.
    13. Beutel, D.J., Topal, T., Mathur, A., Qiu, X., Fernandez-Marques, J., Gao, Y., Sani, L., Hei Li, K., Parcollet, T., Porto, P., de Gusmão, B., and Lane, N.D. Flower: A Friendly Federated Learning Research Framework. arXiv, 2007.14390 [cs.LG].
    14. TensorFlow Federated. https://github.com/google-parfait/tensorflow-federated. Accessed on July 18th, 2025.
    15. Foley, P., Sheller, M.J., Edwards, B., Pati, S., Riviera, W., Sharma, M., Moorthy, P.N., Wang, S., Martin, J., Mirhaji, P., Shah, P., and Bakas, S. OpenFL: The Open Federated Learning Library. Phys. Med. Biol., 67, 214001, 2022.
    16. NVIDIA FLARE. https://developer.nvidia.com/flare. Accessed on July 18th, 2025.
    17. Substra. https://github.com/substra. Accessed on July 18th, 2025.
    18. The Next-Generation Drug Discovery AI Development through Drug Discovery Support Promotion Project and Industry-Academia Collaboration (DAIIA). https://www.amed.go.jp/program/list/11/02/001_02-04.html. Accessed on July 18th, 2025.
    19. The Japan Agency for Medical Research and Development (AMED). https://www.amed.go.jp/aboutus/index.html. Accessed on July 18th, 2025.
    20. Zdrazil, B., Felix, E., Hunter, F., Manners, E.J., Blackshaw, J., Corbett, S., de Veij, M., Ioannidis, H., Lopez, D.M., Mosquera, J.F., Magarinos, M.P., Bosc, N., Arcila, R., Kizilören, T., Gaulton, A., Bento, A.P., Adasme, M.F., Monecke, P., Landrum, G.A., and Leach, A.R. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Research, 2024, 52, D1, D1180-D1192.
  20. Section IV: Appendix
  21. Benchmarking Experiment Results: Classification
  22. Benchmarking Experiment Results: Regression
  23. Federated Learning Experiment Results: Datasets
  24. Federated Learning Experiment Results: Cross-validation
  25. Federated Learning Experiment Results: Classification
  26. Federated Learning Experiment Results: Regression