
Generating synthetic data from OMOP-CDM databases for health applications

Alberto Labarga
September 19, 2023



Transcript

  1. Telecommunications Engineer. Head of Biomedical Data Hub @BSC. More than 20 years crunching data. Open data, open source, open science.
  2. European life sciences infrastructure
    • Data: Sustain core data resources
    • Tools: Services & connectors to drive access and exploitation
    • Compute: Access, Exchange & Compute on sensitive data
    • Standards: Integration and interoperability of data and services
    • Training: Professional skills for managing and exploiting data
  3. Machine Learning Focus Group
    • Standards for Machine Learning – This includes aspects such as controlled terminology/ontology and services for ML model description and sharing, alignment to the ELIXIR Tools and Interoperability Platforms, as well as defining best practices for Machine Learning-related reviewing.
    • Machine Learning and reproducibility – This area focuses on the definition of the best practices for developing, sharing and reusing Machine Learning approaches (including, but not limited to, Machine Learning models, algorithms, frameworks and protocols, including the DOME recommendations), while at the same time involving the existing approaches in the ELIXIR Tools Platform.
    • Benchmarking of Machine Learning tools – In order to facilitate clear and objective comparison of ML-based tools, it is important to establish a benchmarking protocol; this may include datasets, protocols and services offered by the ELIXIR Tools Platform.
    • Training for Machine Learning – Machine Learning has been identified by the ELIXIR Training Platform gap analysis task as an existing need. As such, a particular area of focus for this group will be to design and produce training resources for supporting the ELIXIR community, based on the standards and approaches established by the ELIXIR Training Platform.
  4. Synthetic data
    • Academic and industrial researchers should have:
      • access to relevant, robust, and generalisable synthetic data generation methodologies
      • access to relevant, high quality synthetic datasets
    • Thanks to better availability of robust synthetic datasets for training data models, healthcare providers and industry should have a wider range of performant AI-based and other data-driven tools to support diagnostics, personalised treatment decision-making and prediction of health outcomes.
  5. OMOP-CDM
    • The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence. A central component of the OMOP CDM is the OHDSI standardized vocabularies.
    • The OHDSI vocabularies allow organization and standardization of medical terms to be used across the various clinical domains of the OMOP common data model and enable standardized analytics that leverage the knowledge base when constructing exposure and outcome phenotypes and other features within characterization, population-level effect estimation, and patient-level prediction studies.
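
    To make the role of the standardized vocabularies concrete, here is a minimal sketch of how a source code can be resolved to a standard concept. It assumes the CONCEPT and CONCEPT_RELATIONSHIP tables of the OMOP CDM and the 'Maps to' relationship, reduced to a few columns. The two rows, their concept IDs, and the helper names `build_toy_vocabulary` and `map_to_standard` are invented for illustration and are not taken from an actual vocabulary release.

    ```python
    import sqlite3


    def build_toy_vocabulary():
        """Create an in-memory database with a tiny, made-up vocabulary."""
        conn = sqlite3.connect(":memory:")
        conn.executescript(
            """
            CREATE TABLE concept (
                concept_id       INTEGER PRIMARY KEY,
                concept_name     TEXT,
                vocabulary_id    TEXT,
                concept_code     TEXT,
                standard_concept TEXT      -- 'S' marks a standard concept
            );
            CREATE TABLE concept_relationship (
                concept_id_1    INTEGER,
                concept_id_2    INTEGER,
                relationship_id TEXT
            );
            """
        )
        # Illustrative rows only: the concept IDs are invented.
        conn.executemany(
            "INSERT INTO concept VALUES (?, ?, ?, ?, ?)",
            [
                (1001, "Type 2 diabetes mellitus without complications",
                 "ICD10CM", "E11.9", None),
                (2001, "Type 2 diabetes mellitus", "SNOMED", "44054006", "S"),
            ],
        )
        conn.execute(
            "INSERT INTO concept_relationship VALUES (1001, 2001, 'Maps to')"
        )
        return conn


    def map_to_standard(conn, vocabulary_id, source_code):
        """Return (concept_id, concept_name) of standard concepts for a source code."""
        query = """
            SELECT std.concept_id, std.concept_name
            FROM concept AS src
            JOIN concept_relationship AS cr
              ON cr.concept_id_1 = src.concept_id
             AND cr.relationship_id = 'Maps to'
            JOIN concept AS std
              ON std.concept_id = cr.concept_id_2
             AND std.standard_concept = 'S'
            WHERE src.vocabulary_id = ? AND src.concept_code = ?
        """
        return conn.execute(query, (vocabulary_id, source_code)).fetchall()


    conn = build_toy_vocabulary()
    print(map_to_standard(conn, "ICD10CM", "E11.9"))
    ```

    Because analyses are written against standard concepts, the same query logic can run on any database that has been mapped this way, whatever coding system the source records used.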
  6. OMOP-CDM
    • Open source tools for clinical data analysis (cohort generation, characterization, incidence, estimation, prediction)
    • International community to develop tools, standards and best-practice recommendations in health data normalization and interoperability
    • European federated data network for generating evidence from real-world data
  7. FIDELITY, UTILITY, PRIVACY
    • FIDELITY: How similar is this synthetic data to the original training sets? Metrics: Kullback-Leibler (KL) divergence, pairwise correlation difference.
    • UTILITY: How useful is this synthetic data for our downstream machine learning applications? Metrics: accuracy, F1-score, ROC, and AUC-ROC.
    • PRIVACY: Has any sensitive data been inadvertently synthesized by our model? Checks: membership inference, re-identification and attribute inference attacks.
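
    The two fidelity metrics named on this slide can be sketched in plain Python. This is an illustrative sketch, not the speaker's implementation: it assumes KL divergence is computed per categorical column with a small smoothing constant, and that pairwise correlation difference is the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic data. The function names and the `epsilon` parameter are hypothetical.

    ```python
    import math
    from collections import Counter


    def kl_divergence(real_values, synthetic_values, epsilon=1e-9):
        """KL(real || synthetic) for one categorical column.

        epsilon smooths categories that are missing from one side, so the
        logarithm is always defined (an assumption of this sketch).
        """
        categories = sorted(set(real_values) | set(synthetic_values))
        real_counts = Counter(real_values)
        synth_counts = Counter(synthetic_values)
        k = len(categories)
        total = 0.0
        for category in categories:
            p = (real_counts[category] + epsilon) / (len(real_values) + epsilon * k)
            q = (synth_counts[category] + epsilon) / (len(synthetic_values) + epsilon * k)
            total += p * math.log(p / q)
        return total


    def pearson(x, y):
        """Pearson correlation of two equal-length numeric lists.

        Raises ZeroDivisionError for constant columns, which this sketch
        does not handle.
        """
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
        norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (norm_x * norm_y)


    def pairwise_correlation_difference(real_columns, synthetic_columns):
        """Frobenius norm of (corr(real) - corr(synthetic)).

        Both arguments map column names to lists of numbers and must share
        the same column names.
        """
        names = sorted(real_columns)
        squared_sum = 0.0
        for a in names:
            for b in names:
                diff = (pearson(real_columns[a], real_columns[b])
                        - pearson(synthetic_columns[a], synthetic_columns[b]))
                squared_sum += diff ** 2
        return math.sqrt(squared_sum)


    # Toy data, invented for illustration.
    real = {"age": [30, 40, 50, 60], "sbp": [110, 120, 130, 140]}
    synthetic = {"age": [30, 40, 50, 60], "sbp": [140, 130, 120, 110]}
    print(kl_divergence(["M", "M", "F", "F"], ["M", "F", "F", "F"]))
    print(pairwise_correlation_difference(real, synthetic))
    ```

    Both scores are zero when the synthetic data reproduces the real distributions and correlations exactly, and grow as the two diverge. A value close to zero on its own says nothing about privacy, which is why the slide treats the three dimensions separately.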