Generating synthetic data from OMOP-CDM databases for health applications

Alberto Labarga

September 19, 2023

Transcript

  1. Generating synthetic data from OMOP-CDM
    databases for health applications
    Alberto Labarga

  2. Telecommunications Engineer
    Head of Biomedical Data Hub @BSC
    More than 20 years crunching data
    open data, open source, open science

  3. European life sciences infrastructure

    Data
    Sustain core data resources

    Tools
    Services & connectors to drive access
    and exploitation

    Compute
    Access, Exchange & Compute on
    sensitive data

    Standards
    Integration and interoperability of data
    and services

    Training
    Professional skills for managing and
    exploiting data

  4. Machine Learning Focus Group

    Standards for Machine Learning
    – This includes aspects such as controlled terminology/ontology and services for ML model description and sharing,
    alignment to the ELIXIR Tools and Interoperability Platforms, as well as defining best practices for
    Machine Learning-related reviewing.

    Machine Learning and reproducibility
    – This area focuses on the definition of the best practices for developing, sharing and reusing Machine Learning approaches
    (including, but not limited to, Machine Learning models, algorithms, frameworks and protocols, including the DOME
    recommendations), while at the same time involving the existing approaches in the ELIXIR Tools Platform.

    Benchmarking of Machine Learning tools
    – In order to facilitate clear and objective comparison of ML-based tools, it is important to establish a benchmarking
    protocol; this may include datasets, protocols and services offered by the ELIXIR Tools Platform.

    Training for Machine Learning
    – Machine Learning has been identified by the ELIXIR Training Platform gap analysis task as an existing need. As such, a
    particular area of focus for this group will be to design and produce training resources for supporting the ELIXIR
    community, based on the standards and approaches established by the ELIXIR Training Platform.

  5. Synthetic data

    Academic and industrial researchers should have:

    access to relevant, robust, and generalisable synthetic
    data generation methodologies

    access to relevant, high-quality synthetic datasets

    Thanks to better availability of robust synthetic datasets
    for training data models, healthcare providers and
    industry should have a wider range of performant
    AI-based and other data-driven tools to support diagnostics,
    personalised treatment decision-making and prediction
    of health outcomes.

  7. Generation

    Evaluation

    Publication

  8. OMOP-CDM

    The Observational Medical Outcomes Partnership (OMOP) Common
    Data Model (CDM) is an open community data standard, designed to
    standardize the structure and content of observational data and to
    enable efficient analyses that can produce reliable evidence. A central
    component of the OMOP CDM is the OHDSI standardized vocabularies.

    The OHDSI vocabularies allow organization and standardization of
    medical terms to be used across the various clinical domains of the
    OMOP common data model and enable standardized analytics that
    leverage the knowledge base when constructing exposure and outcome
    phenotypes and other features within characterization, population-level
    effect estimation, and patient-level prediction studies.
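
To make the role of the standardized vocabularies concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names (`person`, `condition_occurrence`, `concept`) follow the OMOP CDM, but only a handful of columns are included, and the three patients and their records are invented purely for illustration. The concept IDs are believed to match the OHDSI vocabularies (8507 male, 8532 female, 201826 type 2 diabetes mellitus) and should be checked against a current vocabulary release before being relied on.

```python
import sqlite3

# Tiny, simplified subset of the OMOP CDM (assumption: real tables have many more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    domain_id     TEXT,
    vocabulary_id TEXT
);
CREATE TABLE person (
    person_id         INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth     INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    condition_concept_id    INTEGER,
    condition_start_date    TEXT
);
""")

# Vocabulary rows: clinical meaning lives in the concept table, not in free text.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (8507, "MALE", "Gender", "Gender"),
    (8532, "FEMALE", "Gender", "Gender"),
    (201826, "Type 2 diabetes mellitus", "Condition", "SNOMED"),
])

# Invented patients and condition records, for illustration only.
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    (1, 8532, 1960),
    (2, 8507, 1975),
    (3, 8532, 1988),
])
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", [
    (1, 1, 201826, "2015-03-02"),
    (2, 2, 201826, "2019-11-20"),
])

# A very small "cohort" query: persons with the condition, counted by gender.
rows = conn.execute("""
SELECT g.concept_name AS gender, COUNT(DISTINCT p.person_id) AS n
FROM condition_occurrence co
JOIN person  p ON p.person_id  = co.person_id
JOIN concept c ON c.concept_id = co.condition_concept_id
JOIN concept g ON g.concept_id = p.gender_concept_id
WHERE c.concept_name = 'Type 2 diabetes mellitus'
GROUP BY g.concept_name
ORDER BY g.concept_name
""").fetchall()

print(rows)
```

Because every site stores the same concept IDs in the same tables, a query like this can in principle run unchanged against any OMOP CDM database, which is what makes the model attractive as a common source format for synthetic data generation.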

  9. OMOP-CDM

    Open-source tools for clinical data analysis (cohort generation,
    characterization, incidence, estimation, prediction)

    International community developing tools, standards and
    best-practice recommendations for health data normalization
    and interoperability

    European federated data network for evidence generation
    from real-world data

  12. https://github.com/synthetichealth/synthea/

  13. https://github.com/synthetichealth/synthea/wiki/Module-Gallery

  14. https://github.com/OHDSI/ETL-Synthea

  15. pysynth

  18. https://github.com/deepmedicine/BEHRT

  20. https://github.com/cumc-dbmi/cehr-bert

  21. Incorporating temporal information from structured EHR data to improve prediction tasks

  22. https://github.com/rpoulain/CEHR-GAN-BERT

  23. Few-Shot Learning with Semi-Supervised Transformers for Electronic Health Records

  24. https://blog.research.google/2022/12/ehr-safe-generating-high-fidelity-and.html

  25. Generating high-fidelity and privacy-preserving synthetic electronic health records

  27. Diagram labels: encoder, decoder, generator, discriminator,
    random vector

  28. FIDELITY
    How similar is the synthetic data to the original training sets?
    Metrics: Kullback-Leibler (KL) divergence, pairwise
    correlation difference

    UTILITY
    How useful is the synthetic data for our downstream
    machine learning applications?
    Metrics: accuracy, F1-score, ROC, and AUC-ROC

    PRIVACY
    Has any sensitive data been inadvertently synthesized
    by our model?
    Tests: membership inference, re-identification and
    attribute inference attacks
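
The two fidelity metrics named on this slide can be sketched in plain Python. This is an illustrative sketch, not the evaluation code used in the talk: the function names, the epsilon smoothing in the KL estimate, and the choice of the Frobenius norm for the pairwise correlation difference (one common definition) are all assumptions.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, eps=1e-9):
    """KL(P_real || P_synth) for one categorical column.

    eps-smoothing (an assumption) keeps the result finite when a category
    appears in only one of the two datasets.
    """
    categories = sorted(set(real) | set(synthetic))
    k = len(categories)
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    total = 0.0
    for c in categories:
        p = (real_counts[c] + eps) / (len(real) + eps * k)
        q = (synth_counts[c] + eps) / (len(synthetic) + eps * k)
        total += p * math.log(p / q)
    return total


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat correlation as 0 (assumption)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Frobenius norm of the difference between the two correlation matrices.

    Each argument is a list of numeric columns, in the same column order.
    """
    n = len(real_cols)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = (pearson(real_cols[i], real_cols[j])
                    - pearson(synth_cols[i], synth_cols[j]))
            total += diff ** 2
    return math.sqrt(total)


# Toy data, invented for illustration.
real_gender = ["F", "F", "M", "M"]
synth_gender = ["F", "F", "F", "M"]
print(kl_divergence(real_gender, synth_gender))

real_cols = [[1, 2, 3, 4], [2, 4, 6, 8]]
synth_cols = [[1, 2, 3, 4], [8, 6, 4, 2]]  # correlation sign flipped
print(pairwise_correlation_difference(real_cols, synth_cols))
```

Both metrics are 0 when the synthetic data reproduces the real distribution exactly and grow as the synthetic data drifts away from it, so lower values indicate higher fidelity.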
