Generating synthetic data from OMOP-CDM databases for health applications

Alberto Labarga

Telecommunications Engineer
Head of Biomedical Data Hub @BSC
More than 20 years crunching data
open data - open source - open science

• Data: Sustain core data resources
• Tools: Services & connectors to drive access and exploitation
• Compute: Access, Exchange & Compute on sensitive data
• Standards: Integration and interoperability of data and services
• Training: Professional skills for managing and exploiting data
European life sciences infrastructure

Machine Learning Focus Group
• Standards for Machine Learning: This includes aspects such as controlled terminology/ontology and services for ML model description and sharing, alignment to the ELIXIR Tools and Interoperability Platforms, as well as defining best practices for Machine Learning-related reviewing.
• Machine Learning and reproducibility: This area focuses on the definition of the best practices for developing, sharing and reusing Machine Learning approaches (including, but not limited to, Machine Learning models, algorithms, frameworks and protocols including the DOME recommendations), while at the same time involving the existing approaches in the ELIXIR Tools Platform.
• Benchmarking of Machine Learning tools: In order to facilitate clear and objective comparison of ML-based tools, it is important to establish a benchmarking protocol; this may include datasets, protocols and services offered by the ELIXIR Tools Platform.
• Training for Machine Learning: Machine Learning has been identified by the ELIXIR Training Platform gap analysis task as an existing need. As such, a particular area of focus for this group will be to design and produce training resources for supporting the ELIXIR community, based on the standards and approaches established by the ELIXIR Training Platform.

• Academic and industrial researchers should have:
  • access to relevant, robust, and generalisable synthetic data generation methodologies
  • access to relevant, high quality synthetic datasets
• Thanks to better availability of robust synthetic datasets for training data models, healthcare providers and industry should have a wider range of performant AI-based and other data-driven tools to support diagnostics, personalised treatment decision-making and prediction of health outcomes.
Synthetic data

• Generation • Evaluation • Publication

OMOP-CDM
• The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence. A central component of the OMOP CDM is the OHDSI standardized vocabularies.
• The OHDSI vocabularies allow organization and standardization of medical terms to be used across the various clinical domains of the OMOP common data model and enable standardized analytics that leverage the knowledge base when constructing exposure and outcome phenotypes and other features within characterization, population-level effect estimation, and patient-level prediction studies.
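To make the role of the standardized vocabularies concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a heavily trimmed subset of three CDM tables (person, condition_occurrence, concept) with only a few of their columns, and every row is toy data invented for illustration. The concept IDs shown should be checked against the current OHDSI vocabularies before any real use. The helper names build_toy_cdm and condition_counts_by_gender are hypothetical and not part of any OHDSI tool.

```python
import sqlite3


def build_toy_cdm():
    """Create an in-memory database with a trimmed, toy subset of OMOP-CDM tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY,
            concept_name TEXT,
            domain_id TEXT,
            vocabulary_id TEXT
        );
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            gender_concept_id INTEGER,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
        """
    )
    # Toy rows for illustration only; verify concept IDs against the vocabularies.
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?)",
        [
            (8507, "MALE", "Gender", "Gender"),
            (8532, "FEMALE", "Gender", "Gender"),
            (201826, "Type 2 diabetes mellitus", "Condition", "SNOMED"),
        ],
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, 8532, 1960), (2, 8507, 1975), (3, 8532, 1988)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [(1, 1, 201826, "2015-03-01"), (2, 2, 201826, "2019-07-15")],
    )
    return conn


def condition_counts_by_gender(conn, condition_concept_id):
    """Count distinct persons with a condition, grouped by gender concept name."""
    rows = conn.execute(
        """
        SELECT g.concept_name, COUNT(DISTINCT p.person_id)
        FROM condition_occurrence AS co
        JOIN person AS p ON p.person_id = co.person_id
        JOIN concept AS g ON g.concept_id = p.gender_concept_id
        WHERE co.condition_concept_id = ?
        GROUP BY g.concept_name
        ORDER BY g.concept_name
        """,
        (condition_concept_id,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    cdm = build_toy_cdm()
    print(condition_counts_by_gender(cdm, 201826))
```

Because both the clinical events and the demographics are coded with concept IDs, the same query could in principle run unchanged against any database that follows the model, which is also what makes a synthetic OMOP-CDM dataset directly usable by existing analysis code.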

OMOP-CDM
• Open source tools for clinical data analysis (cohort generation, characterization, incidence, estimation, prediction)
• International community developing tools, standards and best-practice recommendations for health data normalization and interoperability
• European federated data network for evidence generation based on real-world data

https://github.com/synthetichealth/synthea/

https://github.com/synthetichealth/synthea/wiki/Module-Gallery

https://github.com/OHDSI/ETL-Synthea

pysynth

https://github.com/deepmedicine/BEHRT

https://github.com/cumc-dbmi/cehr-bert

Incorporating temporal information from structured EHR data to improve prediction tasks

https://github.com/rpoulain/CEHR-GAN-BERT

Few-Shot Learning with Semi-Supervised Transformers for Electronic Health Records

https://blog.research.google/2022/12/ehr-safe-generating-high-fidelity-and.html

Generating high-fidelity and privacy-preserving synthetic electronic health records

[Figure: architecture diagram with labels: encoder, decoder, generator, discriminator, random vector]

• FIDELITY: How similar is this synthetic data to the original training sets? Metrics: Kullback-Leibler (KL) divergence, pairwise correlation difference.
• UTILITY: How useful is this synthetic data for our downstream machine learning applications? Metrics: accuracy, F1-score, ROC, and AUC-ROC.
• PRIVACY: Has any sensitive data been inadvertently synthesized by our model? Tests: membership inference, re-identification and attribute inference attacks.
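The two fidelity metrics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the evaluation code behind any of the tools listed in this deck: KL divergence is computed over the empirical distribution of one categorical column, with the synthetic probability floored at a small eps to avoid division by zero, and pairwise correlation difference follows one common definition, the Frobenius norm of the difference between the two Pearson correlation matrices. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(real_values, synthetic_values, eps=1e-9):
    """KL(real || synthetic) over the empirical categorical distributions."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    kl = 0.0
    for category, count in real_counts.items():
        p = count / len(real_values)
        q = synth_counts[category] / len(synthetic_values)
        # Floor q at eps so categories missing from the synthetic data
        # give a large but finite penalty.
        kl += p * math.log(p / max(q, eps))
    return kl


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant column: treat correlation as zero
    return cov / (norm_x * norm_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Frobenius norm of the difference between the two correlation matrices."""
    names = sorted(real_cols)
    total = 0.0
    for a in names:
        for b in names:
            diff = pearson(real_cols[a], real_cols[b]) - pearson(
                synth_cols[a], synth_cols[b]
            )
            total += diff ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    # Toy columns invented for illustration.
    real_gender = ["F", "F", "M", "M"]
    synth_gender = ["F", "F", "F", "M"]
    print("KL divergence:", kl_divergence(real_gender, synth_gender))

    real = {"age": [30, 40, 50, 60], "n_visits": [1, 2, 3, 4]}
    synth = {"age": [30, 40, 50, 60], "n_visits": [4, 3, 2, 1]}
    print("PCD:", pairwise_correlation_difference(real, synth))
```

Both scores are zero when the synthetic data reproduces the real distribution or correlation structure exactly, and grow as the two diverge. In practice they would be computed per column or per table of the generated OMOP-CDM data.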
