Building an end-to-end open source data platform for clinical data

PyData London 2023

Data engineering has experienced enormous growth in recent years, allowing for rapid progress and innovation as more people than ever are thinking about data resources and how to better leverage them. In this tutorial, we will build an end-to-end modern data platform for the analysis of medical data using open-source tools and libraries.

We will start with an overview of the platform components, including data warehousing, data integration, data transformation, data orchestration, and data visualization. We will then dive into each component, exploring the technologies and tools that make up the platform.

We will use Python-based tools such as dbt, Apache Airflow, OpenMetadata, and Querybook to build the platform. We will walk through the process step by step: creating a data warehouse, integrating data from multiple sources, transforming the data, orchestrating data workflows, and visualizing the results.
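
As a rough illustration of the load-then-transform flow described above, here is a minimal sketch that uses only the Python standard library. It is not the tutorial's code: SQLite stands in for the data warehouse, and the table names, the `RAW_VISITS` records, and the functions `load_raw`, `transform`, and `run_pipeline` are all hypothetical. In the actual platform, the transformation step would be handled by dbt and the sequencing by Apache Airflow.

```python
import sqlite3

# Hypothetical raw records standing in for rows pulled from a source system.
RAW_VISITS = [
    ("p1", "2023-01-05", "hypertension"),
    ("p1", "2023-02-10", "hypertension"),
    ("p2", "2023-01-20", "diabetes"),
]


def load_raw(conn, rows):
    """Load step: copy source rows into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_visits "
        "(patient_id TEXT, visit_date TEXT, diagnosis TEXT)"
    )
    conn.executemany("INSERT INTO raw_visits VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform step: derive a summary table with SQL inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS visits_per_diagnosis")
    conn.execute(
        """
        CREATE TABLE visits_per_diagnosis AS
        SELECT diagnosis,
               COUNT(*) AS n_visits,
               COUNT(DISTINCT patient_id) AS n_patients
        FROM raw_visits
        GROUP BY diagnosis
        """
    )


def run_pipeline(rows):
    """Run load, then transform, and return the summary rows."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    load_raw(conn, rows)
    transform(conn)
    result = conn.execute(
        "SELECT diagnosis, n_visits, n_patients "
        "FROM visits_per_diagnosis ORDER BY diagnosis"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_pipeline(RAW_VISITS):
        print(row)
```

The raw table is kept as loaded and the summary is rebuilt from it with SQL, so the transformation can be rerun or changed without re-extracting the source data.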

Attendees will benefit from this tutorial if they are interested in learning how to build an end-to-end modern data platform for biomedical data using Python-based tools. They will also benefit from learning about the open-source tools and libraries used in the tutorial, which they can then apply to their own data engineering projects.

No specific background knowledge is needed to attend this tutorial, although familiarity with Python and basic data engineering concepts will be helpful. All materials will be available on GitHub (https://github.com/bsc-health-data/pydatalondon23-modern-data-stack), and attendees will have the opportunity to follow along and build the platform themselves.

Alberto Labarga

June 04, 2023
Transcript

  1. Telecommunications engineer, working as Head of Health Data @BSC. More than 20 years teaching open data, open source, and open science.
  2. The modern data stack is one framework used to conceptualize how different data tools work together to allow a complete data journey.
  3. Modern Data Stack • Infrastructure as code • From data warehouse to lakehouse • From ETL to ELT • Data governance and observability
  4. Modern Data Stack • Infrastructure as code • From data warehouse to lakehouse • From ETL to ELT • Data governance and observability • Self-service analytics