Modern data observability

OpenLineage as the foundation of a modern Data Platform

Posedio

May 13, 2025

Transcript

  1. Do it RIGHT. Data platform builders
    1. Without data observability you can NOT have a data platform.
    2. If your data observability captures only the T in ELT (or ETL), you have none.
    Spice level: testing boundaries 7/10 🌶
  2. Target audience
    • Data engineers
    • Their bosses
    • Data platform builders in general
    • People who enjoy spicy food opinions
  3. Challenge: what is the status of the consumers?
    From grafana.com/solutions/apache-airflow/monitor/
  4. Data pipelines
    You may store all the data and even know all its schemas.
  5. Data pipelines
    You may even see all DAGs, but do you know (all) the downstream consumers?
  6. Data pipelines
    Or can you tell when / how upstream changed?
  7. Pain points of traditional observability
    Reactive alerts
    • Focus on alerting after the problem has occurred.
    • “What does my metadata table say?”
    Siloed monitoring
    • Logs, metrics, and traces are treated as isolated pieces.
    • They know nothing of each other.
    Costly failures
    • Issues detected too late can lead to missed SLAs, resulting in costly consequences.
  8. Data lineage is not (distributed) tracing
    • Important: data lineage focuses on the lifecycle of data within an organization's data ecosystem, while event tracing is more concerned with monitoring the flow of individual events or requests through distributed systems for troubleshooting and performance analysis.
  9. What do we want?
    Proactive alerts
    • Warn me about potential errors in my work.
    Treat observability like data
    • Let me browse the metadata about my pipelines, just like I do any other data.
    Automatic SLAs
    • Calculate the SLAs from my SLO so I can focus on firefighting when I (really) need to.
  10. If I could make a wish
    We need answers:
    1. What is the data source?
    2. What is the schema?
    3. Who is the owner?
    4. How often is it updated?
    5. Where does it come from?
    6. Who is using it?
    7. What has changed?
    8. & many more…
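Two of these questions, "Where does it come from?" and "Who is using it?", reduce to graph traversal once every job run reports its inputs and outputs. The following is a minimal sketch under that assumption; the event shape, job names, and dataset names are hypothetical and much simpler than real lineage events.

```python
from collections import defaultdict

# Hypothetical, simplified lineage events: each job reads inputs and writes outputs.
EVENTS = [
    {"job": "ingest_orders", "inputs": ["postgres.orders"], "outputs": ["lake.raw_orders"]},
    {"job": "clean_orders", "inputs": ["lake.raw_orders"], "outputs": ["lake.orders"]},
    {"job": "revenue_report", "inputs": ["lake.orders"], "outputs": ["bi.revenue"]},
]


def build_downstream(events):
    """Map each dataset to the datasets directly derived from it."""
    graph = defaultdict(set)
    for event in events:
        for source in event["inputs"]:
            graph[source].update(event["outputs"])
    return graph


def invert(graph):
    """Flip edge direction: map each dataset to its direct sources."""
    inverted = defaultdict(set)
    for source, targets in graph.items():
        for target in targets:
            inverted[target].add(source)
    return inverted


def reachable(graph, start):
    """Return every dataset reachable from `start` (iterative depth-first search)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


downstream = build_downstream(EVENTS)
upstream = invert(downstream)

consumers = reachable(downstream, "lake.raw_orders")  # "Who is using it?"
sources = reachable(upstream, "lake.orders")          # "Where does it come from?"
```

The same traversal answers the earlier question about downstream consumers: anything in `consumers` is affected when `lake.raw_orders` changes.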
  11. Who here has?
    • Broken down larger DAGs into smaller ones?
    • Tried to switch to more frequent pipelines in order to minimize failures and increase availability?
    • Tried adding a control DAG for end-to-end pipeline visibility?
    • Offered standardized task groups to your users for easier onboarding and management?
    • Written a metadata exporter (or two)?
  12. Congratulations, you are in great company
    From the 2024 Airflow Summit by Astronomer
  13. What happened during the pipeline run?
    • The best moment to add metadata is the moment the data itself is captured and/or processed.
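One way to capture metadata at processing time is to wrap the work itself, so that a start event and a terminal event are recorded as a side effect of running the job. The sketch below assumes an in-memory list as a stand-in for a lineage backend; `emit`, `tracked_run`, and the `example` namespace are hypothetical names, and the event layout only loosely follows the OpenLineage run event shape.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

EMITTED = []  # hypothetical stand-in for a lineage backend


def emit(event_type, run_id, job_name, inputs, outputs, run_facets=None):
    """Record one simplified run event."""
    EMITTED.append({
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id, "facets": run_facets or {}},
        "job": {"namespace": "example", "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
    })


@contextmanager
def tracked_run(job_name, inputs, outputs):
    """Emit START before the work and COMPLETE or FAIL after it."""
    run_id = str(uuid.uuid4())
    emit("START", run_id, job_name, inputs, outputs)
    try:
        yield run_id
    except Exception as exc:
        # Attach the failure to the run, then let the error propagate.
        emit("FAIL", run_id, job_name, inputs, outputs,
             run_facets={"errorMessage": {"message": str(exc)}})
        raise
    else:
        emit("COMPLETE", run_id, job_name, inputs, outputs)


raw = [{"namespace": "lake", "name": "raw_orders"}]
clean = [{"namespace": "lake", "name": "orders"}]

with tracked_run("clean_orders", raw, clean):
    pass  # the actual processing would happen here
```

Because the events are produced by the run itself, nobody has to reconstruct what happened from logs afterwards.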
  14. OpenLineage.io
    An open platform for collection and analysis of data lineage
    • A metadata repository reference implementation (Marquez)
    • Libraries for common languages and integrations with data pipeline tools
  15. OpenLineage.io
    An open standard for data lineage collection and analysis
    • Data lineage is the foundation for a new generation of powerful, context-aware data tools and best practices. OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used.
  16. Dataset Facet
    • Schema
    • Data source
    • Life cycle state change (e.g., alter, create, drop, overwrite, rename, truncate)
    • Column lineage
    • Ownership
    • Version
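To make the dataset facets concrete, here is a sketch of a dataset carrying schema, ownership, and version facets as plain dictionaries, plus a small diff that reports what changed between two schema versions. The facet keys follow my reading of the OpenLineage spec, with the `_producer` and `_schemaURL` bookkeeping fields left out for brevity; the dataset, owner, and column names are made up, and `diff_schemas` is a hypothetical helper, not part of any library.

```python
def schema_facet(fields):
    """Build a simplified schema facet from (name, type) pairs."""
    return {"fields": [{"name": name, "type": type_} for name, type_ in fields]}


# Hypothetical dataset as it was reported by an earlier run.
orders_v1 = {
    "namespace": "lake",
    "name": "orders",
    "facets": {
        "schema": schema_facet([("id", "BIGINT"), ("amount", "DECIMAL"), ("note", "VARCHAR")]),
        "ownership": {"owners": [{"name": "team:sales-data"}]},
        "version": {"datasetVersion": "1"},
    },
}

# Schema reported by the latest run.
latest_schema = schema_facet([("id", "BIGINT"), ("amount", "DOUBLE"), ("currency", "VARCHAR")])


def diff_schemas(before, after):
    """Report added, removed, and retyped columns between two schema facets."""
    old = {f["name"]: f["type"] for f in before["fields"]}
    new = {f["name"]: f["type"] for f in after["fields"]}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(n for n in new.keys() & old.keys() if new[n] != old[n]),
    }


changes = diff_schemas(orders_v1["facets"]["schema"], latest_schema)
```

A diff like this, combined with the ownership facet, is enough to tell the right team that a column disappeared or changed type.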
  17. Job Facet
    • Source code location
    • Source code
    • Job type
    • Ownership
    • Query plan
    • Documentation
    • SQL query
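Job facets become useful once a platform can rely on them being present. As a sketch, the check below lists which facets a job is missing; the required set is a hypothetical platform policy, the facet keys follow my reading of the OpenLineage spec (without the `_producer` and `_schemaURL` fields), and the job names, query, and URL are placeholders.

```python
# Hypothetical platform policy: facets every job is expected to carry.
REQUIRED_JOB_FACETS = {"documentation", "ownership", "sourceCodeLocation"}


def missing_facets(job):
    """Return the required facet names that the job does not provide."""
    return sorted(REQUIRED_JOB_FACETS - set(job.get("facets", {})))


documented_job = {
    "namespace": "example",
    "name": "clean_orders",
    "facets": {
        "documentation": {"description": "Deduplicates raw orders."},
        "ownership": {"owners": [{"name": "team:sales-data"}]},
        "sourceCodeLocation": {"type": "git", "url": "https://example.com/repo/clean_orders.sql"},
        "sql": {"query": "INSERT INTO lake.orders SELECT DISTINCT * FROM lake.raw_orders"},
    },
}

bare_job = {
    "namespace": "example",
    "name": "adhoc_export",
    "facets": {"sql": {"query": "SELECT * FROM lake.orders"}},
}

documented_gaps = missing_facets(documented_job)
bare_gaps = missing_facets(bare_job)
```

Running such a check on incoming events turns "Who is the owner?" from a question asked during an incident into one answered up front.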
  18. Run Facet
    • Scheduled time
    • Environment properties
    • Processing engine version
    • Parent (run)
    • Error message
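The run facets can be sketched the same way. The helper below assembles a run with a scheduled time, an optional parent run, and an optional error message. `run_with_facets` and all names and values in it are hypothetical; the facet keys (`nominalTime`, `parent`, `errorMessage`) follow my reading of the OpenLineage spec and should be checked against the published schemas.

```python
import uuid
from datetime import datetime, timezone

PARENT_RUN_ID = str(uuid.uuid4())  # run of the hypothetical orchestrating DAG


def run_with_facets(scheduled_for, parent_run_id=None, error=None):
    """Build a simplified run object with the facets listed above."""
    facets = {"nominalTime": {"nominalStartTime": scheduled_for.isoformat()}}
    if parent_run_id is not None:
        facets["parent"] = {
            "run": {"runId": parent_run_id},
            "job": {"namespace": "example", "name": "daily_orders_dag"},
        }
    if error is not None:
        facets["errorMessage"] = {
            "message": str(error),
            "programmingLanguage": "python",
        }
    return {"runId": str(uuid.uuid4()), "facets": facets}


scheduled = datetime(2025, 5, 13, 6, 0, tzinfo=timezone.utc)
ok_run = run_with_facets(scheduled, parent_run_id=PARENT_RUN_ID)
failed_run = run_with_facets(scheduled, parent_run_id=PARENT_RUN_ID, error=ValueError("boom"))
```

The parent facet is what links a single task run back to the DAG run that triggered it, regardless of which orchestrator produced the event.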
  19. Nice calm life
    • If all of your data lies within a single vendor,
    • If you use only one orchestrator,
    • If you use only one processing engine,
    • If you do not care about your data sources,
    • If you can change fields or their type without anyone realizing (before it's too late),
    • If you are responsible both for the data preparation and the presentation.
  20. Astronomer
    • Can create data products (collections of pipelines and data sources) that hold SLOs and can calculate SLAs for you.
  21. Dagster
    • Assets help focus the pipelines on the results and ingredients instead of the steps, but it is nevertheless single-tool oriented.
  22. IBM Databand
    • Impressive out-of-the-box (promised) experience, but the army of SDKs and integration paths is custom, not based on OpenLineage.
  23. Foundational.io
    They claim:
    • Automated integration testing via dependency graphs and schema enforcement during PRs.
    • End-to-end data lineage, albeit only based on SQL code.
    • Policy enforcement, automated by code and based on data contracts.