Modern data observability - OpenLineage as the foundation of a modern Data Platform

Do it RIGHT. Hot take 🌶🔥

Data platform builders
1. Without data observability you can NOT have a data platform.
2. If your data observability captures only the T in ELT (or ETL), you have none.
Spice level: testing boundaries 7/10 🌶

Target audience
• Data engineers
• Their bosses
• Data platform builders in general
• People who enjoy spicy food opinions

What is meant by observability?

Logging
From astronomer.io/docs/learn/logging/

Metrics
From tantusdata.com/insights/monitoring-airflow-jobs-with-tig-1-system-metrics/

Tracing
From cloud.google.com/architecture/microservices-architecture-distributed-tracing

Pipeline observability

Challenge: what is the status of the consumers?
From grafana.com/solutions/apache-airflow/monitor/

Data pipelines
Getting data can mean many things

Data pipelines
Hopefully, you have the data provenance

Data pipelines
You may store all the data and even know all its schemas

Data pipelines
You may even see all DAGs, but do you know (all) the downstream consumers?
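The downstream-consumer question can be answered mechanically once lineage is recorded as "job reads these datasets, writes those". A minimal sketch, assuming a hand-written edge list; the job and dataset names are hypothetical and not taken from the talk:

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (job, input datasets, output datasets).
LINEAGE = [
    ("ingest_orders", ["crm.orders_api"], ["raw.orders"]),
    ("clean_orders", ["raw.orders"], ["staging.orders"]),
    ("build_revenue", ["staging.orders"], ["marts.revenue"]),
    ("train_churn_model", ["staging.orders"], ["ml.churn_features"]),
]


def downstream_consumers(dataset, lineage):
    """Return (jobs, datasets) that directly or transitively depend on `dataset`."""
    readers = defaultdict(list)  # dataset -> jobs that read it
    writes = {}                  # job -> datasets it writes
    for job, inputs, outputs in lineage:
        writes[job] = outputs
        for name in inputs:
            readers[name].append(job)

    jobs, datasets = set(), set()
    queue = deque([dataset])
    while queue:  # breadth-first walk along read/write edges
        current = queue.popleft()
        for job in readers[current]:
            if job in jobs:
                continue  # already visited, also protects against cycles
            jobs.add(job)
            for out in writes[job]:
                if out not in datasets:
                    datasets.add(out)
                    queue.append(out)
    return jobs, datasets


jobs, datasets = downstream_consumers("raw.orders", LINEAGE)
print(sorted(jobs))
print(sorted(datasets))
```

Seeing every DAG in one orchestrator is not enough for this walk to be complete: the edge list has to include consumers that live in other tools.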

Data pipelines
Or can you tell when / how upstream changed?
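If the schema of each dataset version is captured as metadata, "how did upstream change?" becomes a diff between two snapshots. A small sketch, assuming schemas are stored as plain column-to-type mappings; the column names and types are made up for illustration:

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and report what changed."""
    added = {c: new[c] for c in new.keys() - old.keys()}
    removed = {c: old[c] for c in old.keys() - new.keys()}
    retyped = {
        c: (old[c], new[c])
        for c in old.keys() & new.keys()
        if old[c] != new[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schema snapshots from two consecutive runs.
yesterday = {"order_id": "BIGINT", "amount": "DECIMAL(10,2)", "country": "VARCHAR"}
today = {"order_id": "VARCHAR", "amount": "DECIMAL(10,2)", "currency": "VARCHAR"}

changes = diff_schema(yesterday, today)
for kind, columns in changes.items():
    print(kind, columns)
```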

Is observability enough?

Pain points of traditional observability
Reactive alerts
• Focus on alerting after the problem has occurred.
• “What does my metadata table say”
Siloed monitoring
• Logs, metrics, and traces are treated as isolated pieces.
• They know nothing of each other.
Costly failures
• Issues detected too late can lead to missed SLAs, resulting in costly consequences.

Data lineage is not (distributed) tracing
• Important: data lineage focuses on the lifecycle of data within an organization's data ecosystem, while event tracing is more concerned with monitoring the flow of individual events or requests through distributed systems for troubleshooting and performance analysis.

What do we want?
Proactive alerts
• Warn me about potential errors in my work.
Treat observability like data
• Let me browse the metadata about my pipelines, just like I do any other data.
Automatic SLAs
• Calculate the SLAs from my SLO so I can focus on firefighting when I (really) need to.
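One possible reading of the "Automatic SLAs" wish is that the platform evaluates recorded runs against a declared objective instead of leaving that to a hand-maintained metadata table. The sketch below assumes a simple objective of the form "N percent of runs succeed within M minutes"; the `Run` record, the thresholds, and the report fields are illustrative assumptions, not something the talk specifies:

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    duration_minutes: float
    succeeded: bool


def slo_report(runs, max_minutes, target_percent):
    """Evaluate runs against 'target_percent of runs succeed within max_minutes'."""
    if not runs:
        raise ValueError("no runs to evaluate")
    good = sum(1 for r in runs if r.succeeded and r.duration_minutes <= max_minutes)
    bad = len(runs) - good
    # Integer arithmetic avoids float rounding when computing the error budget.
    allowed_bad = len(runs) * (100 - target_percent) // 100
    return {
        "attainment": good / len(runs),
        "budget_left": allowed_bad - bad,
        "met": bad <= allowed_bad,
    }


# Ten hypothetical runs: eight fine, one slow, one failed.
runs = [Run(f"run-{i}", 20.0, True) for i in range(8)]
runs.append(Run("run-8", 75.0, True))
runs.append(Run("run-9", 10.0, False))

report = slo_report(runs, max_minutes=60, target_percent=90)
print(report)
```

A negative `budget_left` is the signal to start firefighting; a positive one means the objective still holds and no alert is needed.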

If I could make a wish
We need answers:
1. What is the data source?
2. What is the schema?
3. Who is the owner?
4. How often is it updated?
5. Where does it come from?
6. Who is using it?
7. What has changed?
8. & many more…

Ask the audience

Who here has?
• Broken down larger DAGs into smaller ones?
• Tried to switch to more frequent pipelines in order to minimize failures and increase availability?
• Tried adding a control DAG for end-to-end pipeline visibility?
• Offered standardized task groups to your users for easier onboarding and management?
• Written a metadata exporter (or two)?

Congratulations, you are in great company
From the 2024 Airflow Summit by Astronomer

Shift in perspective
From the 2024 Airflow Summit by Astronomer

Shift in perspective
From the 2024 Airflow Summit by Astronomer

How to capture the data?

What happened during the pipeline run?
• The best moment in time to add metadata is the moment the data itself is captured and / or processed.
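Capturing metadata at processing time can be sketched as a wrapper that emits events around the work itself, rather than reconstructing them afterwards. The example below is a simplified illustration: the event shape is only loosely modeled on OpenLineage run events, the `error` key is an invented shortcut, and `EVENTS` is a stand-in for a real lineage backend.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

EVENTS = []  # stand-in for a lineage backend (assumption for this sketch)


def emit(event_type, run_id, job_name, **extra):
    """Record one event at the moment it happens."""
    EVENTS.append({
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "example", "name": job_name},
        **extra,
    })


@contextmanager
def tracked_run(job_name, inputs, outputs):
    """Emit START before the work and COMPLETE or FAIL right after it."""
    run_id = str(uuid.uuid4())
    emit("START", run_id, job_name, inputs=inputs, outputs=outputs)
    try:
        yield run_id
    except Exception as exc:
        emit("FAIL", run_id, job_name, error=repr(exc))  # simplified error field
        raise
    else:
        emit("COMPLETE", run_id, job_name, inputs=inputs, outputs=outputs)


# Hypothetical datasets and processing step.
source = [{"namespace": "example", "name": "raw.orders"}]
target = [{"namespace": "example", "name": "staging.orders"}]

with tracked_run("clean_orders", source, target):
    rows = [r for r in [{"id": 1}, {"id": None}] if r["id"] is not None]

print([e["eventType"] for e in EVENTS])
```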

OpenLineage.io
An open platform for collection and analysis of data lineage
• A metadata repository reference implementation (Marquez)
• Libraries for common languages
• Integrations with data pipeline tools

Modern example of data lineage tooling
From marquezproject.io

OpenLineage.io
An open standard for data lineage collection and analysis
• Data lineage is the foundation for a new generation of powerful, context-aware data tools and best practices. OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used.

Dataset

Dataset Facet
• Schema
• Data source
• Life cycle state change (e.g., alter, create, drop, overwrite, rename, truncate)
• Column lineage
• Ownership
• Version
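To make the facet idea concrete, here is a rough sketch of a dataset carrying three of the facets listed above, written as plain JSON-style data. The facet keys and field names follow my reading of the OpenLineage spec and should be checked against it; the producer URL, the facet schema URLs, and the dataset itself are placeholders invented for this example.

```python
import json

PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer identifier


def facet(facet_name, **body):
    """Wrap facet fields with the bookkeeping keys each facet carries."""
    return {
        "_producer": PRODUCER,
        # Placeholder URL: real facet schema URLs are published with the spec.
        "_schemaURL": f"https://example.com/facet-schemas/{facet_name}.json",
        **body,
    }


dataset = {
    "namespace": "postgres://warehouse:5432",  # hypothetical data source
    "name": "public.orders",
    "facets": {
        "schema": facet(
            "SchemaDatasetFacet",
            fields=[
                {"name": "order_id", "type": "BIGINT"},
                {"name": "amount", "type": "DECIMAL(10,2)"},
            ],
        ),
        "ownership": facet(
            "OwnershipDatasetFacet",
            owners=[{"name": "team:data-platform", "type": "MAINTAINER"}],
        ),
        "lifecycleStateChange": facet(
            "LifecycleStateChangeDatasetFacet",
            lifecycleStateChange="OVERWRITE",
        ),
    },
}

print(json.dumps(dataset, indent=2))
```

Because facets are optional, self-describing attachments, a producer can start with the schema alone and add ownership or column lineage later without changing the dataset's identity.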

Job

Job Facet
• Source code location
• Source code
• Job type
• Ownership
• Query plan
• Documentation
• SQL query

Run

Run Facet
• Scheduled time
• Environment properties
• Processing engine version
• Parent (run)
• Error message
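Dataset, job, and run come together in a single run event. The sketch below assembles a failed run with a scheduled-time facet and an error-message facet on the run, plus a SQL facet on the job. It uses only the standard library; the facet keys follow my reading of the OpenLineage spec, while the spec version in `SPEC_URL`, the facet schema URLs, and all names and values are assumptions to verify before real use.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer identifier
# Assumed spec version; check the OpenLineage spec for the current one.
SPEC_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def facet(facet_name, **body):
    """Wrap facet fields with producer and (placeholder) schema URL."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": f"https://example.com/facet-schemas/{facet_name}.json",
        **body,
    }


def build_fail_event(job_name, scheduled_for, error, query):
    """Assemble a FAIL run event for one job run."""
    return {
        "eventType": "FAIL",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SPEC_URL,
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                "nominalTime": facet("NominalTimeRunFacet", nominalStartTime=scheduled_for),
                "errorMessage": facet(
                    "ErrorMessageRunFacet",
                    message=str(error),
                    programmingLanguage="python",
                ),
            },
        },
        "job": {
            "namespace": "example",
            "name": job_name,
            "facets": {"sql": facet("SQLJobFacet", query=query)},
        },
        "inputs": [{"namespace": "example", "name": "raw.orders"}],
        "outputs": [{"namespace": "example", "name": "staging.orders"}],
    }


event = build_fail_event(
    "clean_orders",
    "2024-10-01T06:00:00Z",
    ValueError("column order_id not found"),
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
)
print(json.dumps(event, indent=2))
```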

You DO NOT need data lineage…

Nice calm life
• If all of your data lies within a single vendor,
• If you use only one orchestrator,
• If you use only one processing engine,
• If you do not care about your data sources,
• If you can change fields or their type without anyone realizing (before it's too late),
• If you are responsible both for the data preparation and the presentation.

Industry status quo

Our customers are very happy
• Snowflake (Polaris)
• Databricks (Unity)

We store our customers' data
• AWS
• Azure
• GCP

Astronomer
• Able to create data products (collections of pipelines and data sources), which hold SLOs and can calculate SLAs for you.

Dagster
• Assets help focus the pipelines on the results and ingredients instead of the steps, but it is nevertheless single-tool oriented.

IBM Databand
• Impressive out-of-the-box (promised) experience. But the army of SDKs and integration paths is custom, not based on OpenLineage.

Foundational.io
They claim:
• Automated integration testing via dependency graphs and schema enforcement during PRs.
• End-to-end data lineage, albeit only based on SQL code.
• Policy enforcement, automated by code and based on data contracts.

THANK YOU!
CONTACT US:
Weyringergasse 1-3/DG
1040 Wien
www.posedio.com
ofﬁ[email protected]
