
Design Foundational Data Engineering Observability

suci
September 06, 2025


Abstract

Data pipelines face unique challenges: they often fail silently or suffer unnoticed data quality degradation that impacts downstream analytics, ML models, and business decisions. Standard application monitoring falls short. Foundational data engineering observability requires a dedicated approach that monitors system health (logs, metrics, traces), data pipeline jobs, and data-centric viewpoints. This talk introduces the essential data pillars (Freshness, Volume, Distribution, Schema, and Lineage) and explores practical Python implementation approaches. I'll introduce foundational techniques using libraries like OpenTelemetry, data quality tools (e.g., Great Expectations, dbt test), and custom scripts/metrics to establish baseline monitoring. Building this solid foundation is the critical first step towards enabling the advanced data-driven insights and deep correlation associated with the next generation of observability for data pipelines.

In PyConTW 2025

Transcript

  1. About Me
     Find me on sciwork (member). Working in Smart Manufacturing & AI, with data and people.
     Interested in:
     • Agile/Engineering Culture/Developer Experience
     • Team Coaching
     • Data Engineering
  2. Agenda
     • Stories in Smart Pizza & AI
     • Common Data Engineering Challenges
     • From Monitoring to Observability
     • Observability and Data Observability
     • The Evolution of Data Observability
     • Data Observability Design Patterns
     • How to Do: Start with Basic
     • Takeaway + Q&A
  3. ETL/ELT
     Inspired by Joe Reis and Matt Housley, "Chapter 1," in Fundamentals of Data Engineering (O'Reilly Media, 2022).
  4. Simplistic Data Flow
     Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
     • Data movement as flow
     • Moving data content from A to B
  5. Data Engineering Challenges: How to Manage Pipelines Effectively
     • Complex data pipelines
     • Inability to centrally view pipelines:
       - Limited data asset discoverability
       - Error detection and root cause analysis
       - Scattered monitoring and time-consuming troubleshooting
       - Optimization and monitoring of workloads at scale
       - Inefficient pipelines that negatively …
  6. From Monitoring to Observability: Why Is Traditional Monitoring Not Enough?
     Monitoring: WHERE the issue is
     • Measures and reports specific metrics in a system.
     • Reactive: collects data to identify abnormal systems.
     • WHEN and WHAT did the system error occur?
     • Smart Pizza & AI example:
       ◦ Checking only if the pizza oven is on and the thermometer is working.
       ◦ Alerts when "Order Sync Task failed" or "Order DB CPU > 90%".
     Observability: WHY it happened
     • Collects metrics, events, logs, and traces across distributed systems.
     • Proactive: investigates root causes of abnormal systems.
     • WHY and HOW did the system error occur?
     • Smart Pizza & AI example:
       ◦ A smart oven that not only tracks temperature but also analyzes pizza color, dough rise, and past baking data to diagnose why it burned.
       ◦ The accuracy dropped because a schema change in an upstream API nullified the pizza_type field.
  7. Three Key Focus Areas of Data Observability
     Infrastructure
     • Focus: hardware & services running pipelines
     • Metrics: CPU, memory, disk, network
     • What we want to know: is the ML cluster (training AI models) overloaded?
     Pipeline
     • Focus: data transfer & processing flow
     • Metrics: task duration, success/failure rate, retries
     • What we want to know: did the daily ETL for orders finish on time?
     Data
     • Focus: data content, structure & quality
     • Metrics: freshness, volume, distribution, schema, lineage
     • What we want to know: are order fields valid? Any anomaly in new order volume?
  8. Data Observability across the data flow
     Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
     • Infrastructure: CPU, memory, disk, network
     • Pipeline: task duration, retries, success & failure rate
     • Data: freshness, volume, distribution, schema, lineage
  9. Data Observability Design Patterns
     • Data Detectors: Flow Interruption Detector, Skew Detector
     • Time Detectors: Lag Detector, SLA Misses Detector
     • Lineage Trackers: Dataset Tracker, Fine-Grained Tracker
     (Konieczny, Bartosz. Data Engineering Design Patterns: Recipes for Solving the Most Common Data Engineering Problems. O'Reilly Media, 2025.)
  10. Flow Interruption Detector (the Pizza Supply Alarm)
     Problem
     • The real-time order sync job ran fine for 7 months.
     • One day it processed input but didn't write output.
     • No error was triggered, so the issue was discovered only when a team reported missing data.
     Observability Solution
     Two types of data pipelines:
     • Continuous delivery: alert if no new orders arrive within 1 minute.
     • Irregular delivery: allow gaps, but alert if a gap exceeds a threshold.
     Other signals to watch across the pipeline:
     • Last job run time
     • Last data updated time
     • Metadata updated, but no data updated
     Challenges
     • Beware of false positives (thresholds, schema changes, compaction).
  11. Skew Detector
     Problem (unbalanced pizza orders): 90% of orders are Hawaiian, so the workload overloads one "partition."
     ◦ Processing skew (capacity bottlenecks)
     ◦ Inventory skew (imbalance)
     ◦ Customer skew (experience risk)
     ◦ Decision skew (business bias)
     Observability Solution
     1. Identify the comparison window.
     2. Set a tolerance threshold.
     3. Calculate skew, e.g. window-to-window comparison or STDDEV/AVG.
     Challenges
     • Seasonality (e.g. Mango Pizza in summer)
     • Communication (inf sync)
     • Fatality loop
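The STDDEV/AVG variant of step 3 is the coefficient of variation of per-partition counts. A minimal sketch, assuming per-partition record counts are available and using an illustrative tolerance of 0.5:

```python
from statistics import mean, stdev

def partition_skew(counts: list[int]) -> float:
    """STDDEV/AVG (coefficient of variation) of per-partition record
    counts; higher values mean a more skewed distribution."""
    avg = mean(counts)
    return stdev(counts) / avg if avg else 0.0

def is_skewed(counts: list[int], tolerance: float = 0.5) -> bool:
    """Step 2-3: compare the computed skew to a tolerance threshold."""
    return partition_skew(counts) > tolerance

# 90% of orders (Hawaiian) land on one partition -> skewed.
print(is_skewed([900, 30, 40, 30]))     # True
print(is_skewed([240, 260, 250, 250]))  # False
```

Seasonality is the hard part: a Mango Pizza spike in summer is legitimately skewed, so the tolerance (or the comparison window) usually has to be tuned per season.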
  12. Lag Detector: Monitoring Latency
     Problem
     • Last week, order volume suddenly increased by 30%, but ovens A and B were still busy with the previous batch. Their baking speed fell behind the pace of new incoming orders.
     • As a result, some customers (downstream consumers) started complaining about longer wait times and a poor dining experience.
     Observability Solution
     1. Define the lag unit (record offset, commit number, partition timestamp).
     2. Compare the last available unit with the last processed unit.
     3. Aggregate the results:
        ◦ MAX = worst-case lag
        ◦ P90/P95 = percentile lag view
     4. Beware the average trap: an average can hide real latency issues.
     Challenges
     • Data skew (usually from partitioning)
     • Mapping latency to business impact
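Steps 2-4 can be sketched with record offsets as the lag unit. This is an illustrative example (partition names and offsets are made up) showing why MAX and P95 matter: the average looks mild while one "oven" is far behind:

```python
def lag_report(last_available: dict[str, int],
               last_processed: dict[str, int]) -> dict[str, float]:
    """Per-partition lag = last available offset - last processed offset,
    aggregated as MAX / AVG / P95 so an average cannot hide a hot partition."""
    lags = sorted(last_available[p] - last_processed[p] for p in last_available)
    n = len(lags)
    p95 = lags[min(n - 1, int(0.95 * n))]  # simple nearest-rank percentile
    return {"max": max(lags), "avg": sum(lags) / n, "p95": p95}

report = lag_report(
    last_available={"p0": 1000, "p1": 1000, "p2": 1000},
    last_processed={"p0": 995, "p1": 990, "p2": 400},  # oven p2 fell behind
)
print(report)  # avg is ~205, but MAX exposes the 600-record lag on p2
```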
  13. SLA Misses Detector: Ensuring On-Time Data Delivery
     Problem
     • Unpredictable factors make it hard to consistently meet the 40-minute SLA, causing trust and dependency issues for downstream consumers.
     Observability Solution
     • Measure job execution time against the SLA threshold.
     • Batch job: track start → end time.
     • Streaming job: use microbatch/event windows, or the record-level read/write difference.
     • Alert when execution time exceeds the SLA.
     Challenges
     • Simple for batch, complex for streaming.
     • Late data and event time need to be handled separately.
     • An SLA miss is not always a lag issue (e.g., a skewed partition).
     • Both processing-time and event-time SLAs should be monitored.
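The batch case is the simplest to wire up: compare the tracked start → end time against the SLA. A minimal sketch using the slide's 40-minute SLA (the function name is illustrative):

```python
from datetime import datetime, timedelta

SLA = timedelta(minutes=40)  # the 40-minute delivery promise from the slide

def sla_missed(start: datetime, end: datetime,
               sla: timedelta = SLA) -> bool:
    """Batch case: True if the tracked start -> end execution time
    exceeded the SLA threshold (i.e. an alert should fire)."""
    return (end - start) > sla

start = datetime(2025, 9, 6, 18, 0)
print(sla_missed(start, start + timedelta(minutes=55)))  # True: SLA miss
print(sla_missed(start, start + timedelta(minutes=35)))  # False: on time
```

The streaming case needs more care, since a processing-time SLA (how long the job ran) and an event-time SLA (how old the data is when delivered) can diverge under late data.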
  14. Dataset Tracker
     Problem
     A big customer orders 100 pizzas for a company event. When the pizzas arrive:
     • Some are smaller than usual.
     • Some taste too salty.
     • A few have the wrong cheese on top.
     No one can clearly explain which step caused the failure.
     Observability Solution
     • Build a "family tree" of datasets (who provides, who consumes).
     • Make dependencies between datasets and teams visible.
     Two approaches:
     1. Managed services (e.g. Databricks Unity Catalog, GCP Dataplex): automatic lineage, but limited scope.
     2. Custom implementation: define inputs/outputs in pipelines or queries and connect them to lineage/metadata tools (OpenLineage, DataHub, OpenMetadata).
     Challenges
     • Vendor lock-in if only cloud-managed tools are used.
     • More custom work if pipelines contain unusual tasks.
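The "custom implementation" approach can be sketched as a toy in-memory dataset tracker. All class, job, and dataset names here are invented for illustration; a real setup would emit these input/output facts to OpenLineage, DataHub, or OpenMetadata instead of keeping them in dicts:

```python
from collections import defaultdict

class DatasetTracker:
    """Toy 'family tree' of datasets: record each job run's inputs and
    outputs, then answer who-feeds-whom questions."""

    def __init__(self) -> None:
        self.producers: dict[str, str] = {}          # dataset -> producing job
        self.upstreams = defaultdict(set)            # dataset -> input datasets

    def record_run(self, job: str, inputs: list[str], outputs: list[str]) -> None:
        for out in outputs:
            self.producers[out] = job
            self.upstreams[out].update(inputs)

    def upstream_of(self, dataset: str) -> set[str]:
        """All transitive ancestors of a dataset (walk the family tree)."""
        seen, stack = set(), list(self.upstreams[dataset])
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(self.upstreams[d])
        return seen

tracker = DatasetTracker()
tracker.record_run("ingest_orders", ["pos_api"], ["raw_orders"])
tracker.record_run("build_pizza_report", ["raw_orders", "recipes"], ["pizza_report"])
print(tracker.upstream_of("pizza_report"))  # traces all the way back to pos_api
```

With this family tree in place, "which step caused the salty pizzas?" becomes a walk over `upstream_of` instead of a cross-team guessing game.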
  15. Fine-Grained Tracker
     Problem
     A mega pizza recipe combines more than 30 different ingredients: cheese, tomato, mushrooms, pepperoni, olives, and more.
     • We can track which oven produced it, but we cannot tell which ingredient came from where.
     • "Which ingredient came from which supplier?"
     • "Who is responsible for the cheese topping?"
     • "Which batch of tomatoes went into this slice?"
     Observability Solution
     • Fine-Grained Tracker = column- and row-level lineage.
     • Analyze query plans to map input columns to output columns.
     • Row-level: add lineage info (job_name, version, parent_lineage) into headers/metadata.
     Challenges
     • Hard for custom code (opaque logic).
     • Row-level lineage is not well visualized.
     • Must support evolution (transformations change over time).
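The row-level part of the slide (attaching job_name, version, parent_lineage to each output record's metadata) can be sketched as follows. The helper name, the `_lineage` key, and the example values are all illustrative assumptions:

```python
def with_lineage(record: dict, job_name: str, version: str,
                 parent_lineage: list[str]) -> dict:
    """Attach row-level lineage metadata so each output row can answer
    'which job, version, and parents produced me?'."""
    return {
        **record,
        "_lineage": {
            "job_name": job_name,
            "version": version,
            "parent_lineage": parent_lineage,
        },
    }

slice_ = with_lineage(
    {"pizza": "mega", "ingredient": "tomato"},
    job_name="assemble_mega_pizza",
    version="v3",
    parent_lineage=["tomato_batch_42", "supplier_acme"],
)
print(slice_["_lineage"]["parent_lineage"])  # which batch and supplier
```

The evolution challenge shows up here too: once `version` changes with every transformation update, downstream consumers must tolerate old and new lineage shapes side by side.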
  16. "Data lineage is the foundation for a new generation of powerful, context-aware data tools and best practices. OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used." https://openlineage.io/
  17. Data Quality Design Patterns
     • Quality Enforcement: Audit-Write-Audit-Publish, Constraints Enforcer
     • Schema Consistency: Schema Compatibility Enforcer, Schema Migrator
     • Quality Observation: Offline Observer, Online Observer
     (Konieczny, Bartosz. Data Engineering Design Patterns: Recipes for Solving the Most Common Data Engineering Problems. O'Reilly Media, 2025.)
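To make the Audit-Write-Audit-Publish and Constraints Enforcer patterns concrete, here is a minimal sketch. The constraint names, the list-based "staging" and "published" stores, and the order-record shape are illustrative assumptions, not the book's reference implementation:

```python
def audit(rows: list[dict], constraints: dict) -> list[dict]:
    """Constraints Enforcer audit: return the rows violating any constraint."""
    return [r for r in rows
            if not all(check(r) for check in constraints.values())]

def audit_write_audit_publish(rows, constraints, staging, published) -> bool:
    """Audit input, write to staging, audit staged data, publish only if clean."""
    if audit(rows, constraints):       # first audit: reject bad input early
        return False
    staging.extend(rows)               # write to the staging area only
    if audit(staging, constraints):    # second audit on the staged data
        return False
    published.extend(staging)          # publish only after both audits pass
    return True

constraints = {
    "has_order_id": lambda r: r.get("order_id") is not None,
    "valid_qty": lambda r: r.get("qty", 0) > 0,
}
staging, published = [], []
ok = audit_write_audit_publish([{"order_id": 1, "qty": 2}],
                               constraints, staging, published)
print(ok, published)  # True [{'order_id': 1, 'qty': 2}]
```

In practice the two audit steps are where tools like Great Expectations or dbt test plug in, with staging and published corresponding to real tables rather than lists.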
  18. Smart Pizza & AI Takeaway
     • 3 observability pillars: traces, logs, metrics
     • 5 data observability pillars
     • Observability design patterns
     • Context-aware & intelligent observability
     Let's make data observable & AI trustworthy.