
Design Foundational Data Engineering Observability

suci
September 06, 2025


Abstract

Data pipelines face unique challenges: they often fail silently or suffer unnoticed data quality degradation that impacts downstream analytics, ML models, and business decisions. Standard application monitoring falls short. Foundational data engineering observability requires a dedicated approach that monitors system health (logs, metrics, traces), data pipeline jobs, and data-centric viewpoints. This talk introduces the essential data pillars (Freshness, Volume, Distribution, Schema, and Lineage) and explores practical Python implementation approaches. I'll introduce foundational techniques using libraries like OpenTelemetry, data quality tools (e.g., Great Expectations, dbt test), and custom scripts/metrics to establish baseline monitoring. Building this solid foundation is the critical first step towards enabling the advanced data-driven insights and deep correlation associated with the next generation of observability for data pipelines.

In PyConTW 2025

Transcript

  1. About Me
     Find me on sciwork (member). Working in Smart Manufacturing & AI, with data and people.
     Interested in:
     • Agile/Engineering Culture/Developer Experience
     • Team Coaching
     • Data Engineering
  2. Agenda
     • Stories in Smart Pizza & AI
     • Common Data Engineering Challenges
     • From Monitoring to Observability
     • Observability and Data Observability
     • The Evolution of Data Observability
     • Data Observability Design Patterns
     • How to Do: Start with Basic
     • Takeaway + Q&A
  3. ETL/ELT
     Inspired by Joe Reis and Matt Housley, "Chapter 1," in Fundamentals of Data Engineering (O'Reilly Media, 2022).
  4. Simplistic Data Flow
     Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
     • Data movement as flow
     • Moving data content from A to B
  5. Data Engineering Challenges: How to Manage Pipelines Effectively
     • Complex data pipelines
     • Inability to centrally view pipelines:
       - Limited data asset discoverability
       - Error detection and root cause analysis
       - Scattered monitoring and time-consuming troubleshooting
       - Optimization and monitoring of workloads at scale
       - Inefficient pipelines that negatively …
  6. From Monitoring to Observability: Why Is Traditional Monitoring Not Enough?
     Monitoring: WHERE the issue is
     • Measures and reports specific metrics in a system.
     • Reactive: collects data to identify abnormal systems.
     • WHEN and WHAT did the system error occur?
     • Smart Pizza & AI example:
       ◦ Checking only if the pizza oven is on and the thermometer is working.
       ◦ Alerts when "Order Sync Task failed" or "Order DB CPU > 90%".
     Observability: WHY it happened
     • Collects metrics, events, logs, and traces across distributed systems.
     • Proactive: investigates root causes of abnormal systems.
     • WHY and HOW did the system error occur?
     • Smart Pizza & AI example:
       ◦ A smart oven that not only tracks temperature but also analyzes pizza color, dough rise, and past baking data to diagnose why it burned.
       ◦ The accuracy dropped because a schema change in an upstream API nullified the pizza_type field.
  7. Three Key Focus Areas of Data Observability
     Infrastructure
     • Focus: hardware & services running pipelines
     • Metrics: CPU, memory, disk, network
     • What we want to know: is the ML cluster (training AI models) overloaded?
     Pipeline
     • Focus: data transfer & processing flow
     • Metrics: task duration, success/failure rate, retries
     • What we want to know: did the daily ETL for orders finish on time?
     Data
     • Focus: data content, structure & quality
     • Metrics: freshness, volume, distribution, schema, lineage
     • What we want to know: are order fields valid? Any anomaly in new order volume?
  8. Data Observability across the data flow
     Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
     • Infrastructure: CPU, memory, disk, network
     • Pipeline: task duration, retries, success & failure rate
     • Data: freshness, volume, distribution, schema, lineage
  9. Data Observability Design Patterns
     • Data Detectors: Flow Interruption Detector, Skew Detector
     • Time Detectors: Lag Detector, SLA Misses Detector
     • Lineage Trackers: Dataset Tracker, Fine-Grained Tracker
     (Konieczny, Bartosz. Data Engineering Design Patterns: Recipes for Solving the Most Common Data Engineering Problems. O'Reilly Media, 2025.)
  10. Flow Interruption Detector (the Pizza Supply Alarm)
     Problem
     • The real-time order sync job ran fine for 7 months.
     • One day it processed input but didn't write output.
     • No error was triggered, so the issue was discovered only when a team reported missing data.
     Observability Solution
     Two types of data pipelines:
     • Continuous delivery: alert if no new orders arrive within 1 minute.
     • Irregular delivery: allow gaps, but alert if a gap exceeds a threshold.
     Other signals to watch across the pipeline:
     • Last job run time
     • Last data updated time
     • Metadata updated, but no data updated
     Challenges
     • Beware of false positives (thresholds, schema changes, compaction).
  11. Skew Detector
     Problem (unbalanced pizza orders): 90% of orders are Hawaiian, so the workload overloads one "partition."
     ◦ Processing skew (capacity bottlenecks)
     ◦ Inventory skew (imbalance)
     ◦ Customer skew (experience risk)
     ◦ Decision skew (business bias)
     Observability Solution
     1. Identify the comparison window.
     2. Set a tolerance threshold.
     3. Calculate skew, e.g. window-to-window comparison or STDDEV/AVG.
     Challenges
     • Seasonality (e.g. Mango Pizza in summer)
     • Communication (inf sync)
     • Fatality loop
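The STDDEV/AVG variant of step 3 is the coefficient of variation of per-partition counts. A minimal sketch, assuming per-partition record counts are available and using an illustrative tolerance of 0.5:

```python
from statistics import mean, stdev

def partition_skew(counts: list[int]) -> float:
    """STDDEV/AVG (coefficient of variation) of per-partition record
    counts; higher values mean a more skewed distribution."""
    avg = mean(counts)
    return stdev(counts) / avg if avg else 0.0

def is_skewed(counts: list[int], tolerance: float = 0.5) -> bool:
    """Step 2-3: compare the computed skew to a tolerance threshold."""
    return partition_skew(counts) > tolerance

# 90% of orders (Hawaiian) land on one partition -> skewed.
print(is_skewed([900, 30, 40, 30]))     # True
print(is_skewed([240, 260, 250, 250]))  # False
```

Seasonality is the hard part: a Mango Pizza spike in summer is legitimately skewed, so the tolerance (or the comparison window) usually has to be tuned per season.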
  12. Lag Detector: Monitoring Latency
     Problem
     • Last week, order volume suddenly increased by 30%, but ovens A and B were still busy with the previous batch. Their baking speed fell behind the pace of new incoming orders.
     • As a result, some customers (downstream consumers) started complaining about longer wait times and a poor dining experience.
     Observability Solution
     1. Define the lag unit (record offset, commit number, partition timestamp).
     2. Compare the last available unit with the last processed unit.
     3. Aggregate the results:
        ◦ MAX = worst-case lag
        ◦ P90/P95 = percentile lag view
     4. Beware the average trap: an average can hide real latency issues.
     Challenges
     • Data skew (usually from partitioning)
     • Mapping latency to business impact
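Steps 2-4 can be sketched with record offsets as the lag unit. This is an illustrative example (partition names and offsets are made up) showing why MAX and P95 matter: the average looks mild while one "oven" is far behind:

```python
def lag_report(last_available: dict[str, int],
               last_processed: dict[str, int]) -> dict[str, float]:
    """Per-partition lag = last available offset - last processed offset,
    aggregated as MAX / AVG / P95 so an average cannot hide a hot partition."""
    lags = sorted(last_available[p] - last_processed[p] for p in last_available)
    n = len(lags)
    p95 = lags[min(n - 1, int(0.95 * n))]  # simple nearest-rank percentile
    return {"max": max(lags), "avg": sum(lags) / n, "p95": p95}

report = lag_report(
    last_available={"p0": 1000, "p1": 1000, "p2": 1000},
    last_processed={"p0": 995, "p1": 990, "p2": 400},  # oven p2 fell behind
)
print(report)  # avg is ~205, but MAX exposes the 600-record lag on p2
```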
  13. SLA Misses Detector: Ensuring On-Time Data Delivery
     Problem
     • Unpredictable factors make it hard to consistently meet the 40-minute SLA, causing trust and dependency issues for downstream consumers.
     Observability Solution
     • Measure job execution time against the SLA threshold.
     • Batch job: track start → end time.
     • Streaming job: use microbatch/event windows, or the record-level read/write difference.
     • Alert when execution time exceeds the SLA.
     Challenges
     • Simple for batch, complex for streaming.
     • Late data and event time need to be handled separately.
     • An SLA miss is not always a lag issue (e.g., a skewed partition).
     • Both processing-time and event-time SLAs should be monitored.
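The batch case is the simplest to wire up: compare the tracked start → end time against the SLA. A minimal sketch using the slide's 40-minute SLA (the function name is illustrative):

```python
from datetime import datetime, timedelta

SLA = timedelta(minutes=40)  # the 40-minute delivery promise from the slide

def sla_missed(start: datetime, end: datetime,
               sla: timedelta = SLA) -> bool:
    """Batch case: True if the tracked start -> end execution time
    exceeded the SLA threshold (i.e. an alert should fire)."""
    return (end - start) > sla

start = datetime(2025, 9, 6, 18, 0)
print(sla_missed(start, start + timedelta(minutes=55)))  # True: SLA miss
print(sla_missed(start, start + timedelta(minutes=35)))  # False: on time
```

The streaming case needs more care, since a processing-time SLA (how long the job ran) and an event-time SLA (how old the data is when delivered) can diverge under late data.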
  14. Dataset Tracker
     Problem
     A big customer orders 100 pizzas for a company event. When the pizzas arrive:
     • Some are smaller than usual.
     • Some taste too salty.
     • A few have the wrong cheese on top.
     No one can clearly explain which step caused the failure.
     Observability Solution
     • Build a "family tree" of datasets (who provides, who consumes).
     • Make dependencies between datasets and teams visible.
     Two approaches:
     1. Managed services (e.g. Databricks Unity Catalog, GCP Dataplex): automatic lineage, but limited scope.
     2. Custom implementation: define inputs/outputs in pipelines or queries and connect them to lineage/metadata tools (OpenLineage, DataHub, OpenMetadata).
     Challenges
     • Vendor lock-in if only cloud-managed tools are used.
     • More custom work if pipelines contain unusual tasks.
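The "custom implementation" approach can be sketched as a toy in-memory dataset tracker. All class, job, and dataset names here are invented for illustration; a real setup would emit these input/output facts to OpenLineage, DataHub, or OpenMetadata instead of keeping them in dicts:

```python
from collections import defaultdict

class DatasetTracker:
    """Toy 'family tree' of datasets: record each job run's inputs and
    outputs, then answer who-feeds-whom questions."""

    def __init__(self) -> None:
        self.producers: dict[str, str] = {}          # dataset -> producing job
        self.upstreams = defaultdict(set)            # dataset -> input datasets

    def record_run(self, job: str, inputs: list[str], outputs: list[str]) -> None:
        for out in outputs:
            self.producers[out] = job
            self.upstreams[out].update(inputs)

    def upstream_of(self, dataset: str) -> set[str]:
        """All transitive ancestors of a dataset (walk the family tree)."""
        seen, stack = set(), list(self.upstreams[dataset])
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(self.upstreams[d])
        return seen

tracker = DatasetTracker()
tracker.record_run("ingest_orders", ["pos_api"], ["raw_orders"])
tracker.record_run("build_pizza_report", ["raw_orders", "recipes"], ["pizza_report"])
print(tracker.upstream_of("pizza_report"))  # traces all the way back to pos_api
```

With this family tree in place, "which step caused the salty pizzas?" becomes a walk over `upstream_of` instead of a cross-team guessing game.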
  15. Fine-Grained Tracker
     Problem
     A mega pizza recipe combines more than 30 different ingredients: cheese, tomato, mushrooms, pepperoni, olives, and more.
     • We can track which oven produced it, but we cannot tell which ingredient came from where.
     • "Which ingredient came from which supplier?"
     • "Who is responsible for the cheese topping?"
     • "Which batch of tomatoes went into this slice?"
     Observability Solution
     • Fine-Grained Tracker = column- and row-level lineage.
     • Analyze query plans to map input columns to output columns.
     • Row-level: add lineage info (job_name, version, parent_lineage) into headers/metadata.
     Challenges
     • Hard for custom code (opaque logic).
     • Row-level lineage is not well visualized.
     • Must support evolution (transformations change over time).
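The row-level part of the slide (attaching job_name, version, parent_lineage to each output record's metadata) can be sketched as follows. The helper name, the `_lineage` key, and the example values are all illustrative assumptions:

```python
def with_lineage(record: dict, job_name: str, version: str,
                 parent_lineage: list[str]) -> dict:
    """Attach row-level lineage metadata so each output row can answer
    'which job, version, and parents produced me?'."""
    return {
        **record,
        "_lineage": {
            "job_name": job_name,
            "version": version,
            "parent_lineage": parent_lineage,
        },
    }

slice_ = with_lineage(
    {"pizza": "mega", "ingredient": "tomato"},
    job_name="assemble_mega_pizza",
    version="v3",
    parent_lineage=["tomato_batch_42", "supplier_acme"],
)
print(slice_["_lineage"]["parent_lineage"])  # which batch and supplier
```

The evolution challenge shows up here too: once `version` changes with every transformation update, downstream consumers must tolerate old and new lineage shapes side by side.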
  16. "Data lineage is the foundation for a new generation of powerful, context-aware data tools and best practices. OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used." https://openlineage.io/
  17. Data Quality Design Patterns
     • Quality Enforcement: Audit-Write-Audit-Publish, Constraints Enforcer
     • Schema Consistency: Schema Compatibility Enforcer, Schema Migrator
     • Quality Observation: Offline Observer, Online Observer
     (Konieczny, Bartosz. Data Engineering Design Patterns: Recipes for Solving the Most Common Data Engineering Problems. O'Reilly Media, 2025.)
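To make the Audit-Write-Audit-Publish and Constraints Enforcer patterns concrete, here is a minimal sketch. The constraint names, the list-based "staging" and "published" stores, and the order-record shape are illustrative assumptions, not the book's reference implementation:

```python
def audit(rows: list[dict], constraints: dict) -> list[dict]:
    """Constraints Enforcer audit: return the rows violating any constraint."""
    return [r for r in rows
            if not all(check(r) for check in constraints.values())]

def audit_write_audit_publish(rows, constraints, staging, published) -> bool:
    """Audit input, write to staging, audit staged data, publish only if clean."""
    if audit(rows, constraints):       # first audit: reject bad input early
        return False
    staging.extend(rows)               # write to the staging area only
    if audit(staging, constraints):    # second audit on the staged data
        return False
    published.extend(staging)          # publish only after both audits pass
    return True

constraints = {
    "has_order_id": lambda r: r.get("order_id") is not None,
    "valid_qty": lambda r: r.get("qty", 0) > 0,
}
staging, published = [], []
ok = audit_write_audit_publish([{"order_id": 1, "qty": 2}],
                               constraints, staging, published)
print(ok, published)  # True [{'order_id': 1, 'qty': 2}]
```

In practice the two audit steps are where tools like Great Expectations or dbt test plug in, with staging and published corresponding to real tables rather than lists.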
  18. Smart Pizza & AI Takeaway
     • 3 observability pillars: traces, logs, metrics
     • 5 data observability pillars
     • Observability design patterns
     • Context-aware & intelligent observability
     Let's make data observable & AI trustworthy.