Observability 🔎 in Data Ingestion📦

Observability🔎 in data ingestion📦 Nancy Chauhan @_nancychauhan

Why attend this talk? 🤔
• Data is an essential part of every organisation 🛠
• Data ingestion, transformation and storage of huge data sets 🔥🔥🔥
• Observability gives greater control over the system 🧐
• Let’s talk about observability in data ingestion 🔎

Nancy Chauhan @_nancychauhan
Developer and Tech Writer
🍊✏ Currently contributing at Gitpod
🛠 Previously worked at Zeotap and Grofers, developing solutions for software reliability
🧡 I love open source
✏ I love writing about tech at nancychauhan.in

What is data ingestion📦

🛠 Process of preparing data to be stored in a clean production environment 📦
Focus 🎯 Get data into any system that requires it in a particular structure/format, for operational use of the data downstream
It addresses the need to process huge amounts of data 🚀
The main use case of data ingestion is business analytics 📈📊

data ingestion

What is observability🔎

Observability answers the question: why is X broken?

Four Pillars of Observability Engineering team’s charter:
• Monitoring 📈
• Alerting/visualization ⏰
• Distributed systems tracing infrastructure 🔎
• Log aggregation/analytics ⚙
(Source: Twitter’s tech blog)

Observability of data pipelines helps determine completeness, accuracy, efficiency and
consistency for any data uploaded through either batch or streaming sources.

Why is observability important in data ingestion?
• Large volumes of data cannot always be 100% error-free. 🐛
• All plans are futile if unreliable data is ingested, transformed and pushed downstream.
• If there is a failure, your data pipeline should handle it gracefully and log it so that we can take action. ⚙

Why is observability important in data ingestion?
• Data pipelines are somewhat interconnected and non-intuitive ⛓
• Internal and external data can become faulty, inconsistent, inaccurate or missing, or change abruptly, and eventually affect the correctness of other dependent data assets.

Why is observability important in data ingestion?
• Delayed ingestion means delayed business decisions -> impact on revenue 💵
• If data is processed incorrectly, you can make the wrong decisions.

“Data Observability space will be the key for corporate transformation
to a data-based approach. Hence, Data Observability is expected to see massive growth.”

This is to say that without visibility into data pipelines
and infrastructures, data and analytics teams would be merely flying blind (i.e., they can’t fully understand the health of the pipeline and/or understand what’s happening between data inputs and outputs).

How can we achieve it?
• Freshness: Did all the data arrive, and is it up to date?
• Volume: Are the data tables complete and correct?
• Distribution: Is the data reliable? Do data values fall within an acceptable range?
• Lineage: Who is generating the data? Who will use the data for making business decisions?
• Schema: Is the data in the correct format? Did the data schema change? Who made the changes? How can we correct it?
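As a rough illustration, four of these questions (freshness, volume, distribution, schema) can be turned into small checks. This is only a sketch: the function names, thresholds and the dict-based record shape are assumptions made for the example, not something from the talk. Lineage is left out because it usually needs metadata about producers and consumers rather than a per-batch check.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_event_time, max_lag, now=None):
    """Freshness: the newest record must be no older than max_lag."""
    now = now or datetime.now(timezone.utc)
    return now - latest_event_time <= max_lag


def check_volume(row_count, expected_min, expected_max):
    """Volume: the row count must fall inside the expected band."""
    return expected_min <= row_count <= expected_max


def check_distribution(values, low, high, max_outlier_ratio=0.0):
    """Distribution: at most max_outlier_ratio of values may lie outside [low, high]."""
    if not values:
        return False
    outliers = sum(1 for v in values if not (low <= v <= high))
    return outliers / len(values) <= max_outlier_ratio


def check_schema(record, expected_schema):
    """Schema: same field names, and each value has the expected Python type."""
    if set(record) != set(expected_schema):
        return False
    return all(isinstance(record[name], typ) for name, typ in expected_schema.items())
```

For example, `check_schema({"id": "1"}, {"id": int})` fails because an integer column received a string.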

How can we achieve it?
Build a custom solution 🛠
-> Capture errors 🐛, build a control system and a single access point/dashboard that shows all failures and the % of errors.
-> Create transparency in data pipelines ✨
-> Enable alerting: take quick action
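A minimal sketch of the "capture errors, then alert" idea, assuming a per-record `process` function and an `alert` callback (both hypothetical names; in practice the callback might post to a chat channel or a paging tool). Failed records are logged and collected instead of stopping the run, and an alert fires once the error percentage crosses a threshold.

```python
import logging

logger = logging.getLogger("ingestion")


def ingest(records, process, alert, max_error_pct=5.0):
    """Process every record, capture failures, and alert if too many fail."""
    errors = []
    total = 0
    for record in records:
        total += 1
        try:
            process(record)
        except Exception as exc:  # keep going; one bad record must not stop the run
            logger.warning("record failed: %r (%s)", record, exc)
            errors.append({"record": record, "error": str(exc)})

    error_pct = 100.0 * len(errors) / total if total else 0.0
    if error_pct > max_error_pct:
        alert(f"{len(errors)} of {total} records failed ({error_pct:.1f}%)")
    return errors, error_pct
```

The returned `errors` list is what a dashboard would read from; the 5% default threshold is an arbitrary placeholder.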

Metrics
• Volume of data processed per single run of the pipeline
• Total number of records processed
• Total number of failures
• % of failures
• Duration of the pipeline
• Count of unique data (deduplication)
• Data-related errors (error message -> getting an integer in a string column)
• Data ingestion rate and efficiency
• Alerting and visualisation
• Error rate
• Error messages
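Several of these metrics can be derived from a single list of per-record outcomes. The sketch below assumes each outcome is a `(record_id, error_message)` pair, with `None` as the message for a success; that shape and the `RunMetrics` name are illustrative choices, not part of the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunMetrics:
    total_records: int
    failures: int
    failure_pct: float
    unique_records: int        # distinct ids among successful records (deduplication)
    records_per_second: float  # ingestion rate over the whole run
    error_messages: Counter    # how often each error message occurred


def summarize_run(outcomes, duration_seconds):
    """Aggregate per-record outcomes of one pipeline run into summary metrics."""
    outcomes = list(outcomes)
    total = len(outcomes)
    errors = Counter(msg for _, msg in outcomes if msg is not None)
    failures = sum(errors.values())
    unique = len({record_id for record_id, msg in outcomes if msg is None})
    return RunMetrics(
        total_records=total,
        failures=failures,
        failure_pct=100.0 * failures / total if total else 0.0,
        unique_records=unique,
        records_per_second=total / duration_seconds if duration_seconds > 0 else 0.0,
        error_messages=errors,
    )
```

Grouping error messages with a `Counter` makes the most frequent data errors easy to surface on a dashboard.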

Improve observability using Stackdriver metrics programmatically 💻

Calculate the total number of records for a streaming source
• Use case: Get the count of messages published to a Pub/Sub topic for streaming sources, to calculate ingestion efficiency %
• One of the quickest ways I found was using Stackdriver. 🚀
• Demo 💻 https://github.com/Nancy-Chauhan/stackdriver-example
• Here is the blog link: https://nancy-chauhan.medium.com/improve-observability-using-stackdriver-metrics-programmatically-a29bfd7051e0
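A hedged sketch of this use case, not the code from the demo repository. It assumes the `google-cloud-monitoring` Python client (Stackdriver is now Google Cloud Monitoring) and the Pub/Sub metric `pubsub.googleapis.com/topic/send_message_operation_count`; the function names, the one-hour window and the definition of efficiency as ingested divided by published are assumptions for illustration.

```python
import time

# Assumption: this Pub/Sub metric is used as the "published messages" count.
PUBLISH_COUNT_METRIC = "pubsub.googleapis.com/topic/send_message_operation_count"


def build_filter(topic_id):
    """Monitoring filter selecting the publish-count metric for one topic."""
    return (
        f'metric.type = "{PUBLISH_COUNT_METRIC}" '
        f'AND resource.labels.topic_id = "{topic_id}"'
    )


def count_published_messages(project_id, topic_id, window_seconds=3600):
    """Sum the publish counts reported for the topic over the last window."""
    # Imported lazily so the helpers in this file work without the client installed.
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - window_seconds}}
    )
    series = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": build_filter(topic_id),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    # Sums every returned series, so failed publish operations are included
    # unless the filter is narrowed further.
    return sum(point.value.int64_value for ts in series for point in ts.points)


def ingestion_efficiency_pct(ingested, published):
    """Share of published messages that ended up ingested, as a percentage."""
    if published == 0:
        return 0.0
    return 100.0 * ingested / published
```

With a published count from the metric and an ingested count from the pipeline's own run metrics, `ingestion_efficiency_pct(950, 1000)` gives 95.0.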

Thank you
