Observability 🔎 in Data Ingestion📦

Nancy Chauhan

June 12, 2022

Transcript

  1. Why attend this talk? 🤔
     • Data is an essential part of every organisation 🛠
     • Data ingestion, transformation and storage of huge data sets 🔥🔥🔥
     • Observability gives greater control over the system 🧐
     • Let’s talk about observability in data ingestion 🔎
  2. Nancy Chauhan (@_nancychauhan)
     Developer and Tech Writer
     🍊✏ Currently contributing at Gitpod
     🛠 Previously worked at Zeotap and Grofers, developing solutions for software reliability.
     🧡 I love open source
     ✏ I love writing about tech at nancychauhan.in
  3. 🛠 The process of preparing data to be stored in a clean production environment 📦
     Focus 🎯: get data into any system that requires it in a particular structure or format for operational use downstream.
     It addresses the need to process huge amounts of data 🚀
     The main use case of data ingestion is business analytics 📈📊
  4. Four pillars of the Observability Engineering team’s charter:
     • Monitoring 📈
     • Alerting/visualization ⏰
     • Distributed systems tracing infrastructure 🔎
     • Log aggregation/analytics ⚙
     (Source: Twitter’s tech blog)
  5. Observability of data pipelines helps determine the completeness, accuracy, efficiency and consistency of any data uploaded through either batch or streaming sources.
  6. Why is observability important in data ingestion?
     • Large volumes of data cannot always be 100% error-free 🐛
     • All plans are futile if unreliable data is ingested, transformed and pushed downstream.
     • If there is a failure, your data pipeline should handle it gracefully and log it so that we can take action ⚙
  7. Why is observability important in data ingestion?
     • Data pipelines are somewhat interconnected and non-intuitive ⛓
     • Internal and external data can become faulty, inconsistent, inaccurate or missing, or can change abruptly, eventually affecting the correctness of other dependent data assets.
  8. Why is observability important in data ingestion?
     • Delayed ingestion means delayed business decisions -> impacts revenue 💵
     • If data is processed incorrectly, you can make the wrong decisions.
  9. “Data Observability space will be the key for corporate transformation to a data-based approach. Hence, Data Observability is expected to see massive growth.”
  10. This is to say that without visibility into data pipelines and infrastructures, data and analytics teams would be merely flying blind (i.e., they can’t fully understand the health of the pipeline and/or understand what’s happening between data inputs and outputs).
  11. How can we achieve it?
      Freshness: Did all the data arrive, and is it up to date?
      Volume: Are the data tables complete and correct?
      Distribution: Is the data reliable? Do data values fall within an acceptable range?
      Lineage: Who is generating the data? Who will use it for making business decisions?
      Schema: Is the data in the correct format? Did the data schema change? Who made the changes? How can we correct it?
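
The freshness, volume, distribution and schema questions above can each be turned into a small automated check. The following is only a minimal sketch under stated assumptions: the table layout, column names, thresholds and function names are all hypothetical, and lineage is left out because it needs metadata about producers and consumers rather than the rows themselves.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expectations for one table; every name and threshold is illustrative.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "created_at": datetime}


def check_freshness(rows, now, max_age=timedelta(hours=1)):
    """Freshness: the newest record must be no older than max_age."""
    if not rows:
        return False
    newest = max(row["created_at"] for row in rows)
    return now - newest <= max_age


def check_volume(rows, expected_min):
    """Volume: the batch should contain at least the expected number of rows."""
    return len(rows) >= expected_min


def check_distribution(rows, column, low, high):
    """Distribution: every value in the column must fall within [low, high]."""
    return all(low <= row[column] <= high for row in rows)


def check_schema(rows, schema=EXPECTED_SCHEMA):
    """Schema: each row has exactly the expected columns with the expected types."""
    return all(
        row.keys() == schema.keys()
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in rows
    )


# Usage with two made-up rows; the second amount is deliberately out of range.
now = datetime(2022, 6, 12, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "amount": 9.5, "created_at": now - timedelta(minutes=5)},
    {"user_id": 2, "amount": 120.0, "created_at": now - timedelta(minutes=30)},
]
report = {
    "freshness": check_freshness(rows, now),
    "volume": check_volume(rows, expected_min=2),
    "distribution": check_distribution(rows, "amount", low=0.0, high=100.0),
    "schema": check_schema(rows),
}
print(report)
```

Returning plain booleans keeps the checks easy to feed into whatever dashboard or alerting layer is already in place.
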
  12. How can we achieve it?
      Build a custom solution 🛠
      -> Capture errors 🐛, build a control system and a single access point/dashboard that shows all failures and the % of errors.
      -> Create transparency in data pipelines ✨
      -> Enable alerting: take quick action
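
One possible shape for such a custom solution is a small tracker that captures every failure, exposes the % of errors and decides when to alert. This is a sketch, not the talk's implementation: `ErrorTracker`, `parse_amount` and the 5% threshold are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ErrorTracker:
    """Collects per-record failures so a dashboard or alert can read them."""

    processed: int = 0
    errors: list = field(default_factory=list)

    def run(self, records, handler):
        for record in records:
            self.processed += 1
            try:
                handler(record)
            except Exception as exc:
                # Capture the error instead of letting one bad record stop the run.
                self.errors.append((record, f"{type(exc).__name__}: {exc}"))

    def error_percent(self):
        if self.processed == 0:
            return 0.0
        return 100.0 * len(self.errors) / self.processed

    def summary(self):
        """Failure counts grouped by exception type, e.g. for a dashboard."""
        return Counter(message.split(":")[0] for _, message in self.errors)

    def should_alert(self, threshold_percent=5.0):
        """Alert once the share of failed records exceeds the threshold."""
        return self.error_percent() > threshold_percent


def parse_amount(record):
    # Hypothetical handler: raises on a missing or non-numeric amount.
    return float(record["amount"])


tracker = ErrorTracker()
tracker.run([{"amount": "10"}, {"amount": "abc"}, {"amount": "3.5"}, {}], parse_amount)
print(tracker.error_percent(), dict(tracker.summary()), tracker.should_alert())
```

In a real pipeline `should_alert` would trigger a notification rather than return a flag; that wiring depends on the alerting tool in use.
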
  13. Metrics
      • Volume of data processed per single run of the pipeline
      • Total no. of records processed
      • Total no. of failures
      • % of failures
      • Duration of pipeline
      • Count of unique data (deduplication)
      • Data-related errors (error message -> e.g. getting an integer in a string column)
      • Data ingestion rate and efficiency
      • Alerting and visualisation
      • Error rate
      • Error messages
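
Several of the metrics above (records processed, failures, % of failures, duration, unique count, ingestion rate) can be gathered in a single pass over a run. The sketch below assumes an in-memory batch and hypothetical `transform` and `key` callables; a production pipeline would emit these values to its metrics backend instead of returning them.

```python
import time
from dataclasses import dataclass


@dataclass
class RunMetrics:
    records_processed: int
    failures: int
    unique_records: int
    duration_seconds: float

    @property
    def failure_percent(self):
        if self.records_processed == 0:
            return 0.0
        return 100.0 * self.failures / self.records_processed

    @property
    def records_per_second(self):
        """Ingestion rate; 0.0 if the run was too short to measure."""
        if self.duration_seconds <= 0:
            return 0.0
        return self.records_processed / self.duration_seconds


def run_pipeline(records, transform, key):
    """Apply transform to each record and collect metrics for this run."""
    start = time.perf_counter()
    failures = 0
    seen = set()
    for record in records:
        try:
            transform(record)
            seen.add(key(record))  # only successfully processed records count as unique
        except Exception:
            failures += 1
    duration = time.perf_counter() - start
    return RunMetrics(len(records), failures, len(seen), duration)


# Usage with made-up records: one duplicate id and one value that fails to parse.
records = [
    {"id": 1, "age": "30"},
    {"id": 1, "age": "30"},
    {"id": 2, "age": "x"},
    {"id": 3, "age": "41"},
]
metrics = run_pipeline(records, transform=lambda r: int(r["age"]), key=lambda r: r["id"])
print(metrics.records_processed, metrics.failures, metrics.unique_records, metrics.failure_percent)
```
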
  14. Calculate the total no. of records for a streaming source
      • Use case: get the count of messages published to a Pub/Sub topic for streaming sources, to calculate ingestion efficiency %
      • One of the quickest ways I found was using Stackdriver. 🚀
      • Demo 💻 https://github.com/Nancy-Chauhan/stackdriver-example
      • Here is the blog link: https://nancy-chauhan.medium.com/improve-observability-using-stackdriver-metrics-programmatically-a29bfd7051e0
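
Once the published-message count is available, the efficiency figure itself is simple arithmetic. The sketch below assumes efficiency means ingested records as a percentage of published messages, which the slides do not spell out; the counts are made-up numbers passed in as plain integers, and no monitoring API is called here (the linked demo covers fetching the count).

```python
def ingestion_efficiency(published_count, ingested_count):
    """Percentage of published messages that reached the destination.

    Assumed definition: 100 * ingested / published.
    """
    if published_count < 0 or ingested_count < 0:
        raise ValueError("counts must be non-negative")
    if published_count == 0:
        return 100.0  # nothing was published, so nothing was lost
    return 100.0 * ingested_count / published_count


# Hypothetical counts: published would come from the topic's metrics,
# ingested from the pipeline's own record counter.
published = 1_000
ingested = 985
efficiency = ingestion_efficiency(published, ingested)
print(f"Ingestion efficiency: {efficiency:.1f}%")
```
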