Monitoring vs. Debugging

Prathamesh Sonpatki

October 01, 2023

Transcript

  3. Why this matters today?
    • Workloads have changed
    • Infra is cattle - ephemeral
    • Services are dynamic
    • Push to Cloud
    A 3-node cluster running 10 namespaces with 5 deployments, a replica set of ~3-5, and 10 config maps emits a whopping 16566 time series per minute using the popular kube-state-metrics library.
  4. Why this matters today?
    • Volume
    • Velocity
    • Variety
    • Complexity
    • C.O.S.T.: Cardinality, Operations, Scale, Toil
  6. Outcomes we want
    • To not have downtimes
    • To mitigate problems quickly
    • To debug a failure
    • To know how the system is behaving in real time
    • To correlate an outage to a hardware failure
    • To find anomalies and patterns
    • To trace a payment failure
    • To find unknown failures before they happen
    • To avoid hampering customer experience and to limit business impact
  7. Questions we ask
    • What is wrong?
    • Did we change anything?
    • What do we do so this doesn’t repeat?
  8. Answers we want
    • Know
    • Communicate
    • Recover
    • Analyse
    • Debug
    • Root cause
  9. Answers we want
    • System Health
    • Quick Decisions
    • Time
    • Root Cause
    • Testing
    • Correctness
  10. Logs
    • Getting Started ✅
    • Adoption ✅
    • Debugging ✅
    • Relationships 🥲
    • Volume 🥲
    • Standardisation 🥲
    • Health 🥲
    • System insights 🥲
  11. Metrics
    • Getting Started 😐
    • Adoption ✅
    • Debugging 🥲
    • Relationships 🥲
    • Volume ✅
    • Standardisation ✅
    • Health ✅
    • System insights ✅
  12. Traces
    • Getting Started 😐
    • Adoption 🥲
    • Debugging ✅
    • Relationships ✅
    • Volume 😐
    • Standardisation ✅
    • Health 🥲
    • System insights 🥲
  13. Events
    • Getting Started 😐
    • Adoption 🥲
    • Debugging ✅
    • Relationships 🥲
    • Volume ✅
    • Standardisation 🥲
    • Health ✅
    • System insights ✅
  14. Answers we want (Real Time / Post Facto)
    • Know
    • Communicate
    • Recover
    • Analyse
    • Debug
    • Root cause
  15. Control Levers
    • Treat workloads differently
    • Tiers
    • Policies
    • Declarative Observability
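
The cardinality example on the "Why this matters today?" slide (a 3-node cluster, 10 namespaces, 5 deployments, ~3-5 replicas, 10 config maps) can be made concrete with some rough arithmetic. The sketch below only counts Kubernetes objects and multiplies by a per-object series count. It assumes the deployments and config maps are per namespace and that each deployment has one active ReplicaSet, none of which the slide states. The `series_per_object` values are hypothetical placeholders, not figures measured from kube-state-metrics, so the result is not expected to reproduce the 16566 quoted on the slide.

```python
# Rough object-count arithmetic for the cluster described on the slide.
NODES = 3
NAMESPACES = 10
DEPLOYMENTS_PER_NAMESPACE = 5   # assumption: "5 deployments" means per namespace
CONFIG_MAPS_PER_NAMESPACE = 10  # assumption: "10 config maps" means per namespace


def object_counts(replicas):
    """Count objects by kind for a given replica count per deployment."""
    deployments = NAMESPACES * DEPLOYMENTS_PER_NAMESPACE
    return {
        "node": NODES,
        "namespace": NAMESPACES,
        "deployment": deployments,
        "replicaset": deployments,  # assumption: one active ReplicaSet each
        "pod": deployments * replicas,
        "configmap": NAMESPACES * CONFIG_MAPS_PER_NAMESPACE,
    }


def estimate_series(counts, series_per_object):
    """Sum objects * series-per-object over every kind; unknown kinds count as 0."""
    return sum(n * series_per_object.get(kind, 0) for kind, n in counts.items())


if __name__ == "__main__":
    # Hypothetical per-object series counts, for illustration only.
    series_per_object = {"pod": 30, "deployment": 15, "replicaset": 10,
                         "configmap": 3, "namespace": 5, "node": 20}
    for replicas in (3, 5):
        counts = object_counts(replicas)
        print(replicas, sum(counts.values()),
              estimate_series(counts, series_per_object))
```

Under these assumptions the cluster holds 363 objects at 3 replicas and 463 at 5, with pods making up the largest share. That is why ephemeral, frequently rescheduled workloads tend to drive series counts up.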