Monitoring vs. Debugging

Prathamesh Sonpatki

October 01, 2023

Transcript

  1. QCon SF 2023
    Monitoring vs. Debugging
    Prathamesh Sonpatki, last9.io

  2. Pillars of Observability
    Monitoring vs. Debugging

  3. Why this matters today?
    • Workloads have changed

    • Infra is cattle - ephemeral

    • Services are dynamic

    • Push to Cloud
    A 3-node cluster
    running 10 namespaces
    with 5 deployments
    with a replica set of ~3-5
    with 10 config maps
    emits a whopping 16,566 time series per minute
    using the popular kube-state-metrics library
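
To illustrate why a small cluster can emit so many series, here is a minimal Python sketch of how the series count multiplies with the number of objects. Everything in it is an assumption made for illustration: the `Cluster` fields, the reading of the slide as "per namespace" counts, and above all the `SERIES_PER_OBJECT` figures, which are placeholders and not the real per-object output of kube-state-metrics. It is not meant to reproduce the 16,566 figure from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a cluster, loosely following the slide."""
    nodes: int
    namespaces: int
    deployments_per_namespace: int
    replicas_per_deployment: int
    config_maps_per_namespace: int


# ASSUMPTION: placeholder series-per-object figures, chosen only to show
# the multiplication. They are not measured kube-state-metrics values.
SERIES_PER_OBJECT = {
    "node": 20,
    "namespace": 5,
    "deployment": 15,
    "pod": 40,
    "config_map": 3,
}


def object_counts(cluster: Cluster) -> dict:
    """Count the objects of each kind that would be reported on."""
    deployments = cluster.namespaces * cluster.deployments_per_namespace
    return {
        "node": cluster.nodes,
        "namespace": cluster.namespaces,
        "deployment": deployments,
        "pod": deployments * cluster.replicas_per_deployment,
        "config_map": cluster.namespaces * cluster.config_maps_per_namespace,
    }


def estimate_series(cluster: Cluster) -> int:
    """Estimate active series as sum(objects of a kind * series per object)."""
    counts = object_counts(cluster)
    return sum(counts[kind] * SERIES_PER_OBJECT[kind] for kind in counts)


if __name__ == "__main__":
    # Compare the low and high end of the "~3-5 replicas" range.
    for replicas in (3, 5):
        cluster = Cluster(3, 10, 5, replicas, 10)
        print(replicas, "replicas:", estimate_series(cluster), "series")
```

The point of the sketch is the shape of the growth, not the numbers: pods dominate the total because their count is a product of namespaces, deployments, and replicas, so a small change in replicas moves the estimate far more than adding a node does.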

  4. Why this matters today?
    • Volume

    • Velocity

    • Variety

    • Complexity

  5. Why this matters today?
    • Volume

    • Velocity

    • Variety

    • Complexity

    • C.O.S.T.
    - Cardinality

    - Operations

    - Scale

    - Toil

  6. Outcomes we want
    • To not have downtimes
    • To mitigate problems quickly
    • To debug a failure
    • To know how the system is behaving in real time
    • To correlate an outage to a hardware failure
    • To find anomalies and patterns
    • To trace a payment failure
    • To find out unknown failures before they happen
    • To avoid hurting customer experience and the business

  7. Questions we ask
    • What is wrong?

    • Did we change anything?

    • What do we do so this doesn’t repeat?

  8. Answers we want
    • Know

    • Communicate

    • Recover
    • Analyse

    • Debug

    • Root cause

  9. Answers we want
    • System Health

    • Quick Decisions

    • Time
    • Root Cause

    • Testing

    • Correctness

  10. Logs
    • Getting Started ✅
    • Adoption ✅
    • Debugging ✅
    • Relationships 🥲
    • Volume 🥲
    • Standardisation 🥲
    • Health 🥲
    • System insights 🥲

  11. Metrics
    • Getting Started 😐
    • Adoption ✅
    • Debugging 🥲
    • Relationships 🥲
    • Volume ✅
    • Standardisation ✅
    • Health ✅
    • System insights ✅

  12. Traces
    • Getting Started 😐
    • Adoption 🥲
    • Debugging ✅
    • Relationships ✅
    • Volume 😐
    • Standardisation ✅
    • Health 🥲
    • System insights 🥲

  13. Events
    • Getting Started 😐
    • Adoption 🥲
    • Debugging ✅
    • Relationships 🥲
    • Volume ✅
    • Standardisation 🥲
    • Health ✅
    • System insights ✅

  14. Answers we want
    Real Time
    • Know
    • Communicate
    • Recover
    Post Facto
    • Analyse
    • Debug
    • Root cause

  15. For high-precision monitoring,
    you need high-precision control levers


  16. Control Levers
    • Treat workloads differently

    • Tiers

    • Policies

    • Declarative Observability
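
One way to picture these levers together is a small declarative table that maps workloads to tiers and tiers to policies. The sketch below is only an assumption about what that could look like: the tier names, the `TierPolicy` fields, the workload names, and the helper functions are all hypothetical and do not come from the talk or from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical policy attached to a tier of workloads."""
    retention_days: int
    scrape_interval_s: int
    dropped_labels: frozenset = frozenset()


# ASSUMPTION: illustrative tiers; the values are placeholders.
POLICIES = {
    "critical": TierPolicy(retention_days=90, scrape_interval_s=15),
    "standard": TierPolicy(30, 60, frozenset({"pod"})),
    "best_effort": TierPolicy(7, 300, frozenset({"pod", "instance"})),
}

# ASSUMPTION: illustrative workload names, declared once in one place.
WORKLOAD_TIERS = {
    "payments": "critical",
    "search": "standard",
    "batch-reports": "best_effort",
}


def policy_for(workload: str, default_tier: str = "standard") -> TierPolicy:
    """Look up the policy for a workload, falling back to a default tier."""
    return POLICIES[WORKLOAD_TIERS.get(workload, default_tier)]


def apply_policy(workload: str, labels: dict) -> dict:
    """Drop the labels that the workload's tier says are not worth keeping."""
    policy = policy_for(workload)
    return {k: v for k, v in labels.items() if k not in policy.dropped_labels}


if __name__ == "__main__":
    sample = {"service": "batch-reports", "pod": "p-1", "instance": "10.0.0.1"}
    print(apply_policy("batch-reports", sample))
    print(policy_for("payments"))
```

Keeping the mapping as data rather than scattering it through code is what makes the approach declarative: changing how a workload is treated means editing one entry, and the high-cardinality labels of low-tier workloads are dropped without touching the critical ones.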

  17. last9.io/blog
    Join the Discord Community
    Thank You!