Slide 1

Slide 1 text

Monitoring vs. Debugging
Prathamesh Sonpatki, last9.io
QCon SF 2023

Slide 2

Slide 2 text


Slide 3

Slide 3 text

M.E.L.T.: Metrics, Events, Logs, Traces

Slide 4

Slide 4 text


Slide 5

Slide 5 text

Pillars of Observability
Monitoring vs. Debugging

Slide 6

Slide 6 text

Why this matters today?
• Workloads have changed
• Infra is cattle: ephemeral
• Services are dynamic
• Push to Cloud

A 3-node cluster running 10 namespaces with 5 deployments, each with a replica set of ~3-5 pods, plus 10 config maps, emits a whopping 16,566 time series per minute using the popular kube-state-metrics service.
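
The growth behind a figure like this is multiplicative: every object kind exposes several series, and object counts multiply with namespaces and replicas. The sketch below is a rough illustration only. The per-object series counts in `SERIES_PER_OBJECT` are hypothetical placeholders, not the real numbers emitted by kube-state-metrics, and it assumes the deployment and config map counts are per namespace, so it does not reproduce the 16,566 figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int
    namespaces: int
    deployments_per_namespace: int  # assumption: the slide's counts are per namespace
    replicas_per_deployment: int
    config_maps_per_namespace: int


# Hypothetical number of time series exposed per object of each kind.
# These are illustrative placeholders, NOT kube-state-metrics' real counts.
SERIES_PER_OBJECT = {
    "node": 20,
    "namespace": 5,
    "deployment": 15,
    "pod": 40,
    "config_map": 3,
}


def object_counts(cluster: Cluster) -> dict[str, int]:
    """Count the objects of each kind implied by the cluster shape."""
    deployments = cluster.namespaces * cluster.deployments_per_namespace
    return {
        "node": cluster.nodes,
        "namespace": cluster.namespaces,
        "deployment": deployments,
        "pod": deployments * cluster.replicas_per_deployment,
        "config_map": cluster.namespaces * cluster.config_maps_per_namespace,
    }


def estimate_series(cluster: Cluster) -> int:
    """Sum (objects of a kind) x (series per object) over all kinds."""
    counts = object_counts(cluster)
    return sum(counts[kind] * SERIES_PER_OBJECT[kind] for kind in counts)


if __name__ == "__main__":
    small = Cluster(nodes=3, namespaces=10, deployments_per_namespace=5,
                    replicas_per_deployment=4, config_maps_per_namespace=10)
    print(estimate_series(small))
```

Even with placeholder numbers, the pod term dominates: doubling replicas or namespaces roughly doubles the total, which is why a small cluster can still produce thousands of series.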

Slide 7

Slide 7 text

Why this matters today?
• Volume
• Velocity
• Variety
• Complexity

Slide 8

Slide 8 text

Why this matters today?
• Volume
• Velocity
• Variety
• Complexity
• C.O.S.T.
  - Cardinality
  - Operations
  - Scale
  - Toil

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Outcomes we want
• To not have downtimes
• To mitigate problems quickly
• To debug a failure
• To know how the system is behaving in real time
• To correlate an outage to a hardware failure
• To find anomalies and patterns
• To trace a payment failure
• To find unknown failures before they happen
• To prevent degraded customer experience and business impact

Slide 11

Slide 11 text

Questions we ask
• What is wrong?
• Did we change anything?
• What do we do so this doesn’t repeat?

Slide 12

Slide 12 text

Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause

Slide 13

Slide 13 text

Answers we want
• System Health
• Quick Decisions
• Time
• Root Cause
• Testing
• Correctness

Slide 14

Slide 14 text

Logs
• Getting Started ✅
• Adoption ✅
• Debugging ✅
• Relationships 🥲
• Volume 🥲
• Standardisation 🥲
• Health 🥲
• System insights 🥲

Slide 15

Slide 15 text

Metrics
• Getting Started 😐
• Adoption ✅
• Debugging 🥲
• Relationships 🥲
• Volume ✅
• Standardisation ✅
• Health ✅
• System insights ✅

Slide 16

Slide 16 text

Traces
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships ✅
• Volume 😐
• Standardisation ✅
• Health 🥲
• System insights 🥲

Slide 17

Slide 17 text

Events
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships 🥲
• Volume ✅
• Standardisation 🥲
• Health ✅
• System insights ✅

Slide 18

Slide 18 text

Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause

Real Time / Post Facto

Slide 19

Slide 19 text

For high-precision monitoring, you need high-precision control levers

Slide 20

Slide 20 text

Control Levers
• Treat workloads differently
• Tiers
• Policies
• Declarative Observability
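
One way to picture these levers together is a small declarative mapping: workloads are assigned to tiers, and each tier carries a policy. The sketch below is a minimal illustration under stated assumptions. The tier names, the `Policy` fields, the workload names, and the numeric values are all hypothetical and do not come from the talk or from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical observability policy attached to a tier."""
    scrape_interval_s: int   # how often metrics are collected
    retention_days: int      # how long telemetry is kept
    page_on_alert: bool      # whether alerts page a human


# Declarative part: tiers and their policies (illustrative values only).
TIERS = {
    "critical": Policy(scrape_interval_s=15, retention_days=90, page_on_alert=True),
    "standard": Policy(scrape_interval_s=60, retention_days=30, page_on_alert=False),
    "best_effort": Policy(scrape_interval_s=300, retention_days=7, page_on_alert=False),
}

# Declarative part: which workload belongs to which tier (hypothetical names).
WORKLOAD_TIERS = {
    "payments": "critical",
    "search": "standard",
    "batch-reports": "best_effort",
}


def policy_for(workload: str, default_tier: str = "standard") -> Policy:
    """Resolve a workload's policy, falling back to a default tier."""
    tier = WORKLOAD_TIERS.get(workload, default_tier)
    return TIERS[tier]


if __name__ == "__main__":
    for name in ("payments", "batch-reports", "unlisted-service"):
        print(name, policy_for(name))
```

Keeping the mapping as data rather than scattered per-service settings means that changing how a whole tier is treated is a one-line edit, which is the appeal of the declarative approach.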

Slide 21

Slide 21 text

Thank You!
last9.io/blog
Join the Discord Community