Monitoring vs. Debugging - SRE BLR Meetup

Slide 1

Slide 1 text

SRE BLR Meetup 29th July Monitoring vs. Debugging Prathamesh Sonpatki, last9.io

Slide 2

Slide 2 text

You -> Me -> 🌮 🍱

Slide 3

Slide 3 text

Slide 4

Slide 4 text

M.E.L.T. (Metrics, Events, Logs, Traces)

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Pillars of Observability: Monitoring vs. Debugging

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Why this matters today? • Workloads have changed • Infra is cattle - ephemeral • Services are dynamic • Push to Cloud

A 3-node cluster running 10 namespaces with 5 deployments, a replica set of ~3-5, and 10 config maps emits a whopping 16,566 time series per minute using the popular kube-state-metrics exporter.
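To show how object counts turn into series counts, here is a rough back-of-the-envelope sketch in Python. It reads the cluster on this slide as 5 deployments and 10 config maps per namespace, which is an assumption. The per-object series counts are made-up placeholders, not measured kube-state-metrics values, so the result is not expected to match the 16,566 figure; it only shows how the total scales.

```python
# Hypothetical estimate: every number in ASSUMED_SERIES_PER_OBJECT is an
# illustrative placeholder, not what kube-state-metrics actually emits.

def count_objects(nodes, namespaces, deployments_per_ns, replicas, configmaps_per_ns):
    """Return a dict of object kind -> number of objects in the cluster."""
    deployments = namespaces * deployments_per_ns
    return {
        "node": nodes,
        "namespace": namespaces,
        "deployment": deployments,
        "replicaset": deployments,  # assumes one active ReplicaSet per deployment
        "pod": deployments * replicas,
        "configmap": namespaces * configmaps_per_ns,
    }


def estimate_series(objects, series_per_object):
    """Multiply each object count by its assumed series count and sum."""
    return sum(count * series_per_object[kind] for kind, count in objects.items())


ASSUMED_SERIES_PER_OBJECT = {
    "node": 30,
    "namespace": 5,
    "deployment": 15,
    "replicaset": 10,
    "pod": 40,
    "configmap": 3,
}

objects = count_objects(nodes=3, namespaces=10, deployments_per_ns=5,
                        replicas=4, configmaps_per_ns=10)
print(objects)
print("estimated series:", estimate_series(objects, ASSUMED_SERIES_PER_OBJECT))
```

Pods dominate the total because they are the most numerous objects, which is why replica counts and churn matter more than node count.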

Slide 9

Slide 9 text

Why this matters today? • Workloads have changed • Infra is cattle - ephemeral • Services are dynamic • Push to Cloud • Pod Metrics • Deployment Metrics • ReplicaSet Metrics • StatefulSet Metrics • DaemonSet Metrics • Job Metrics • Service Metrics • Namespace Metrics • Node Metrics

Slide 10

Slide 10 text

Why this matters today? • Volume • Velocity • Variety • Complexity

Slide 11

Slide 11 text

Why this matters today? • Volume • Velocity • Variety • Complexity • C.O.S.T.: Cardinality, Operations, Scale, Toil

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Outcomes we want • To not have downtimes • To mitigate problems quickly • To debug a failure • To know how the system is behaving in real time • To correlate an outage to a hardware failure • To find anomalies and patterns • To trace a payment failure • To find unknown failures before they happen • To prevent harm to customer experience and business impact

Slide 14

Slide 14 text

Questions we ask • What is wrong? • Did we change anything? • What do we do so this doesn’t repeat?

Slide 15

Slide 15 text

Answers we want • Know • Communicate • Recover • Analyse • Debug • Root cause

Slide 16

Slide 16 text

Answers we want • System Health • Quick Decisions • Time • Root Cause • Testing • Correctness

Slide 17

Slide 17 text

M.E.L.T. 201

Slide 18

Slide 18 text

Logs

Slide 19

Slide 19 text

Logs • Can be literally anything -> Unstructured logs • Standard programs -> Structured logs
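The difference between the two kinds of logs is easier to see side by side. Below is a minimal sketch using only Python's standard `logging` and `json` modules. `JsonFormatter`, `build_logger`, and the `fields` key are names invented for this illustration, not part of any library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("payments")
# Unstructured: the order id and status are buried in free text.
log.info("payment failed for order o-123 with status 502")
# Structured: the same facts as named fields that a machine can filter on.
log.info("payment failed", extra={"fields": {"order_id": "o-123", "status": 502}})
```

The second call produces a line that can be filtered by `order_id` or `status` without writing a regular expression for each message format.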

Slide 20

Slide 20 text

Logs • Getting Started ✅ • Adoption ✅ • Debugging ✅ • Relationships 🥲 • Volume 🥲 • Standardisation 🥲 • Health 🥲 • System insights 🥲

Slide 21

Slide 21 text

Logs2Metrics • Benefits of aggregation • Less volume • Patterns, Trends, Insights
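One way to picture the logs-to-metrics idea is a small aggregation pass over structured log lines. The sketch below assumes JSON logs with `service` and `status` fields; both field names and the `logs_to_metrics` helper are hypothetical.

```python
import json
from collections import Counter


def logs_to_metrics(lines, dimensions=("service", "status")):
    """Aggregate JSON log lines into counts keyed by the chosen dimensions.

    Returns (counts, skipped), where skipped is the number of lines that
    were not JSON objects.
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(event, dict):
            skipped += 1
            continue
        key = tuple(str(event.get(dim, "unknown")) for dim in dimensions)
        counts[key] += 1
    return counts, skipped


sample = [
    '{"service": "payments", "status": 500, "msg": "upstream timeout"}',
    '{"service": "payments", "status": 500, "msg": "upstream timeout"}',
    '{"service": "cart", "status": 200, "msg": "ok"}',
    "free-form line that is not JSON",
]
counts, skipped = logs_to_metrics(sample)
for key, value in sorted(counts.items()):
    print(key, value)
print("skipped:", skipped)
```

Four log lines collapse into two counters, which is the volume reduction the slide refers to; the individual messages are lost, which is the trade-off.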

Slide 22

Slide 22 text

Metrics

Slide 23

Slide 23 text

Metrics • Aggregated! • Fastest and Cheapest to understand system health • Dimensions
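A toy model helps show what "aggregated" and "dimensions" mean together. The `LabelledCounter` class below is a hypothetical in-memory stand-in, not the API of Prometheus or any client library: each distinct combination of label values becomes its own time series.

```python
from collections import defaultdict


class LabelledCounter:
    """A counter whose value is tracked separately per label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def series_count(self):
        """Number of distinct label combinations seen so far."""
        return len(self._values)

    def total(self, **match):
        """Sum across every series whose labels match the given values."""
        idx = {name: i for i, name in enumerate(self.label_names)}
        return sum(
            value
            for key, value in self._values.items()
            if all(key[idx[name]] == wanted for name, wanted in match.items())
        )


requests = LabelledCounter("http_requests_total", ["service", "status"])
requests.inc(service="payments", status="200")
requests.inc(service="payments", status="200")
requests.inc(service="payments", status="500")
print(requests.series_count())
print(requests.total(service="payments"))
```

Adding a dimension gives more ways to slice the data, but every new label value also adds series, which is where cardinality cost comes from.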

Slide 24

Slide 24 text

Metrics • Getting Started 😐 • Adoption ✅ • Debugging 🥲 • Relationships 🥲 • Volume ✅ • Standardisation ✅ • Health ✅ • System insights ✅

Slide 25

Slide 25 text

Traces

Slide 26

Slide 26 text

Traces • Relationships and Directions! • Scoped to a request/workflow
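The "relationships and directions" point can be sketched with a simplified span model. The `Span` fields below are assumptions made for illustration and are much smaller than a real OpenTelemetry span; the parent link is what gives a trace its direction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the request
    service: str
    duration_ms: float


def call_edges(spans):
    """Return directed (caller_service, callee_service) pairs for one trace."""
    by_id = {span.span_id: span for span in spans}
    edges = []
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is not None:
            edges.append((parent.service, span.service))
    return edges


trace = [
    Span("a", None, "gateway", 120.0),
    Span("b", "a", "payments", 80.0),
    Span("c", "b", "db", 30.0),
]
for caller, callee in call_edges(trace):
    print(f"{caller} -> {callee}")
```

All three spans belong to a single request, so the edges describe that request's path only, not the health of the system as a whole.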

Slide 27

Slide 27 text

Traces • Getting Started 😐 • Adoption 🥲 • Debugging ✅ • Relationships ✅ • Volume 😐 • Standardisation ✅ • Health 🥲 • System insights 🥲

Slide 28

Slide 28 text

Events

Slide 29

Slide 29 text

Events • Structured logs? • Schema based? • Domain Events • Easier to adopt? • Can unlock correlation • Dimensionality
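As a sketch of how schema-based domain events can unlock correlation, the example below answers "did we change anything?" by listing events that happened shortly before an incident. The `DomainEvent` schema and the `events_before` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainEvent:
    name: str         # e.g. "deployment.finished"
    timestamp: float  # seconds since epoch
    service: str


def events_before(events, incident_ts, window_s=900.0, service=None):
    """Events within `window_s` seconds before the incident, newest first."""
    hits = [
        event
        for event in events
        if incident_ts - window_s <= event.timestamp <= incident_ts
        and (service is None or event.service == service)
    ]
    return sorted(hits, key=lambda event: event.timestamp, reverse=True)


history = [
    DomainEvent("deployment.finished", 100.0, "payments"),
    DomainEvent("deployment.finished", 1000.0, "payments"),
    DomainEvent("config.changed", 1500.0, "cart"),
]
for event in events_before(history, incident_ts=1600.0):
    print(event.timestamp, event.service, event.name)
```

Because every event shares the same fields, the same query works for deployments, config changes, or any other domain event without per-source parsing.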

Slide 30

Slide 30 text

Events • Getting Started 😐 • Adoption 🥲 • Debugging ✅ • Relationships 🥲 • Volume ✅ • Standardisation 🥲 • Health ✅ • System insights ✅

Slide 31

Slide 31 text

Answers we want • Know • Communicate • Recover • Analyse • Debug • Root cause (Real Time vs. Post Facto)

Slide 32

Slide 32 text

Open Standards

Slide 33

Slide 33 text

Common Goals, Common Ground • Continue to leverage open source innovations • Multiple options • No Truck factor

Slide 35

Slide 35 text

OpenAPM • Metrics for APM • Based on open-source Prometheus, OpenMetrics and Grafana • Support for NodeJS and Golang • https://github.com/last9/nodejs-openapm/

Slide 36

Slide 36 text

OpenAPM

Slide 37

Slide 37 text

OpenTelemetry
80% of Telemetry data is unused

Slide 38

Slide 38 text

80% of Telemetry data is unused • Yet we store it and pay for data that is never used! • Slow dashboards, concurrent access woes • No real-time alerting
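A rough way to check a claim like this against your own setup is to compare what is stored with what is actually queried. The sketch below assumes you can export two lists of metric names, one from storage and one from dashboards and alert rules; how you obtain them depends on your backend, and `unused_metrics` is a hypothetical helper.

```python
def unused_metrics(stored, queried):
    """Return (sorted unused metric names, unused fraction of stored names)."""
    stored_set = set(stored)
    unused = stored_set - set(queried)
    ratio = len(unused) / len(stored_set) if stored_set else 0.0
    return sorted(unused), ratio


stored = [
    "http_requests_total",
    "http_request_duration_seconds",
    "go_gc_duration_seconds",
    "process_open_fds",
    "kube_pod_info",
]
queried = ["http_requests_total"]

unused, ratio = unused_metrics(stored, queried)
print(unused)
print(f"{ratio:.0%} of stored metric names are never queried")
```

This counts metric names only; weighting each name by its series count would give a closer picture of what the unused data costs.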

Slide 39

Slide 39 text

For high-precision monitoring, you need high-precision control levers

Slide 40

Slide 40 text

Control Levers • Treat workloads differently • Tiers • Policies • Declarative Observability
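One possible reading of tiers, policies, and declarative observability is a small rule table that maps workload labels to a tier, with each tier carrying its own settings. Everything in the sketch below (tier names, retention values, label keys, `resolve_tier`) is a hypothetical illustration, not a description of any specific product.

```python
# Declared once as data; the code that applies it stays generic.
TIERS = {
    "critical": {"retention_days": 90, "scrape_interval_s": 15},
    "standard": {"retention_days": 30, "scrape_interval_s": 60},
    "best_effort": {"retention_days": 7, "scrape_interval_s": 300},
}

# Policies are evaluated top to bottom; the first match wins.
POLICIES = [
    {"match": {"team": "payments"}, "tier": "critical"},
    {"match": {"env": "dev"}, "tier": "best_effort"},
]


def resolve_tier(labels, policies=POLICIES, default="standard"):
    """Return the tier name for a workload described by its labels."""
    for policy in policies:
        if all(labels.get(key) == value for key, value in policy["match"].items()):
            return policy["tier"]
    return default


for workload in (
    {"team": "payments", "env": "prod"},
    {"team": "search", "env": "dev"},
    {"team": "search", "env": "prod"},
):
    tier = resolve_tier(workload)
    print(workload, "->", tier, TIERS[tier])
```

Keeping the rules as data means a change in how a workload is treated is a reviewed config change rather than a code change.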
