SRE BLR Meetup 29th July
Monitoring vs. Debugging
Prathamesh Sonpatki, last9.io
You -> Me-> 🌮 🍱
M.E.L.T.
Pillars of Observability
Monitoring vs. Debugging
Why this matters today?
• Workloads have changed
• Infra is cattle - ephemeral
• Services are dynamic
• Push to Cloud
A 3-node cluster
running 10 namespaces
with 5 deployments
with a replica set of ~3-5
with 10 config maps
emits a whopping 16,566 time series per minute
using the popular kube-state-metrics service
• Pod Metrics
• Deployment Metrics
• ReplicaSet Metrics
• StatefulSet Metrics
• DaemonSet Metrics
• Job Metrics
• Service Metrics
• Namespace Metrics
• Node Metrics
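A rough sketch of why object counts turn into series counts so quickly. The per-object figures in METRICS_PER_OBJECT are hypothetical placeholders, not real kube-state-metrics numbers, and the cluster layout is only one reading of the example above, so the total will not match the figure quoted there. The point is that every object brings its own set of series.

```python
# Rough sketch: estimate active time series from Kubernetes object counts.
# METRICS_PER_OBJECT values are hypothetical placeholders, not real
# kube-state-metrics figures.
METRICS_PER_OBJECT = {
    "namespace": 5,
    "deployment": 15,
    "replicaset": 10,
    "pod": 40,
    "configmap": 3,
    "node": 20,
}


def estimate_series(object_counts: dict) -> int:
    """Sum (number of objects x series emitted per object) over all kinds."""
    return sum(
        count * METRICS_PER_OBJECT[kind]
        for kind, count in object_counts.items()
    )


# One possible reading of the cluster: 10 namespaces, 5 deployments in each,
# about 4 pods per deployment, 10 config maps, 3 nodes. This is an assumption.
cluster = {
    "namespace": 10,
    "deployment": 10 * 5,
    "replicaset": 10 * 5,
    "pod": 10 * 5 * 4,
    "configmap": 10,
    "node": 3,
}

total = estimate_series(cluster)
print(f"estimated series: {total}")
```

Because pods are ephemeral, each restart can add a fresh set of series on top of this estimate.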
Outcomes we want
• To not have downtimes
• To mitigate problems quickly
• To debug a failure
• To know how the system is behaving in real time
• To correlate an outage with a hardware failure
• To find anomalies and patterns
• To trace a payment failure
• To find out unknown failures before they happen
• To prevent harm to customer experience and business impact
Questions we ask
• What is wrong?
• Did we change anything?
• What do we do so this doesn’t repeat?
Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause
Answers we want
• System Health
• Quick Decisions
• Time
• Root Cause
• Testing
• Correctness
M.E.L.T. 201
Logs
• Can be literally anything -> Unstructured logs
• Standard programs -> Structured logs
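A minimal sketch of the difference between the two kinds of log line. The function names, the event name and the fields (order_id, status, latency_ms) are made up for illustration.

```python
import json
from datetime import datetime, timezone


def unstructured_log(message: str) -> str:
    """Free-form text: easy to write, hard to query reliably."""
    return f"{datetime.now(timezone.utc).isoformat()} {message}"


def structured_log(event: str, **fields) -> str:
    """One JSON object per line: every field is machine-readable."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical payment failure, logged both ways.
text_line = unstructured_log("payment failed for order o-123 with 502 after 840ms")
json_line = structured_log(
    "payment_failed", order_id="o-123", status=502, latency_ms=840
)

# The structured line can be filtered by field without regex guesswork.
parsed = json.loads(json_line)
print(text_line)
print(parsed["order_id"], parsed["status"])
```

The unstructured line carries the same facts, but pulling them back out depends on the exact wording of the message.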
Logs
• Getting Started ✅
• Adoption ✅
• Debugging ✅
• Relationships 🥲
• Volume 🥲
• Standardisation 🥲
• Health 🥲
• System insights 🥲
Logs2Metrics
• Benefits of aggregation
• Less volume
• Patterns, Trends, Insights
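A minimal sketch of the logs-to-metrics idea: many structured log lines collapse into a handful of counters. The log lines, field names and the logs_to_metrics helper are hypothetical, not taken from any particular tool.

```python
import json
from collections import Counter

# Hypothetical structured request logs, one JSON object per line.
log_lines = [
    '{"event": "request", "service": "checkout", "status": 200}',
    '{"event": "request", "service": "checkout", "status": 200}',
    '{"event": "request", "service": "checkout", "status": 502}',
    '{"event": "request", "service": "search", "status": 200}',
]


def logs_to_metrics(lines):
    """Aggregate log lines into request counts keyed by (service, status)."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts[(record["service"], record["status"])] += 1
    return counts


metrics = logs_to_metrics(log_lines)

# Four log lines become three counters; the saving grows with traffic.
for (service, status), count in sorted(metrics.items()):
    print(f'requests_total{{service="{service}",status="{status}"}} {count}')
```

The individual lines are still what you need for debugging a single request; the counters are what you watch for trends.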
Metrics
• Aggregated!
• Fastest and Cheapest to understand system health
• Dimensions
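A minimal sketch of an aggregated metric with dimensions (labels). LabelledCounter is a toy class written for this illustration, not the API of any metrics library; each distinct combination of label values becomes its own series.

```python
from collections import defaultdict


class LabelledCounter:
    """Toy counter: one running total per combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def _key(self, labels):
        # Fixed label order so the same labels always map to the same series.
        return tuple(labels[n] for n in self.label_names)

    def inc(self, amount=1.0, **labels):
        self._values[self._key(labels)] += amount

    def value(self, **labels):
        return self._values.get(self._key(labels), 0.0)

    def series_count(self):
        return len(self._values)


requests = LabelledCounter("http_requests_total", ["service", "status"])
requests.inc(service="checkout", status="200")
requests.inc(service="checkout", status="200")
requests.inc(service="checkout", status="502")

print(requests.value(service="checkout", status="200"))
print(requests.series_count())
```

Adding a high-cardinality dimension such as a user ID would multiply the number of series, which is the trade-off behind keeping dimensions deliberate.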
Metrics
• Getting Started 😐
• Adoption ✅
• Debugging 🥲
• Relationships 🥲
• Volume ✅
• Standardisation ✅
• Health ✅
• System insights ✅
Traces
• Relationships and Directions!
• Scoped to a request/workflow
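A minimal sketch of how a trace encodes relationships and direction: every span shares the request's trace ID and points at its parent. The Span class, helper functions and span names are hypothetical and much simpler than a real tracing SDK.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def start_trace(name: str) -> Span:
    """Root span: a new trace ID scoped to one request."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


def child_of(parent: Span, name: str) -> Span:
    """Child span: same trace, with a pointer back to the caller."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


def path_to_root(span: Span, spans_by_id: dict) -> list:
    """Walk parent pointers to recover the call chain for one span."""
    names = [span.name]
    while span.parent_id is not None:
        span = spans_by_id[span.parent_id]
        names.append(span.name)
    return names


# Hypothetical request flowing through three hops.
root = start_trace("POST /checkout")
payment = child_of(root, "payment-service.charge")
db = child_of(payment, "db.insert_payment")

spans_by_id = {s.span_id: s for s in (root, payment, db)}
path = path_to_root(db, spans_by_id)
print(" <- ".join(path))
```

This is what makes tracing a payment failure possible: starting from the failing span, the parent pointers lead back to the request that caused it.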
Traces
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships ✅
• Volume 😐
• Standardisation ✅
• Health 🥲
• System insights 🥲
Events
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships 🥲
• Volume ✅
• Standardisation 🥲
• Health ✅
• System insights ✅
Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause
Real Time / Post Facto
Open Standards
Common Goals, Common Ground
• Continue to leverage open source innovations
• Multiple options
• No Truck factor
OpenAPM
• Metrics for APM
• Based on open-source Prometheus, OpenMetrics and Grafana
• Support for NodeJS and Golang
• https://github.com/last9/nodejs-openapm/
OpenTelemetry
80% of Telemetry data is unused
• Yet, we store it and pay for the data that is unused!
• Slow dashboards, concurrent access woes
• No real time alerting
For high-precision monitoring,
you need high-precision control levers
Control Levers
• Treat workloads differently
• Tiers
• Policies
• Declarative Observability
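One way the levers above could fit together, as a sketch only: tiers describe how telemetry is treated, and declarative policies map workloads to tiers by label. The tier names, retention and interval values, policy selectors and the tier_for helper are all hypothetical, not a real product configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    retention_days: int
    scrape_interval_s: int


# Hypothetical tiers: the more critical the workload, the finer the data.
TIERS = {
    "critical": Tier("critical", retention_days=90, scrape_interval_s=15),
    "standard": Tier("standard", retention_days=30, scrape_interval_s=60),
    "best_effort": Tier("best_effort", retention_days=7, scrape_interval_s=300),
}

# Declarative policies: (label selector, tier name). First match wins.
POLICIES = [
    ({"team": "payments"}, "critical"),
    ({"env": "staging"}, "best_effort"),
]


def tier_for(labels: dict) -> Tier:
    """Return the tier of the first policy whose selector matches the labels."""
    for selector, tier_name in POLICIES:
        if all(labels.get(key) == value for key, value in selector.items()):
            return TIERS[tier_name]
    return TIERS["standard"]


print(tier_for({"team": "payments", "env": "prod"}).name)
print(tier_for({"team": "search", "env": "staging"}).name)
print(tier_for({"team": "search", "env": "prod"}).name)
```

Keeping the policies as data rather than code means they can be reviewed and versioned like any other configuration.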