AllDayDevops 2023
Breaking down the Pillars of Observability:
From data to outcomes
Prathamesh Sonpatki, last9.io
M.E.L.T.
Pillars of Observability
Metrics
Events
Logs
Traces
Why this matters today?
• Workloads have changed
• Infra is cattle - ephemeral
• Services are dynamic
• Cloud Native Environments
A 3 node cluster
running 10 namespaces
with 5 deployments
with a replica set of ~3-5
with 10 config maps
emits a whopping 16,566 time series per minute
using the popular kube-state-metrics service
• Pod Metrics
• Deployment Metrics
• ReplicaSet Metrics
• StatefulSet Metrics
• DaemonSet Metrics
• Job Metrics
• Service Metrics
• Namespace Metrics
• Node Metrics
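The way object counts multiply into time series can be sketched with a little arithmetic. This is only an illustration: it assumes the "5 deployments" are per namespace, takes 4 as the midpoint of the ~3-5 replicas, and uses made-up per-object series counts, so it does not reproduce the 16,566 figure quoted above.

```python
# Rough sketch of how Kubernetes object counts multiply into time series.
# The per-object series counts are hypothetical placeholders, not the
# real numbers exposed by kube-state-metrics.

namespaces = 10
deployments_per_namespace = 5   # assumption: "5 deployments" means per namespace
replicas_per_deployment = 4     # assumption: midpoint of the ~3-5 range

deployments = namespaces * deployments_per_namespace
pods = deployments * replicas_per_deployment

# Hypothetical number of series emitted for each object of a given kind.
series_per_object = {"deployment": 10, "pod": 30}
object_counts = {"deployment": deployments, "pod": pods}

# Every extra object adds its own set of series, so the total grows
# with the product of the counts above.
total_series = sum(
    object_counts[kind] * series_per_object[kind] for kind in object_counts
)

print(f"{deployments} deployments, {pods} pods, {total_series} series")
```

Adding one more label value (for example a new namespace) repeats the whole block of series for that value, which is why ephemeral workloads inflate series counts so quickly.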
Outcomes we want
• To not have downtimes
• To mitigate problems quickly
• To debug a failure
• To know how the system is behaving in real time
• To correlate an outage to a hardware failure
• To find anomalies and patterns
• To trace a payment failure
• To find out unknown failures before they happen
• To protect customer experience and prevent business impact
Questions we ask
• What is wrong?
• Did we change anything?
• What do we do so this doesn’t repeat?
Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause
Answers we want
• System Health
• Quick Decisions
• Time
• Root Cause
• Testing
• Correctness
M.E.L.T. 201
Logs
• Can be literally anything -> Unstructured logs
• Standard programs -> Structured logs
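To make the difference concrete, here is a minimal sketch of the same event written both ways. The function names and the payment fields are hypothetical, chosen only for illustration.

```python
import json

def unstructured_line(user_id, amount):
    # Free-form text: easy to write, but a consumer has to parse it
    # with a regex that breaks whenever the wording changes.
    return f"payment failed for user {user_id}, amount={amount}"

def structured_line(user_id, amount):
    # Structured log: one JSON object per line with stable field names.
    return json.dumps(
        {"event": "payment_failed", "user_id": user_id, "amount": amount},
        sort_keys=True,
    )

print(unstructured_line(42, 99.5))
print(structured_line(42, 99.5))

# A consumer can read the fields back without any custom parsing.
record = json.loads(structured_line(42, 99.5))
```

The structured form costs a little more effort up front, but fields such as `user_id` can then be filtered and aggregated directly.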
Logs
• Getting Started ✅
• Adoption ✅
• Debugging ✅
• Relationships 🥲
• Volume 🥲
• Standardisation 🥲
• Health 🥲
• System insights 🥲
Metrics
• Aggregated!
• Fastest and Cheapest to understand system health
• Dimensions
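A minimal sketch of what "aggregated" and "dimensions" mean in practice: instead of storing every request, a counter keeps one number per combination of label values. The metric name, services, and status codes below are hypothetical.

```python
from collections import Counter

# One counter, keyed by its dimensions (service, status).
# Each distinct key corresponds to one time series.
requests_total = Counter()

def record_request(service, status):
    # Aggregation: the individual request is discarded, only the count survives.
    requests_total[(service, status)] += 1

for service, status in [
    ("checkout", "200"),
    ("checkout", "500"),
    ("checkout", "200"),
    ("search", "200"),
]:
    record_request(service, status)

def errors_for(service):
    # Slice by dimension: sum all 5xx counts for a single service.
    return sum(
        count
        for (svc, status), count in requests_total.items()
        if svc == service and status.startswith("5")
    )

print(dict(requests_total))
print(errors_for("checkout"))
```

Four requests collapse into three series here, which is why metrics are cheap for judging health but cannot tell you which individual request failed.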
Metrics
• Getting Started 😐
• Adoption ✅
• Debugging 🥲
• Relationships 🥲
• Volume 😐
• Standardisation ✅
• Health ✅
• System insights ✅
Traces
• Relationships and Directions!
• Scoped to a request/workflow
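The relationship and direction that traces capture can be sketched as spans that share a trace ID and point at their parent. This is a simplified model with hypothetical span names, not the OpenTelemetry API.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # direction: who called this span
    name: str

def start_span(name, parent=None):
    # A child inherits the trace ID; a root span starts a new trace.
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)

root = start_span("POST /checkout")
auth = start_span("auth.verify", parent=root)
charge = start_span("payments.charge", parent=root)
db = start_span("db.insert", parent=charge)

def path_to_root(span, spans):
    # Walk the parent links to recover the call chain for one span.
    by_id = {s.span_id: s for s in spans}
    names = [span.name]
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        names.append(span.name)
    return names

print(path_to_root(db, [root, auth, charge, db]))
```

Following the parent links from a failing span back to the root is what makes traces useful for debugging a single request.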
Traces
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships ✅
• Volume 😐
• Standardisation ✅
• Health 🥲
• System insights 🥲
Events
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships 🥲
• Volume ✅
• Standardisation 🥲
• Health ✅
• System insights ✅
Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause
Real Time / Post Facto
Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause
SRE/DevOps Programmer/Developers
OpenTelemetry
80% of Telemetry data is unused
• Yet, we store it and pay for the data that is unused!
• Slow dashboards, concurrent access woes
• No real-time alerting
• Cost vs. Performance vs. Retention tradeoffs
For high-precision observability,
you need high-precision control levers
Control Levers
• Treat workloads differently
• Fast vs. Slow Data Tiers
• Policies
• Declarative Observability
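One way to picture these levers is a declarative policy table that routes each metric to a data tier. Everything in this sketch is hypothetical (the patterns, tier names, and retention values) and it is not Levitate's configuration format or API.

```python
from fnmatch import fnmatchcase

# Declarative policies: what to do with the data, not how to do it.
# The first matching rule wins. All values are made up for illustration.
POLICIES = [
    {"match": "payments_*", "tier": "fast", "retention_days": 30},
    {"match": "debug_*", "tier": "slow", "retention_days": 7},
]

# Fallback for workloads that no policy singles out.
DEFAULT_POLICY = {"match": "*", "tier": "slow", "retention_days": 90}

def policy_for(metric_name):
    # Treat workloads differently: pick the tier by metric name pattern.
    for policy in POLICIES:
        if fnmatchcase(metric_name, policy["match"]):
            return policy
    return DEFAULT_POLICY

for name in ["payments_latency_seconds", "debug_heap_bytes", "node_cpu_seconds"]:
    chosen = policy_for(name)
    print(name, chosen["tier"], chosen["retention_days"])
```

Keeping the rules as data rather than code means the cost, performance, and retention tradeoff can be changed per workload without touching the services that emit the telemetry.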
Levitate - A Managed Time Series Data Warehouse
Thank You!