Breaking down the Pillars of Observability: From data to outcomes

AllDayDevops 2023
Prathamesh Sonpatki, last9.io

M.E.L.T.

Pillars of Observability
• Metrics
• Events
• Logs
• Traces

Why this matters today?
• Workloads have changed
• Infra is cattle - ephemeral
• Services are dynamic
• Cloud Native Environments
• Pod Metrics
• Deployment Metrics
• ReplicaSet Metrics
• StatefulSet Metrics
• DaemonSet Metrics
• Job Metrics
• Service Metrics
• Namespace Metrics
• Node Metrics

A 3-node cluster running 10 namespaces with 5 deployments, a replica set of ~3-5, and 10 config maps emits a whopping 16566 time series per minute using the popular kube-state-metrics library.
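The series count grows multiplicatively with the number of objects in the cluster. A minimal sketch of that arithmetic follows. The per-object series counts in `SERIES_PER_OBJECT` are illustrative assumptions, not real kube-state-metrics figures, and the sketch reads the deployment and config map counts as per-namespace, so it is not meant to reproduce the 16566 number from the slide.

```python
# Rough estimate of active time series for a small cluster.
# The values below are hypothetical, chosen only to show how the
# total scales; they are not measured kube-state-metrics numbers.
SERIES_PER_OBJECT = {
    "node": 20,
    "namespace": 5,
    "deployment": 15,
    "pod": 40,
    "configmap": 3,
}


def estimate_series(nodes, namespaces, deployments_per_ns,
                    replicas_per_deployment, configmaps_per_ns):
    """Count objects of each kind, then multiply by series per object."""
    objects = {
        "node": nodes,
        "namespace": namespaces,
        "deployment": namespaces * deployments_per_ns,
        "pod": namespaces * deployments_per_ns * replicas_per_deployment,
        "configmap": namespaces * configmaps_per_ns,
    }
    return sum(count * SERIES_PER_OBJECT[kind]
               for kind, count in objects.items())


if __name__ == "__main__":
    # Shape of the cluster from the slide, with 4 replicas as a midpoint.
    print(estimate_series(3, 10, 5, 4, 10))
```

Pods dominate the total because their count is a product of three inputs, which is why adding a label or a replica is more expensive than it looks.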

Why this matters today?
• Volume
• Velocity
• Variety
• Complexity
• C.O.S.T.
  - Cardinality
  - Operations
  - Scale
  - Toil

Outcomes we want
• To not have downtimes
• To mitigate problems quickly
• To debug a failure
• To know how the system is behaving in real time
• To correlate an outage to a hardware failure
• To find anomalies and patterns
• To trace a payment failure
• To find unknown failures before they happen
• To prevent harm to customer experience and business impact

Questions we ask
• What is wrong?
• Did we change anything?
• What do we do so this doesn’t repeat?

Answers we want
• Know
• Communicate
• Recover
• Analyse
• Debug
• Root cause

Answers we want
• System Health
• Quick Decisions
• Time
• Root Cause
• Testing
• Correctness

M.E.L.T. 201

Logs
• Can be literally anything -> Unstructured logs
• Standard programs -> Structured logs
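To make the unstructured versus structured distinction concrete, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `payments` logger name, and the `order_id` and `status` fields are hypothetical names chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical field names; a real service would define its own.
    EXTRA_FIELDS = ("order_id", "status")

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


if __name__ == "__main__":
    logger = logging.getLogger("payments")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)

    # An unstructured log would bury the order id inside free text.
    # Here the same facts are emitted as queryable fields.
    logger.error("payment failed",
                 extra={"order_id": "o-42", "status": "declined"})
```

With fields instead of free text, a query such as "all declined payments" becomes a filter on `status` rather than a regular expression over messages.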

Logs
• Getting Started ✅
• Adoption ✅
• Debugging ✅
• Relationships 🥲
• Volume 🥲
• Standardisation 🥲
• Health 🥲
• System insights 🥲

Metrics
• Aggregated!
• Fastest and Cheapest to understand system health
• Dimensions
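A small sketch of what "aggregated" and "dimensions" mean in practice: individual requests are discarded and only a count per label combination is kept. `RequestCounter` and its label names are hypothetical and do not correspond to any particular metrics library.

```python
from collections import Counter


class RequestCounter:
    """A counter metric keyed by label values (dimensions)."""

    def __init__(self):
        self._series = Counter()

    def inc(self, **labels):
        # Each distinct label combination is its own time series.
        key = tuple(sorted(labels.items()))
        self._series[key] += 1

    def total(self, **match):
        """Sum all series whose labels include the given pairs."""
        want = set(match.items())
        return sum(value for key, value in self._series.items()
                   if want <= set(key))

    def series_count(self):
        """Number of distinct label combinations seen so far."""
        return len(self._series)


if __name__ == "__main__":
    requests = RequestCounter()
    requests.inc(service="cart", status="200")
    requests.inc(service="cart", status="500")
    requests.inc(service="pay", status="200")
    print(requests.total(service="cart"), requests.series_count())
```

The aggregate answers "how healthy is the cart service" cheaply, but the individual failing request is gone, which is why metrics score poorly on debugging.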

Metrics
• Getting Started 😐
• Adoption ✅
• Debugging 🥲
• Relationships 🥲
• Volume 😐
• Standardisation ✅
• Health ✅
• System insights ✅

Traces
• Relationships and Directions!
• Scoped to a request/workflow
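The "relationships and directions" of a trace come from each span pointing at its parent within one request. A minimal sketch, assuming a simplified `Span` record; the field names mirror common tracing terminology but this is not the OpenTelemetry API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work inside a single request (hypothetical shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str


def call_path(spans, span_id):
    """Walk parent links from a span up to the root of its trace."""
    by_id = {span.span_id: span for span in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.name)
        # The root span has no parent, which ends the walk.
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


if __name__ == "__main__":
    spans = [
        Span("t1", "a", None, "checkout"),
        Span("t1", "b", "a", "payment"),
        Span("t1", "c", "b", "card-gateway"),
    ]
    print(" -> ".join(call_path(spans, "c")))
```

Following the parent links answers "who called whom" for one request, which neither logs nor metrics can reconstruct on their own.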

Traces
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships ✅
• Volume 😐
• Standardisation ✅
• Health 🥲
• System insights 🥲

Events
• Structured logs?
• Schema based?
• Domain Events
• Easier to adopt?
• Can unlock correlation
• Dimensionality
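One way to read "schema based" and "can unlock correlation" is a typed domain event that can be lined up against an incident time to answer "Did we change anything?". The `DeploymentEvent` schema and the `events_near` helper below are hypothetical, a sketch rather than a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ALLOWED_ENVIRONMENTS = {"staging", "production"}


@dataclass(frozen=True)
class DeploymentEvent:
    """A domain event with a fixed schema (hypothetical fields)."""
    service: str
    version: str
    environment: str
    occurred_at: datetime

    def __post_init__(self):
        # Rejecting unknown values is what makes the schema enforceable.
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment}")


def events_near(events, at, window):
    """Return events that occurred within `window` of the time `at`."""
    return [event for event in events
            if abs(event.occurred_at - at) <= window]


if __name__ == "__main__":
    incident = datetime.now(timezone.utc)
    events = [
        DeploymentEvent("cart", "v2", "production",
                        incident - timedelta(minutes=5)),
        DeploymentEvent("pay", "v7", "production",
                        incident - timedelta(hours=3)),
    ]
    for event in events_near(events, incident, timedelta(minutes=15)):
        print(event.service, event.version)
```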

Events
• Getting Started 😐
• Adoption 🥲
• Debugging ✅
• Relationships 🥲
• Volume ✅
• Standardisation 🥲
• Health ✅
• System insights ✅

Answers we want
• Real Time: Know, Communicate, Recover
• Post Facto: Analyse, Debug, Root cause

Answers we want
• SRE/DevOps: Know, Communicate, Recover
• Programmer/Developers: Analyse, Debug, Root cause

OpenTelemetry

80% of Telemetry data is unused
• Yet, we store it and pay for the data that is unused!
• Slow dashboards, concurrent access woes
• No real time alerting
• Cost vs. Performance vs. Retention tradeoffs

For high-precision observability, you need high-precision control levers

Control Levers
• Treat workloads differently
• Fast vs. Slow Data Tiers
• Policies
• Declarative Observability
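A minimal sketch of how these levers could fit together: a declarative list of policies that routes each metric to a fast or slow tier. The policy format, the metric name patterns, and the retention values are all hypothetical; this is not Levitate's API or configuration syntax.

```python
import fnmatch

# Hypothetical declarative policies; the first matching rule wins.
# Patterns and retention values are made up for illustration.
POLICIES = [
    {"match": "checkout_*", "tier": "fast", "retention_days": 30},
    {"match": "payment_*", "tier": "fast", "retention_days": 30},
    {"match": "*", "tier": "slow", "retention_days": 365},
]


def route(metric_name, policies=POLICIES):
    """Return the first policy whose pattern matches the metric name."""
    for rule in policies:
        if fnmatch.fnmatchcase(metric_name, rule["match"]):
            return rule
    raise LookupError(f"no policy matches {metric_name!r}")


if __name__ == "__main__":
    for name in ("checkout_latency_seconds", "kube_pod_info"):
        rule = route(name)
        print(name, "->", rule["tier"], rule["retention_days"])
```

Keeping the rules as data rather than code means the workload-specific treatment can be reviewed and changed without touching the ingestion path.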

Levitate - A Managed Time Series Data Warehouse

Thank You!
