Breaking down the Pillars of Observability: From data to outcomes

Slide 1

Slide 1 text

AllDayDevops 2023 Breaking down the Pillars of Observability: From data to outcomes Prathamesh Sonpatki, last9.io

Slide 2

Slide 2 text

Slide 3

Slide 3 text

M.E.L.T.

Slide 4

Slide 4 text

Pillars of Observability • Metrics • Events • Logs • Traces

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Why this matters today? • Workloads have changed • Infra is cattle - ephemeral • Services are dynamic • Cloud Native Environments A 3-node cluster running 10 namespaces with 5 deployments, replica sets of ~3-5, and 10 config maps emits a whopping 16566 time series per minute using the popular kube-state-metrics library
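
A rough sketch of why series counts climb so quickly in a cluster like this one. The per-object series counts are hypothetical placeholders, not the real kube-state-metrics figures, and the sketch assumes 5 deployments per namespace and 10 config maps in total, so the result will not match the 16566 quoted on the slide. It only shows that the total is driven by object count multiplied by series per object.

```python
# Hypothetical number of time series emitted per Kubernetes object.
# Real kube-state-metrics counts differ by version and configuration.
SERIES_PER_OBJECT = {
    "node": 20,
    "namespace": 5,
    "deployment": 15,
    "replicaset": 10,
    "pod": 40,
    "configmap": 4,
}


def estimate_series(nodes, namespaces, deployments_per_ns, replicas, configmaps):
    """Estimate active series as sum(object count * series per object)."""
    deployments = namespaces * deployments_per_ns
    counts = {
        "node": nodes,
        "namespace": namespaces,
        "deployment": deployments,
        "replicaset": deployments,      # assume one active ReplicaSet per deployment
        "pod": deployments * replicas,  # pods dominate the total
        "configmap": configmaps,
    }
    return sum(counts[kind] * SERIES_PER_OBJECT[kind] for kind in counts)


if __name__ == "__main__":
    # 3 nodes, 10 namespaces, 5 deployments each, 4 replicas, 10 config maps
    print(estimate_series(3, 10, 5, 4, 10))
```

Because pods are ephemeral, every restart or rollout also creates fresh label values, so the number of series seen over time is higher than the number active at any instant.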

Slide 7

Slide 7 text

Why this matters today? • Workloads have changed • Infra is cattle - ephemeral • Services are dynamic • Cloud Native Environments • Pod Metrics • Deployment Metrics • ReplicaSet Metrics • StatefulSet Metrics • DaemonSet Metrics • Job Metrics • Service Metrics • Namespace Metrics • Node Metrics

Slide 8

Slide 8 text

Why this matters today? • Volume • Velocity • Variety • Complexity

Slide 9

Slide 9 text

Why this matters today? • Volume • Velocity • Variety • Complexity • C.O.S.T. - Cardinality - Operations - Scale - Toil

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Outcomes we want • To not have downtimes • To mitigate problems quickly • To debug a failure • To know how the system is behaving in real time • To correlate an outage to a hardware failure • To find anomalies and patterns • To trace a payment failure • To find unknown failures before they happen • To avoid hurting customer experience and business

Slide 12

Slide 12 text

Questions we ask • What is wrong? • Did we change anything? • What do we do so this doesn’t repeat?

Slide 13

Slide 13 text

Answers we want • Know • Communicate • Recover • Analyse • Debug • Root cause

Slide 14

Slide 14 text

Answers we want • System Health • Quick Decisions • Time • Root Cause • Testing • Correctness

Slide 15

Slide 15 text

M.E.L.T. 201

Slide 16

Slide 16 text

Logs

Slide 17

Slide 17 text

Logs • Can be literally anything -> Unstructured logs • Standard programs -> Structured logs
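
To make the unstructured versus structured distinction concrete, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `fields` attribute, and the payment example are illustrative assumptions, not something shown in the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical attribute carrying key/value context.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def unstructured_line(order_id, latency_ms):
    # Free text: easy to write, but needs regexes to query later.
    return f"payment failed for order {order_id} after {latency_ms}ms"


def structured_line(order_id, latency_ms):
    # Same information, but every value is an addressable field.
    record = logging.LogRecord(
        name="payments", level=logging.ERROR, pathname="example.py",
        lineno=0, msg="payment failed", args=(), exc_info=None,
    )
    record.fields = {"order_id": order_id, "latency_ms": latency_ms}
    return JsonFormatter().format(record)


if __name__ == "__main__":
    print(unstructured_line("o-42", 250))
    print(structured_line("o-42", 250))
```

With the structured form, a query such as "all failures with latency_ms above 200" becomes a field filter rather than a text search.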

Slide 18

Slide 18 text

Logs • Getting Started ✅ • Adoption ✅ • Debugging ✅ • Relationships 🥲 • Volume 🥲 • Standardisation 🥲 • Health 🥲 • System insights 🥲

Slide 19

Slide 19 text

Metrics

Slide 20

Slide 20 text

Metrics • Aggregated! • Fastest and Cheapest to understand system health • Dimensions
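
The "aggregated" and "dimensions" points can be illustrated with a toy counter. This is a sketch in plain Python, not the API of any real metrics client; `LabeledCounter` and the label names are hypothetical.

```python
from collections import Counter


class LabeledCounter:
    """A toy counter that keeps one aggregated value per label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def series_count(self):
        # One time series per distinct label combination (cardinality).
        return len(self._values)

    def total(self, **match):
        # Sum across every series whose labels match the given dimensions.
        index = {n: i for i, n in enumerate(self.label_names)}
        return sum(
            value
            for key, value in self._values.items()
            if all(key[index[n]] == v for n, v in match.items())
        )


if __name__ == "__main__":
    requests = LabeledCounter("http_requests_total", ["service", "status"])
    for _ in range(1000):
        requests.inc(service="checkout", status="200")
    requests.inc(service="checkout", status="500")
    print(requests.series_count(), requests.total(status="500"))
```

A thousand requests collapse into a handful of numbers, which is why metrics are cheap for health checks. The flip side is that each extra dimension multiplies the number of series.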

Slide 21

Slide 21 text

Metrics • Getting Started 😐 • Adoption ✅ • Debugging 🥲 • Relationships 🥲 • Volume 😐 • Standardisation ✅ • Health ✅ • System insights ✅

Slide 22

Slide 22 text

Traces

Slide 23

Slide 23 text

Traces • Relationships and Directions! • Scoped to a request/workflow
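
A minimal sketch of how a trace encodes relationships and direction: each span points at its parent, and all spans of one request share a trace ID. The `Span` fields and the `call_path` helper are simplified assumptions for illustration, not the OpenTelemetry data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float


def call_path(spans: List[Span], span_id: str) -> List[str]:
    """Walk parent links from a span up to the root; return root-first names."""
    by_id = {s.span_id: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


if __name__ == "__main__":
    spans = [
        Span("t1", "a", None, "checkout", 120.0),
        Span("t1", "b", "a", "charge-card", 95.0),
        Span("t1", "c", "b", "db-insert", 80.0),
    ]
    print(" -> ".join(call_path(spans, "c")))
```

The parent links are what let a trace answer "which upstream call led to this slow query", something neither logs nor metrics carry by default.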

Slide 24

Slide 24 text

Traces • Getting Started 😐 • Adoption 🥲 • Debugging ✅ • Relationships ✅ • Volume 😐 • Standardisation ✅ • Health 🥲 • System insights 🥲

Slide 25

Slide 25 text

Events

Slide 26

Slide 26 text

Events • Structured logs? • Schema based? • Domain Events • Easier to adopt? • Can unlock correlation • Dimensionality
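
One way to read "schema based" and "can unlock correlation" is sketched below: a domain event with a fixed set of fields, plus a lookup that answers "Did we change anything?" shortly before an incident. `DeployEvent`, the allowed environments, and the 15-minute window are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_ENVIRONMENTS = {"staging", "production"}


@dataclass(frozen=True)
class DeployEvent:
    """A schema-based domain event: fixed fields, validated on creation."""
    service: str
    version: str
    environment: str
    timestamp: float  # seconds since epoch

    def __post_init__(self):
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")


def changes_before(events: List[DeployEvent], service: str,
                   incident_ts: float, window_s: float = 900.0):
    """Return events for the service within window_s seconds before the incident."""
    return [
        e for e in events
        if e.service == service and 0 <= incident_ts - e.timestamp <= window_s
    ]


if __name__ == "__main__":
    events = [
        DeployEvent("payments", "v1", "production", 10.0),
        DeployEvent("payments", "v2", "production", 1000.0),
    ]
    print(changes_before(events, "payments", incident_ts=1200.0))
```

Because every event has the same shape, correlating an alert with a recent change becomes a filter over known fields instead of a search through free text.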

Slide 27

Slide 27 text

Events • Getting Started 😐 • Adoption 🥲 • Debugging ✅ • Relationships 🥲 • Volume ✅ • Standardisation 🥲 • Health ✅ • System insights ✅

Slide 28

Slide 28 text

Answers we want • Know • Communicate • Recover • Analyse • Debug • Root cause Real Time Post Facto

Slide 29

Slide 29 text

Answers we want • Know • Communicate • Recover • Analyse • Debug • Root cause SRE/DevOps Programmer/Developers

Slide 30

Slide 30 text

OpenTelemetry 80% of Telemetry data is unused

Slide 31

Slide 31 text

80% of Telemetry data is unused • Yet, we store it and pay for the data that is unused! • Slow dashboards, concurrent access woes • No real time alerting • Cost vs. Performance vs. Retention tradeoffs

Slide 32

Slide 32 text

For high precision observability, you need high precision control levers

Slide 33

Slide 33 text

Control Levers • Treat workloads differently • Fast vs. Slow Data Tiers • Policies • Declarative Observability
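
As a sketch of what declarative policies and fast versus slow tiers might look like, the example below routes telemetry by its labels. The policy format, tier names, and retention values are hypothetical and are not Levitate's actual configuration; they only illustrate treating workloads differently through data rather than code.

```python
# Hypothetical declarative policies; the first matching policy wins.
POLICIES = [
    {"match": {"team": "payments"}, "tier": "fast", "retention_days": 30},
    {"match": {"env": "staging"}, "tier": "slow", "retention_days": 7},
]

# Applied when no policy matches.
DEFAULT_POLICY = {"tier": "slow", "retention_days": 14}


def route(labels):
    """Pick a storage tier and retention for a series based on its labels."""
    for policy in POLICIES:
        if all(labels.get(k) == v for k, v in policy["match"].items()):
            return {
                "tier": policy["tier"],
                "retention_days": policy["retention_days"],
            }
    return dict(DEFAULT_POLICY)


if __name__ == "__main__":
    print(route({"team": "payments", "env": "production"}))
    print(route({"team": "search", "env": "staging"}))
```

Keeping the rules as data means the cost, performance, and retention tradeoff can be changed per workload without redeploying the services that emit the telemetry.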

Slide 34

Slide 34 text

Levitate - A Managed Time Series Data Warehouse Thank You!