
Breaking down the Pillars of Observability: From data to outcomes

Prathamesh Sonpatki
October 26, 2023


Transcript

  1. AllDayDevops 2023
    Breaking down the Pillars of Observability: From data to outcomes
    Prathamesh Sonpatki, last9.io

  2. Pillars of Observability
    Metrics
    Events
    Logs
    Traces

  3. Why does this matter today?
    • Workloads have changed

    • Infra is cattle - ephemeral

    • Services are dynamic

    • Cloud Native Environments
    A 3-node cluster running 10 namespaces with 5 deployments, with a replica set of ~3-5 and 10 config maps,
    emits a whopping 16566 time series per minute using the popular kube-state-metrics library
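
A rough sense of why the series count grows so quickly comes from counting the objects involved. The sketch below assumes the slide's figures nest (5 deployments in each of the 10 namespaces, 3 to 5 pods per deployment), which the slide does not state explicitly. It leaves out config maps and does not try to reproduce the 16566 figure, since that also depends on how many series are emitted per object.

```python
# Back-of-the-envelope object count for the cluster described on the slide.
# Assumption (not stated on the slide): the counts nest, i.e. each namespace
# has 5 deployments and each deployment runs 3 to 5 pod replicas.
NODES = 3
NAMESPACES = 10
DEPLOYMENTS_PER_NAMESPACE = 5
REPLICAS_RANGE = (3, 5)


def count_objects(replicas: int) -> dict:
    """Count Kubernetes objects for a given replica count per deployment."""
    deployments = NAMESPACES * DEPLOYMENTS_PER_NAMESPACE
    return {
        "nodes": NODES,
        "namespaces": NAMESPACES,
        "deployments": deployments,
        "replicasets": deployments,  # assumes one active replica set each
        "pods": deployments * replicas,
    }


low = count_objects(REPLICAS_RANGE[0])
high = count_objects(REPLICAS_RANGE[1])
print(low["pods"], high["pods"])  # 150 250
```

Every one of these objects reports several metrics, and each distinct label combination is its own time series, so a few hundred objects can plausibly turn into thousands of series.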

  4. Why does this matter today?
    • Workloads have changed

    • Infra is cattle - ephemeral

    • Services are dynamic

    • Cloud Native Environments
    • Pod Metrics
    • Deployment Metrics
    • ReplicaSet Metrics
    • StatefulSet Metrics
    • DaemonSet Metrics
    • Job Metrics
    • Service Metrics
    • Namespace Metrics
    • Node Metrics

  5. Why does this matter today?
    • Volume

    • Velocity

    • Variety

    • Complexity

  6. Why does this matter today?
    • Volume

    • Velocity

    • Variety

    • Complexity

    • C.O.S.T.
    - Cardinality

    - Operations

    - Scale

    - Toil

  7. Outcomes we want
    • To not have downtimes
    • To mitigate problems quickly
    • To debug a failure
    • To know how the system is behaving in real time
    • To correlate an outage with a hardware failure
    • To find anomalies and patterns
    • To trace a payment failure
    • To find unknown failures before they happen
    • To avoid hurting customer experience and the business

  8. Questions we ask
    • What is wrong?

    • Did we change anything?

    • What do we do so this doesn’t repeat?

  9. Answers we want
    • Know

    • Communicate

    • Recover
    • Analyse

    • Debug

    • Root cause

  10. Answers we want
    • System Health

    • Quick Decisions

    • Time
    • Root Cause

    • Testing

    • Correctness

  11. M.E.L.T. 201

  12. Logs
    • Can be literally anything -> Unstructured logs

    • Standard programs -> Structured logs
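
To make the contrast concrete, here is a minimal sketch of the same failure written as a free-form line and as a structured record. It uses Python's standard `logging` and `json` modules; the `JsonFormatter` class, the `fields` attribute, and the order data are illustrative assumptions, not something from the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # `fields` is a hypothetical attribute we attach ourselves below.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


# Unstructured: the facts are buried in prose and need a regex to extract.
unstructured = "payment failed for order 42 after 3 retries"

# Structured: the same facts as named, typed fields.
record = logging.LogRecord(
    name="checkout", level=logging.ERROR, pathname="example.py", lineno=0,
    msg="payment failed", args=(), exc_info=None,
)
record.fields = {"order_id": 42, "retries": 3}
structured = JsonFormatter().format(record)

print(unstructured)
print(structured)
```

The structured form can be filtered by `order_id` or `retries` directly, whereas the free-form line has to be parsed first.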

  13. Logs
    • Getting Started ✅

    • Adoption ✅

    • Debugging ✅

    • Relationships 🥲

    • Volume 🥲

    • Standardisation 🥲

    • Health 🥲

    • System insights 🥲

  14. Metrics
    • Aggregated!

    • Fastest and cheapest way to understand system health

    • Dimensions
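
The three points above can be illustrated with a small sketch: individual requests are aggregated into a counter keyed by its dimensions, and a health question is then answered from the aggregates alone. The request data and the metric name `http_requests_total` are made up for illustration.

```python
from collections import Counter

# Raw observations: one entry per handled request (hypothetical data).
requests = [
    {"service": "checkout", "status": "200"},
    {"service": "checkout", "status": "500"},
    {"service": "checkout", "status": "200"},
    {"service": "search", "status": "200"},
]

# Aggregate into a counter keyed by its dimensions (labels).
http_requests_total = Counter((r["service"], r["status"]) for r in requests)

# Each distinct label combination is one time series.
series_count = len(http_requests_total)


def error_ratio(service: str) -> float:
    """Share of 5xx responses for one service, from aggregates only."""
    total = sum(
        n for (svc, _), n in http_requests_total.items() if svc == service
    )
    errors = sum(
        n for (svc, status), n in http_requests_total.items()
        if svc == service and status.startswith("5")
    )
    return errors / total if total else 0.0


print(series_count)             # 3
print(error_ratio("checkout"))  # 0.3333333333333333
```

Note what is lost: the counter can say that a third of checkout requests failed, but not which ones, which is why metrics score poorly on debugging.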

  15. Metrics
    • Getting Started 😐

    • Adoption ✅

    • Debugging 🥲

    • Relationships 🥲

    • Volume 😐

    • Standardisation ✅

    • Health ✅

    • System insights ✅

  16. Traces
    • Relationships and Directions!

    • Scoped to a request/workflow
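
As a minimal sketch of "relationships and directions", a trace can be modelled as spans that share a trace ID and point at their parent. The `Span` class and the span names below are hypothetical and far simpler than a real tracing SDK.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


spans = [
    Span("t1", "a", None, "POST /pay"),
    Span("t1", "b", "a", "auth.check"),
    Span("t1", "c", "a", "payments.charge"),
    Span("t1", "d", "c", "db.insert"),
]


def render(spans: List[Span]) -> List[str]:
    """Return the call tree as indented lines, following parent links."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            lines.append("  " * depth + s.name)
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


tree = render(spans)
print("\n".join(tree))
```

The parent links give direction (who called whom), and the shared trace ID is what scopes everything to one request.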

  17. Traces
    • Getting Started 😐

    • Adoption 🥲

    • Debugging ✅

    • Relationships ✅

    • Volume 😐

    • Standardisation ✅

    • Health 🥲

    • System insights 🥲

  18. Events
    • Structured logs?

    • Schema based?

    • Domain Events

    • Easier to adopt?

    • Can unlock correlation

    • Dimensionality
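
One way to picture schema-based domain events unlocking correlation is sketched below: a deployment event and an error-spike event share a `service` dimension, so "did we change anything?" becomes a simple lookup. The event types, field names, and the 600-second window are illustrative assumptions.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DeploymentFinished:
    service: str
    version: str
    timestamp: int  # seconds since epoch


@dataclass(frozen=True)
class ErrorSpike:
    service: str
    timestamp: int


def deploys_before(
    spike: ErrorSpike,
    deploys: List[DeploymentFinished],
    window_seconds: int = 600,
) -> List[DeploymentFinished]:
    """Deploys of the same service within the window before the spike."""
    return [
        d for d in deploys
        if d.service == spike.service
        and 0 <= spike.timestamp - d.timestamp <= window_seconds
    ]


deploys = [
    DeploymentFinished("checkout", "v42", 1_000),
    DeploymentFinished("search", "v7", 1_100),
    DeploymentFinished("checkout", "v41", 100),
]
spike = ErrorSpike("checkout", 1_300)
suspects = deploys_before(spike, deploys)
print([asdict(d) for d in suspects])
```

Because both event types follow a fixed schema, the join needs no parsing; only the `checkout` deploy that landed shortly before the spike is returned.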

  19. Events
    • Getting Started 😐

    • Adoption 🥲

    • Debugging ✅

    • Relationships 🥲

    • Volume ✅

    • Standardisation 🥲

    • Health ✅

    • System insights ✅

  20. Answers we want
    Real Time
    • Know

    • Communicate

    • Recover
    Post Facto
    • Analyse

    • Debug

    • Root cause

  21. Answers we want
    SRE/DevOps
    • Know

    • Communicate

    • Recover
    Programmers/Developers
    • Analyse

    • Debug

    • Root cause

  22. OpenTelemetry
    80% of Telemetry data is unused


  23. 80% of Telemetry data is unused
    • Yet, we store it and pay for the data that is unused!

    • Slow dashboards, concurrent access woes

    • No real time alerting

    • Cost vs. Performance vs. Retention tradeoffs

  24. For high-precision observability,
    you need high-precision control levers


  25. Control Levers
    • Treat workloads differently

    • Fast vs. Slow Data Tiers

    • Policies

    • Declarative Observability
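
As a rough illustration of how these levers might fit together, the sketch below declares policies that send different workloads to a fast or slow tier with different retention. Everything here (the `Policy` shape, the glob patterns, the tier names, the retention values) is hypothetical and is not the API of Levitate or any other product.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Policy:
    match: str           # glob pattern on the metric name
    tier: str            # "fast" or "slow"
    retention_days: int


# Declarative part: what should happen, not how it is carried out.
POLICIES = [
    Policy("checkout_*", tier="fast", retention_days=30),
    Policy("batch_*", tier="slow", retention_days=365),
]
DEFAULT = Policy("*", tier="slow", retention_days=90)


def resolve(metric_name: str) -> Policy:
    """Return the first policy whose pattern matches, else the default."""
    for policy in POLICIES:
        if fnmatchcase(metric_name, policy.match):
            return policy
    return DEFAULT


print(resolve("checkout_latency_seconds").tier)  # fast
print(resolve("node_cpu_seconds_total").tier)    # slow
```

Keeping the rules as data means they can be reviewed and changed without touching the pipeline that applies them.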

  26. Levitate - A Managed Time Series Data Warehouse
    Thank You!