
Monitoring vs. Debugging - IG Meetup 22nd July

Outline

- Two ways of using observability data: monitoring vs. debugging, and the difference between them.
- Explain Metrics, Logs, Traces, and Events in the context of monitoring vs. debugging.
- Rise of open standards
- Importance of OpenTelemetry, OpenMetrics
- Convergence toward common goals
- Vendor-neutral
- Common ground
- Application monitoring using OpenTelemetry
- Automatic instrumentation vs. Manual instrumentation
- Not all data is the same
- 80% of metrics are unused
- Control levers for high precision monitoring and o11y
- Treat workloads differently
- Tiers
- Policies
- Data Engineering
- Today’s SRE/DevOps need control levers for managing o11y.

Relevant blog posts

Pillars of Observability

Observability - OSS vs. Paid vs. Managed OSS

Taking back control of your monitoring

What is Prometheus Remote Write

Prathamesh Sonpatki

July 22, 2023

Transcript

  3. Why this matters today?
      - Workloads have changed
      - Infra is cattle: ephemeral
      - Services are dynamic
      - Push to Cloud
      - A 3-node cluster running 10 namespaces, with 5 deployments, a replica set of ~3-5, and 10 config maps, emits a whopping 16,566 time series per minute using the popular kube-state-metrics service
  4. Why this matters today?
      - Workloads have changed
      - Infra is cattle: ephemeral
      - Services are dynamic
      - Push to Cloud
      - Metrics emitted: Pod Metrics, Deployment Metrics, ReplicaSet Metrics, StatefulSet Metrics, DaemonSet Metrics, Job Metrics, Service Metrics, Namespace Metrics, Node Metrics
  5. Why this matters today?
      - Volume
      - Velocity
      - Variety
      - Complexity
      - C.O.S.T.: Cardinality, Operations, Scale, Toil
  7. Outcomes we want
      - To not have downtimes
      - To mitigate problems quickly
      - To debug a failure
      - To know how the system is behaving in real time
      - To correlate an outage to a hardware failure
      - To find anomalies and patterns
      - To trace a payment failure
      - To find unknown failures before they happen
      - To prevent impact on customer experience and the business
  8. Questions we ask
      - What is wrong?
      - Did we change anything?
      - What do we do so this doesn’t repeat?
  9. Answers we want
      - Know
      - Communicate
      - Recover
      - Analyse
      - Debug
      - Root cause
  10. Answers we want
      - System Health
      - Quick Decisions
      - Time
      - Root Cause
      - Testing
      - Correctness
  11. Logs
      - Can be literally anything -> Unstructured logs
      - Standard programs -> Structured logs
  12. Logs
      - Getting Started ✅
      - Adoption ✅
      - Debugging ✅
      - Relationships 🥲
      - Volume 🥲
      - Standardisation 🥲
      - Health 🥲
      - System insights 🥲
  13. Metrics
      - Getting Started 😐
      - Adoption ✅
      - Debugging 🥲
      - Relationships 🥲
      - Volume ✅
      - Standardisation ✅
      - Health ✅
      - System insights ✅
  14. Traces
      - Getting Started 😐
      - Adoption 🥲
      - Debugging ✅
      - Relationships ✅
      - Volume 😐
      - Standardisation ✅
      - Health 🥲
      - System insights 🥲
  15. Events
      - Structured logs?
      - Schema based?
      - Domain Events
      - Easier to adopt?
      - Can unlock correlation
      - Dimensionality
  16. Events
      - Getting Started 😐
      - Adoption 🥲
      - Debugging ✅
      - Relationships 🥲
      - Volume ✅
      - Standardisation 🥲
      - Health ✅
      - System insights ✅
  17. Answers we want
      - Know
      - Communicate
      - Recover
      - Analyse
      - Debug
      - Root cause
      - Grouped as: Real Time, Post Facto
  18. Common Goals, Common Ground
      - Continue to leverage open source innovations
      - Multiple options
      - No truck factor
  20. OpenTelemetry
      - Protocol, Specification and SDKs
      - Vendor neutral
      - Application monitoring
      - Automatic instrumentation, especially for Java
      - Adoption from On-Prem to Cloud
      - Convert legacy app metrics to OpenTelemetry format
  21. 80% of Telemetry data is unused
      - Yet, we store it and pay for the data that is unused!
      - Slow dashboards, concurrent access woes
      - No real time alerting
  22. Control Levers
      - Treat workloads differently
      - Tiers
      - Policies
      - Declarative Observability
  23. Control Levers
      - Treat workloads differently
      - Tiers
      - Policies
      - Declarative Observability
      - Data Engineering
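
The "Control Levers" slides name tiers and policies without showing what a declarative policy might look like. The sketch below is one possible reading, not something taken from the talk: the `TierPolicy` type, the tier names, the metric name patterns, and the retention values are all hypothetical. It shows workloads being treated differently by matching each metric name against an ordered list of policies, where the first match wins.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical declarative policy: a metric name glob mapped to a tier."""
    pattern: str          # shell-style glob matched against the metric name
    tier: str             # illustrative tier label
    retention_days: int   # illustrative retention for that tier


# Assumed example policies, evaluated top to bottom; the first match wins.
POLICIES = [
    TierPolicy("payment_*", "critical", 90),
    TierPolicy("kube_pod_*", "standard", 15),
    TierPolicy("*", "bulk", 2),  # catch-all for everything else
]


def classify(metric_name, policies=POLICIES):
    """Return the first policy whose pattern matches the metric name."""
    for policy in policies:
        if fnmatchcase(metric_name, policy.pattern):
            return policy
    raise LookupError(f"no policy matches {metric_name!r}")


if __name__ == "__main__":
    for name in ("payment_failures_total", "kube_pod_status_phase", "go_goroutines"):
        policy = classify(name)
        print(f"{name}: tier={policy.tier}, retention={policy.retention_days}d")
```

Keeping the policies as plain data rather than code paths is what makes the approach declarative: changing how a workload is treated means editing the list, not the routing logic.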