The hidden cost of instrumentation at Conf42 Devops 2023

Prathamesh Sonpatki

January 27, 2023

Transcript

  1. The hidden cost of instrumentation
    Prathamesh Sonpatki
    Last9.io
    Conf42 Devops 2023

  2. Instrumentation? 🤔 🤨

  3. Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?

  4. Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?
    - Service Level Agreements (SLAs)

  5. Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?
    - Service Level Agreements (SLAs)
    - A good night’s sleep 😴 💤

  6. 💡 “Hope is not a strategy!”
    https://sre.google/sre-book/introduction/

  7. 💡 The reliability mandate starts with instrumentation
    You can only improve what you measure.

  8. 🌈 Landscape of the Instrumentation

  9. 🌈 Landscape of the Instrumentation
    - Your application is not standalone

  10. 🌈 Landscape of the Instrumentation
    - Your application is not standalone
    - It’s actually a 🍔

  11. 🌈 Landscape of the Instrumentation
    - Your application is not standalone
    - It’s actually a 🍔
    - The bun (cloud/VM)
    - The patty (application)
    - Along with mayo (RDS/DB)
    - And ketchup (third-party services)

  12. 🌈 Landscape of the Instrumentation
    - Your application is not standalone
    - It’s actually a 🍔
    - The bun (cloud/VM)
    - The patty (application)
    - Along with mayo (RDS/DB)
    - And ketchup (third-party services)
    “Full-stack observability” FTW!

  13. 💡 Modern applications are like living organisms that grow and shrink in all possible directions.
    And also communicate with their friends!

  14. Bow in the Temple of Observability

  15. Bow in the Temple of Observability
    - Logs
    - Metrics
    - Traces

  16. Bow in the Temple of Observability
    - Logs
    - Metrics
    - Traces
    - Profiling
    - Events (External)
    - Exceptions
    https://medium.com/@YuriShkuro/temple-six-pillars-of-observability-4ac3e3deb402

  17. Bow in the Temple of Observability
    - Logs
    - Metrics
    - Traces
    - Profiling
    - Events (External)
    - Exceptions
    How many people use more than three of these at the same time?

  18. There ain’t no such thing as a free lunch 💰

  19. Cardinality/Churn
    - Capturing monitoring data is easier than ever today.
    - A 3-node Kubernetes cluster with Prometheus will ship around 40k active series by default!
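
To make the cardinality point concrete, here is a rough sketch (not from the talk) of how label combinations multiply into time series. The metric, label names, and counts below are made-up assumptions; real numbers depend on your exporters and scrape configuration.

```python
from math import prod
from typing import Mapping


def series_upper_bound(label_cardinalities: Mapping[str, int]) -> int:
    """Worst-case number of time series for ONE metric.

    Each distinct combination of label values is its own series, so the
    upper bound is the product of the distinct values per label.
    """
    return prod(label_cardinalities.values())


# Hypothetical label counts for a single HTTP request counter.
http_requests = {"pod": 30, "endpoint": 20, "method": 4, "status": 5}
print(series_upper_bound(http_requests))  # 12000

# Adding one high-cardinality label multiplies everything
# (assumed: 1,000 distinct customers).
with_customer = {**http_requests, "customer_id": 1000}
print(series_upper_bound(with_customer))  # 12000000
```

Churn compounds this: if pod names change on every deploy, each rollout creates a fresh set of series for the `pod` label, even though the number of active series stays flat.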

  20. Operations
    - Run, manage and operate the instrumentation of the entire stack.
    - One more thing to operate besides the app.

  21. Scale
    - Make sure your instrumentation scales, not just your app.

  22. Tuning/Toil
    - Constant tuning of monitoring data
    - Resulting in engineering toil

  23. C.O.S.T. 💸
    Cardinality/Churn, Operations, Scale, Tuning/Toil

  24. But what is the hidden cost? 🤔

  25. Distraction!

  26. Distraction!
    - Reduce the Datadog monitoring cost, it is getting out of hand.

  27. Distraction!
    - Reduce the Datadog monitoring cost, it is getting out of hand.
    - Our logs have been piling up for the last 2 days. Can you please treat this as a P0 and contain them? Otherwise the vendor will charge us double.

  28. Distraction!
    - Reduce the Datadog monitoring cost, it is getting out of hand.
    - Our logs have been piling up for the last 2 days. Can you please treat this as a P0 and contain them? Otherwise the vendor will charge us double.
    - Today is New Year’s Day and our Prometheus is not getting the required metrics. Ignore the product release and just fix this for now; we are blind otherwise.

  29. 💡 A modern systems engineer has to maintain not just their software but also the instrumentation of that software.

  30. Fatigue!
    - Too much information desensitises us.
    - Duplicate alarms.
    - Focus on getting more and more data rather than on why we are collecting it.
    - Debugging becomes difficult because there is just too much data; we don’t know where to start.

  31. What’s the way out? 🏆

  32. What’s the way out? 🏆
    - Focus on the signals that give early warnings with the least amount of data

  33. What’s the way out? 🏆
    - Focus on the signals that give early warnings with the least amount of data
    - Think about the Apple Watch ⌚ - only vitals such as heart rate or sleep metrics.

  34. What’s the way out? 🏆
    - Focus on the signals that give early warnings with the least amount of data
    - Think about the Apple Watch ⌚ - only vitals such as heart rate or sleep metrics.
    - Detailed X-ray scans and ECG reports 📰 only once the vitals are off track.

  35. 💡 A threat of breaking is better.

  36. Plan of action

  37. Plan of action
    - Plan what to measure and why, not how
    - Emit (only what you need)
    - Observe and track (usage)
    - Prune (unused) aggressively
    - Store less, for less time.
    - Focus on what gives the best value for the money
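
The "observe and track usage" and "prune" steps can be sketched as a set difference between what is emitted and what is actually read. This is only an illustration: the metric names are invented, and it assumes you can export both lists (for example, from your metrics store and from your dashboard and alert definitions), which the talk does not prescribe.

```python
from typing import Iterable, List


def unused_metrics(emitted: Iterable[str], queried: Iterable[str]) -> List[str]:
    """Return metrics that are emitted but never read by a dashboard or alert."""
    return sorted(set(emitted) - set(queried))


# Hypothetical inventory: everything the services emit...
emitted = [
    "http_requests_total",
    "http_request_duration_seconds",
    "gc_pause_seconds",
    "cache_evictions_total",
]
# ...versus what dashboards and alert rules actually reference.
queried = ["http_requests_total", "http_request_duration_seconds"]

for name in unused_metrics(emitted, queried):
    print(f"candidate for pruning: {name}")
```

Treat the result as a list of candidates rather than a delete list; a metric nobody queries today may still be the one you need during the next incident.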

  38. A Better Plan of action
    - Access Policies
    - Data storage policies
    - Standards

  39. 💡 Less is better, because instrumentation is a liability.

  40. Thanks
    Prathamesh Sonpatki
    Last9.io
    Blog - https://prathamesh.tech
    Twitter - https://twitter.com/_cha1tanya
    Mastodon - https://hachyderm.io/@Prathamesh
    “Last9 of Reliability” Discord