
A Field Guide to Observability - DevOpsDays MSP 2017

aneel
July 26, 2017


Presented at DevOpsDays Minneapolis 2017.

More info and video in blog post: TBD

Operating with insufficient data is a failing proposition; you can’t operate what you can’t measure. So we have to measure things, and measurement starts early in the development lifecycle.

Let’s walk through a brief field guide to the theory and practice of observability.





  1. Field Guide to Observability Aneel Lakhani - Honeycomb

  2. Whothe****isthis

  3. None
  4. Non sequitur!

  5. None
  6. None
  7. Observawhatthehellareyoutalkingabout

  8. Observability [Wikipedia formal]
    A system is said to be observable if, for any possible sequence of state and control vectors, the current state can be determined in finite time using only the outputs.
  9. Observability [Wikipedia not formal]
    One can determine the behavior of the entire system from the system's outputs. If a system is not observable, this means that the current values of some of its states cannot be determined through output sensors.
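For the linear time-invariant case, the formal definition reduces to a well-known rank test. This is textbook control theory rather than anything on the slides, included only to make the definition concrete. The symbols A, B, C, D, x, u, y and n are introduced here for illustration.

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

In words: if the outputs y (given the known inputs u) carry enough independent information to pin down all n components of the state x, the system is observable.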
  10. Observability [Sam Stokes]
    Observability is a combination of a property of the system--it is observable--and of our (and our tools') ability with regards to that property--we can observe it.
  11. Observability [Me]
    Observability is a combination of a property of the system--it is observable--and of our (and our tools') actions with regards to that property--we observe...
    ...that leads to a sufficiently, contextually accurate understanding of the state(s) of the system to orient it within its operational context, decide on a course of action, and act on it to some ends (further observation, remediation, or improvement).
  12. ObservawhythehellshouldIcare

  13. When you have observability
    - You know what your code/service/database/middleware/cloud/system looks like when it’s doing what it’s supposed to do
    - You know what it looks like when it’s not doing what it’s supposed to do
    - You ask questions and get answers about the behavior of everything in your stack
    - If you can’t get answers now, you can enable yourself to get answers later
  14. When you don’t
    - You cannot distinguish between operational and non-operational behavior
    - You cannot distinguish between optimal and degraded performance
    - You cannot discover causes of problems inside your stack
    - You cannot discover causes of problems outside of your stack
  15. Observawhatisitmadeof

  16. Observability is not any of...
    Logs, Metrics, Traces, Events, Time series, Exceptions, Alerts, Logging, Monitoring, Alerting, Dashboarding, Math-ing, Searching
  17. Observability is all of...
    - Instrumentation: Something happens and we emit information about it
    - Transportation: The information is transported to an observer or observation tool
    - Observation: We make sense of it
  18. Instrumentation
    - The scaffolding for generating information (telemetry and events) from your systems/services/databases/apps/etc
    - Telemetry is the measured output, or results, of some events: usually numerical stats of state or a change in state or time in state--counters & gauges & timers
    - Events are... things that happen--[un]structured data, mixed text and numerics
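To make the telemetry/events split concrete, here is a minimal sketch of instrumentation scaffolding. The `Telemetry` class, its method names, and the `handle_login` function are all hypothetical, made up for illustration. They are not from the talk or from any particular library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory instrumentation helper (not a real library)."""

    def __init__(self):
        self.counters = defaultdict(int)  # how many times something happened
        self.gauges = {}                  # most recent value of some state
        self.timers = defaultdict(list)   # observed durations, in seconds
        self.events = []                  # structured records of things that happened

    def count(self, name, n=1):
        self.counters[name] += n

    def gauge(self, name, value):
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raised.
            self.timers[name].append(time.perf_counter() - start)

    def event(self, message, **fields):
        self.events.append({"time": time.time(), "message": message, **fields})


telemetry = Telemetry()


def handle_login(user):
    """Hypothetical request handler, instrumented as it does its work."""
    with telemetry.timer("login.duration"):
        telemetry.count("logins")
        telemetry.event("login", user=user)


handle_login("some_guid")
telemetry.gauge("queue.length", 3)
```

The counter, gauge, and timer are the numerical telemetry; the event is the thing that happened, with whatever fields were on hand at the time.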
  19. Transportation
    How we encode that information-- metrics | logs | traces | json | ssf | etc
    ...and transmit it to our various tools and eyeballs-- stdout | message queues | data pipelines | etc
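As one possible pairing from those lists, the sketch below encodes an event as a line of JSON and transmits it over a stream, with stdout as the default. The `encode` and `transmit` names are hypothetical, and an in-memory buffer stands in for the receiving tool.

```python
import io
import json
import sys


def encode(event):
    """Encode one event as a single line of JSON (one choice among many)."""
    return json.dumps(event, sort_keys=True)


def transmit(event, stream=sys.stdout):
    """Write the encoded event to a stream; stdout stands in for any transport."""
    stream.write(encode(event) + "\n")


# An in-memory buffer plays the part of the observer so the round trip is visible.
buffer = io.StringIO()
transmit({"time": "2017-06-14T20:44:04+00:00", "Message": "oops"}, stream=buffer)
received = json.loads(buffer.getvalue())
```

Swapping the stream for a message queue client or a pipeline writer would change the transport without touching the instrumentation that produced the event.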
  20. Observation
    Observation is... well, what it sounds like-- search, group, count, aggregate, filter, derive, transform, calculate, pivot, visualize, compare
    ...and can happen in any variety of interfaces-- term, loggregator, tsdb, monitor, dashboard, ad nauseam
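A few of those verbs (filter, group, count, aggregate) can be sketched over a handful of events. The events below are made up for illustration; the `status` and `duration_ms` fields are assumptions, not fields shown in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical events for a hypothetical service.
events = [
    {"Service": "dogpics", "Endpoint": "/dogs/search", "status": 200, "duration_ms": 12},
    {"Service": "dogpics", "Endpoint": "/dogs/search", "status": 500, "duration_ms": 340},
    {"Service": "dogpics", "Endpoint": "/dogs/upload", "status": 200, "duration_ms": 85},
    {"Service": "dogpics", "Endpoint": "/dogs/search", "status": 500, "duration_ms": 410},
]

# Filter: keep only the failures.
errors = [e for e in events if e["status"] >= 500]

# Group and count: failures per endpoint.
errors_by_endpoint = Counter(e["Endpoint"] for e in errors)

# Aggregate: mean duration per endpoint.
durations = defaultdict(list)
for e in events:
    durations[e["Endpoint"]].append(e["duration_ms"])
mean_duration = {endpoint: sum(d) / len(d) for endpoint, d in durations.items()}
```

The same questions could just as well be asked in a terminal, a time series database, or a dashboard; the interface matters less than being able to ask them.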
  21. There is no ONE way

  22. There is no ONE tool

  23. There is only YOUR way

  24. There is only YOUR toolchain

  25. Metrics, Events, Profiles, Traces, Errors, Checks, Users

  26. CPU stats; Mem stats; Logins / time; Queue length; Jenufa login from CZ; ASG +instance; Canary deployed; Code path timing; Data struct cache coherency; User login dependency tree; Timing per service by shopping cart id; Build failed; Kernel panic; Auth exception thrown; Ping failed; HTTP 200 OK; Event submitted was able to be read; Slow web experience reports; Support request trends
  27. Context is everything

  28. Context is the means by which we orient
    Which service? Which machines? Which geography? Which data centers? Which users? Which API endpoints? What time? What duration? What severity level? What else happened then? What happened before? What happened after? Which cluster? Which browser version? Which shard? Which OS version? Which SDK version? Which deploy number?
  29. { "time": "2017-06-14T20:44:04+00:00", "location": "hash_blorp.go:68", "Message": "oops" }

    {
      "time": "2017-06-14T20:44:04+00:00",
      "Location": {
        "Region": "us-east-1",
        "Zone": "us-east-1a",
        "Host": "",
        "Service": "dogpics",
        "Version": "1.0.0",
        "Endpoint": "/dogs/search",
        "Params": "good dogs"
      },
      "Message": "oops"
    }

    {
      "time": "2017-06-14T20:44:04+00:00",
      "Location": {
        "Region": "us-east-1",
        "Zone": "us-east-1a",
        "Host": "",
        "Service": "dogpics",
        "Version": "1.0.0",
        "Endpoint": "/dogs/search",
        "Params": "good dogs"
      },
      "User": {
        "User_id": "some_guid",
        "Originating_Host": "",
        "Remote_Host": "",
        "User_Agent": "dogsview"
      },
      "Message": "oops"
    }
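One way to end up with events like the last one is to attach context at the point of emission instead of reconstructing it later. The sketch below assumes a hypothetical `build_event` helper and a `SERVICE_CONTEXT` dict; the field names and values mirror the slide, and in a real service they would presumably come from configuration and the incoming request.

```python
import json

# Static context gathered once at startup (values copied from the slide's example).
SERVICE_CONTEXT = {
    "Region": "us-east-1",
    "Zone": "us-east-1a",
    "Service": "dogpics",
    "Version": "1.0.0",
}


def build_event(message, time, endpoint, params, user=None):
    """Return an event dict carrying location context and, if known, user context."""
    event = {
        "time": time,
        # Merge the static service context with per-request details.
        "Location": {**SERVICE_CONTEXT, "Endpoint": endpoint, "Params": params},
    }
    if user is not None:
        event["User"] = dict(user)
    event["Message"] = message
    return event


event = build_event(
    "oops",
    time="2017-06-14T20:44:04+00:00",
    endpoint="/dogs/search",
    params="good dogs",
    user={"User_id": "some_guid", "User_Agent": "dogsview"},
)
line = json.dumps(event)
```

With the context riding along in every event, "which service, which endpoint, which user?" becomes a filter or a group-by instead of an archaeology project.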
  30. Build observable systems

  31. “It’ll get you where you’re going” - Cheslock

  32. Thanks! aneel@honeycomb.io | @aneel Try Honeycomb! honeycomb.io/signup

  33. Reading material
    - John Allspaw: An Open Letter To Monitoring/Metrics/Alerting Companies
    - Sam Stokes: Build Observable Systems
    - Cindy Sridharan: Logs and Metrics
    - Mark McBride: Moar Context Better Events
    - Ben Treynor @ SRECon14: Keys to SRE
    - Simple Sensor Format
    - Wikipedia: Observability