A Field Guide to Observability - DevOpsDays MSP 2017

aneel
July 26, 2017

Presented at DevOpsDays Minneapolis 2017.

More info and video in blog post: TBD

Operating with insufficient data is a failing proposition; you can’t operate what you can’t measure. So we have to measure things, and measurement starts early in the development lifecycle.

Let’s walk through a brief field guide to the theory and practice of observability.

Transcript

  1. Observability [Wikipedia, formal]
    A system is said to be observable if, for any possible sequence of state and control vectors, the current state can be determined in finite time using only the outputs.
  2. Observability [Wikipedia, not formal]
    One can determine the behavior of the entire system from the system's outputs. If a system is not observable, this means that the current values of some of its states cannot be determined through output sensors.
  3. Observability [Sam Stokes]
    Observability is a combination of a property of the system--it is observable--and of our (and our tools') ability with regards to that property--we can observe it.
  4. Observability [Me]
    Observability is a combination of a property of the system--it is observable--and of our (and our tools') actions with regards to that property--we observe...
    ...that leads to a sufficiently contextually accurate understanding of the state(s) of the system to orient it within its operational context, decide on a course of action, and act on it to some ends (further observation, remediation, or improvement).
  5. When you have observability
    - You know what your code/service/database/middleware/cloud/system looks like when it’s doing what it’s supposed to do
    - You know what it looks like when it’s not doing what it’s supposed to do
    - You ask questions and get answers about the behavior of everything in your stack
    - If you can’t get answers now, you can enable yourself to get answers later
  6. When you don’t
    - You cannot distinguish between operational and non-operational behavior
    - You cannot distinguish between optimal and degraded performance
    - You cannot discover causes of problems inside your stack
    - You cannot discover causes of problems outside of your stack
  7. Observability is not any of...
    Logs, Metrics, Traces, Events, Time series, Exceptions, Alerts, Logging, Monitoring, Alerting, Dashboarding, Math-ing, Searching
  8. Observability is all of...
    - Instrumentation: Something happens and we emit information about it
    - Transportation: The information is transported to an observer or observation tool
    - Observation: We make sense of it
  9. Instrumentation
    - The scaffolding for generating information (telemetry and events) from your systems/services/databases/apps/etc.
    - Telemetry is the measured output, or results, of some events: usually numerical stats of state, or a change in state, or time in state--counters & gauges & timers
    - Events are... things that happen--[un]structured data, mixed text and numerics
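To make the counters, gauges, timers, and events on this slide concrete, here is a minimal sketch of instrumentation scaffolding. The `Telemetry` class, its method names, the metric names, and the `handle_login` function are all hypothetical and invented for illustration; they are not from the talk or from any particular library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory registry; a real system would ship this data elsewhere."""

    def __init__(self):
        self.counters = defaultdict(int)   # how many times something happened
        self.gauges = {}                   # the current value of some state
        self.timers = defaultdict(list)    # how long something took, in seconds
        self.events = []                   # things that happened, as structured data

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def gauge(self, name, value):
        self.gauges[name] = value

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.timers[name].append(time.perf_counter() - start)

    def event(self, message, **fields):
        self.events.append({"time": time.time(), "message": message, **fields})


telemetry = Telemetry()


def handle_login(user_id, queue):
    """Assumed example code path that emits telemetry and an event."""
    with telemetry.timed("login.duration_seconds"):
        telemetry.incr("login.attempts")
        telemetry.gauge("queue.length", len(queue))
        telemetry.event("user logged in", user_id=user_id)


handle_login("some_guid", queue=["job-1", "job-2"])
```

After the call, `telemetry.counters`, `telemetry.gauges`, and `telemetry.timers` hold the numerical telemetry, while `telemetry.events` holds the mixed text-and-numeric record of what happened.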
  10. Transportation
    - How we encode that information: metrics | logs | traces | json | ssf | etc.
    - ...and transmit it to our various tools and eyeballs: stdout | message queues | data pipelines | etc.
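As one possible illustration of the encode-then-transmit split, the sketch below encodes an event as a single line of JSON and writes it to a stream, with stdout as the default. The function names `encode_event` and `transmit` are assumptions made for this example; a message queue or data pipeline would replace the stream in practice.

```python
import io
import json
import sys


def encode_event(event):
    """Encode: serialize the event as one line of JSON with stable key order."""
    return json.dumps(event, sort_keys=True)


def transmit(event, stream=sys.stdout):
    """Transmit: write the encoded event to a stream (stdout unless told otherwise)."""
    stream.write(encode_event(event) + "\n")
    stream.flush()


# An in-memory buffer stands in for the transport here so the round trip can be checked.
buffer = io.StringIO()
transmit({"time": "2017-06-14T20:44:04+00:00", "message": "oops"}, stream=buffer)
decoded = json.loads(buffer.getvalue())
```

Keeping encoding separate from transmission means the same event can be sent over a different transport without changing how it is serialized.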
  11. Observation
    - Observation is... well, what it sounds like: search, group, count, aggregate, filter, derive, transform, calculate, pivot, visualize, compare
    - ...and can happen in any variety of interfaces: term, loggregator, tsdb, monitor, dashboard, ad nauseam
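A few of the verbs on this slide (filter, group, count, aggregate) can be sketched over a handful of events using only the standard library. The sample events, field names, and values below are made up for illustration and are not data from the talk.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical events, as they might look once transported to an observer.
events = [
    {"service": "dogpics", "endpoint": "/dogs/search", "status": 200, "duration_ms": 42},
    {"service": "dogpics", "endpoint": "/dogs/search", "status": 500, "duration_ms": 910},
    {"service": "dogpics", "endpoint": "/dogs/upload", "status": 200, "duration_ms": 120},
    {"service": "auth", "endpoint": "/login", "status": 500, "duration_ms": 15},
]

# Filter: keep only server errors.
errors = [e for e in events if e["status"] >= 500]

# Group and count: errors per service.
errors_by_service = Counter(e["service"] for e in errors)

# Group and aggregate: mean duration per endpoint.
durations = defaultdict(list)
for e in events:
    durations[e["endpoint"]].append(e["duration_ms"])
mean_duration_ms = {endpoint: mean(values) for endpoint, values in durations.items()}
```

The same operations are what a log search tool, time series database, or dashboard performs at scale; the interface changes, the verbs do not.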
  12. CPU stats; Mem stats; Logins / time; Queue length; Jenufa login from CZ; ASG +instance; Canary deployed; Code path timing; Data struct cache coherency; User login dependency tree; Timing per service by shopping cart id; Build failed; Kernel panic; Auth exception thrown; Ping failed; HTTP 200 OK; Event submitted was able to be read; Slow web experience reports; Support request trends
  13. Context is the means by which we orient
    - Which service? Which machines? Which geography? Which data centers? Which users? Which API endpoints?
    - What time? What duration? What severity level? What else happened then? What happened before? What happened after?
    - Which cluster? Which browser version? Which shard? Which OS version? Which SDK version? Which deploy number?
  14. { "time": "2017-06-14T20:44:04+00:00", "location": "hash_blorp.go:68", "Message": "oops" }

    { "time": "2017-06-14T20:44:04+00:00",
      "Location": { "Region": "us-east-1", "Zone": "us-east-1a", "Host": "10.0.0.4", "Service": "dogpics", "Version": "1.0.0", "Endpoint": "/dogs/search", "Params": "good dogs" },
      "Message": "oops" }

    { "time": "2017-06-14T20:44:04+00:00",
      "Location": { "Region": "us-east-1", "Zone": "us-east-1a", "Host": "10.0.0.4", "Service": "dogpics", "Version": "1.0.0", "Endpoint": "/dogs/search", "Params": "good dogs" },
      "User": { "User_id": "some_guid", "Originating_Host": "94.100.180.199", "Remote_Host": "10.0.0.35", "User_Agent": "dogsview" },
      "Message": "oops" }
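The three events on this slide carry the same message with progressively more context. One way that enrichment might be done in code is sketched below: static service context is merged with per-request and per-user context when the event is built. The `SERVICE_CONTEXT` constant and the `build_event` function are hypothetical names introduced for this example; the field names and values are copied from the slide.

```python
import json

# Context known at process start (assumed to be loaded from config or environment).
SERVICE_CONTEXT = {
    "Region": "us-east-1",
    "Zone": "us-east-1a",
    "Host": "10.0.0.4",
    "Service": "dogpics",
    "Version": "1.0.0",
}


def build_event(message, timestamp, endpoint, params, user=None):
    """Attach service, request, and (optionally) user context to a message."""
    event = {
        "time": timestamp,
        "Location": {**SERVICE_CONTEXT, "Endpoint": endpoint, "Params": params},
        "Message": message,
    }
    if user is not None:
        event["User"] = dict(user)  # copy so later changes by the caller do not leak in
    return event


event = build_event(
    "oops",
    "2017-06-14T20:44:04+00:00",
    endpoint="/dogs/search",
    params="good dogs",
    user={"User_id": "some_guid", "User_Agent": "dogsview"},
)
print(json.dumps(event, indent=2))
```

With the context attached at emit time, the questions from the previous slide (which service, which endpoint, which user) can be answered from the event itself rather than reconstructed afterwards.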
  15. Reading material
    - John Allspaw: An Open Letter To Monitoring/Metrics/Alerting Companies
    - Sam Stokes: Build Observable Systems
    - Cindy Sridharan: Logs and Metrics
    - Mark McBride: Moar Context Better Events
    - Ben Treynor @ SRECon14: Keys to SRE
    - Simple Sensor Format
    - Wikipedia: Observability