A Field Guide to Observability - DevOpsDays MSP 2017

Field Guide to Observability
Aneel Lakhani - Honeycomb

Whothe****isthis

Non sequitur!

Observawhatthehellareyoutalkingabout

Observability [Wikipedia formal]
A system is said to be observable if, for any possible sequence of state and control vectors, the current state can be determined in finite time using only the outputs.
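
For the special case of a linear time-invariant system, which the slide does not spell out, the formal definition reduces to a standard rank test from control theory. The symbols A, B, C, D, x, u, y, and n below are the usual textbook names, introduced here for illustration only.

```latex
% Linear time-invariant system with state x in R^n, input u, output y
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)

% The system is observable if and only if the observability matrix has full column rank n
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n
```

Read informally: if the outputs and their evolution carry enough independent information to pin down every state variable, nothing inside the system is hidden from the observer.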

Observability [Wikipedia not formal]
One can determine the behavior of the entire system from the system's outputs. If a system is not observable, this means that the current values of some of its states cannot be determined through output sensors.

Observability [Sam Stokes]
Observability is a combination of a property of the system--it is observable--and of our (and our tools') ability with regard to that property--we can observe it

Observability [Me]
Observability is a combination of a property of the system--it is observable--and of our (and our tools') actions with regard to that property--we observe...
...that leads to a sufficiently contextually accurate understanding of the state(s) of the system to orient it within its operational context, decide on a course of action, and act on it to some end (further observation, remediation, or improvement)

ObservawhythehellshouldIcare

When you have observability
- You know what your code/service/database/middleware/cloud/system looks like when it’s doing what it’s supposed to do
- You know what it looks like when it’s not doing what it’s supposed to do
- You ask questions and get answers about the behavior of everything in your stack
- If you can’t get answers now, you can enable yourself to get answers later

When you don’t
- You cannot distinguish between operational and non-operational behavior
- You cannot distinguish between optimal and degraded performance
- You cannot discover causes of problems inside your stack
- You cannot discover causes of problems outside of your stack

Observawhatisitmadeof

Observability is not any of...
Logs, Metrics, Traces, Events, Time series, Exceptions, Alerts
Logging, Monitoring, Alerting, Dashboarding, Math-ing, Searching

Observability is all of...
- Instrumentation: Something happens and we emit information about it
- Transportation: The information is transported to an observer or observation tool
- Observation: We make sense of it

Instrumentation
- The scaffolding for generating information (telemetry and events) from your systems/services/databases/apps/etc.
- Telemetry is the measured output, or results, of some events: usually numerical stats of state, a change in state, or time in state--counters & gauges & timers
- Events are... things that happen--[un]structured data, mixed text and numerics
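
To make the telemetry/event split concrete, here is a minimal sketch in Python. The `Telemetry` class, `make_event` helper, and all metric and field names are hypothetical, invented for illustration; they are not from the talk or from any particular library.

```python
import json
import time
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory registry for counters, gauges, and timers."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timings = {}

    def incr(self, name, amount=1):
        # Counter: a running total of how often something happened.
        self.counters[name] = self.counters.get(name, 0) + amount

    def gauge(self, name, value):
        # Gauge: the most recent reading of some state.
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        # Timer: how long a block of work took, in seconds.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.timings.setdefault(name, []).append(elapsed)


def make_event(message, **fields):
    """Event: a thing that happened, recorded as structured data."""
    return {"time": time.time(), "message": message, **fields}


telemetry = Telemetry()
events = []

with telemetry.timer("handle_login"):
    telemetry.incr("logins")
    telemetry.gauge("queue_length", 3)
    events.append(make_event("login", user="jenufa", country="CZ"))

print(json.dumps(events[0]))
```

The numbers in `telemetry` summarize what happened; the entries in `events` keep the details of each individual occurrence.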

Observation
- Observation is... well, what it sounds like--search, group, count, aggregate, filter, derive, transform, calculate, pivot, visualize, compare
- ...and can happen in any variety of interfaces--term, loggregator, tsdb, monitor, dashboard, ad nauseam
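
A few of those verbs (filter, group, count, aggregate) can be sketched over a handful of structured events. The events, field names, and values below are made up for illustration; a real toolchain would do the same operations at much larger scale.

```python
from collections import Counter

# Hypothetical events; every field name and value here is an assumption.
events = [
    {"service": "dogpics", "endpoint": "/dogs/search", "status": 500, "duration_ms": 900},
    {"service": "dogpics", "endpoint": "/dogs/search", "status": 200, "duration_ms": 120},
    {"service": "dogpics", "endpoint": "/dogs/upload", "status": 500, "duration_ms": 450},
    {"service": "catpics", "endpoint": "/cats/search", "status": 200, "duration_ms": 80},
]

# Filter: keep only the failures.
errors = [e for e in events if e["status"] >= 500]

# Group and count: failures per endpoint.
errors_by_endpoint = Counter(e["endpoint"] for e in errors)

# Aggregate: mean duration per service.
totals = {}
for e in events:
    total, n = totals.get(e["service"], (0, 0))
    totals[e["service"]] = (total + e["duration_ms"], n + 1)
mean_duration_ms = {svc: total / n for svc, (total, n) in totals.items()}

print(errors_by_endpoint)
print(mean_duration_ms)
```

Each step is a question asked of the same raw events, which is why keeping the events structured matters more than which interface does the asking.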

There is no ONE way

There is no ONE tool

There is only YOUR way

There is only YOUR toolchain

Metrics Events Profiles Traces Errors Checks Users

Metrics: CPU stats, Mem stats, Logins / time, Queue length
Events: Jenufa login from CZ, ASG +instance, Canary deployed
Profiles: Code path timing, Data struct cache coherency
Traces: User login dependency tree, Timing per service by shopping cart id
Errors: Build failed, Kernel panic, Auth exception thrown
Checks: Ping failed, HTTP 200 OK, Event submitted was able to be read
Users: Slow web experience reports, Support request trends

Context is everything

Context is the means by which we orient
Which service? Which machines? Which geography? Which data centers? Which users? Which API endpoints?
What time? What duration? What severity level? What else happened then? What happened before? What happened after?
Which cluster? Which browser version? Which shard? Which OS version? Which SDK version? Which deploy number?

{ "time": "2017-06-14T20:44:04+00:00", "location": "hash_blorp.go:68", "Message": "oops" }

{
  "time": "2017-06-14T20:44:04+00:00",
  "Location": { "Region": "us-east-1", "Zone": "us-east-1a", "Host": "10.0.0.4", "Service": "dogpics", "Version": "1.0.0", "Endpoint": "/dogs/search", "Params": "good dogs" },
  "Message": "oops"
}

{
  "time": "2017-06-14T20:44:04+00:00",
  "Location": { "Region": "us-east-1", "Zone": "us-east-1a", "Host": "10.0.0.4", "Service": "dogpics", "Version": "1.0.0", "Endpoint": "/dogs/search", "Params": "good dogs" },
  "User": { "User_id": "some_guid", "Originating_Host": "94.100.180.199", "Remote_Host": "10.0.0.35", "User_Agent": "dogsview" },
  "Message": "oops"
}
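
One way to get from the bare event to the context-rich one is to attach context at the point of emission. The sketch below reuses the field names and values from the slide; the `with_context` helper is hypothetical and only illustrates the idea.

```python
import json


def with_context(event, **context):
    """Return a copy of `event` with extra context fields merged in."""
    enriched = dict(event)
    enriched.update(context)
    return enriched


# The bare event: we know that something went wrong, and when.
bare = {"time": "2017-06-14T20:44:04+00:00", "Message": "oops"}

# Context the service already knows about itself and about the request.
location = {
    "Region": "us-east-1",
    "Zone": "us-east-1a",
    "Host": "10.0.0.4",
    "Service": "dogpics",
    "Version": "1.0.0",
    "Endpoint": "/dogs/search",
    "Params": "good dogs",
}
user = {
    "User_id": "some_guid",
    "Originating_Host": "94.100.180.199",
    "Remote_Host": "10.0.0.35",
    "User_Agent": "dogsview",
}

event = with_context(bare, Location=location, User=user)
print(json.dumps(event, indent=2))
```

With the context attached, the same "oops" can be filtered by service, version, endpoint, or user without going back to change the code.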

Build observable systems

“It’ll get you where you’re going” - Cheslock

Thanks! [email protected] | @aneel Try Honeycomb! honeycomb.io/signup

Reading material
- John Allspaw: An Open Letter To Monitoring/Metrics/Alerting Companies
- Sam Stokes: Build Observable Systems
- Cindy Sridharan: Logs and Metrics
- Mark McBride: Moar Context Better Events
- Ben Treynor @ SRECon14: Keys to SRE
- Simple Sensor Format
- Wikipedia: Observability


Transportation How we encode that information-- metrics | logs |
