A Field Guide to Observability - DevOpsDays MSP 2017

Slide 1

Slide 1 text

Field Guide to Observability
Aneel Lakhani - Honeycomb

Slide 2

Slide 2 text

Whothe****isthis

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Non sequitur!

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Observawhatthehellareyoutalkingabout

Slide 8

Slide 8 text

Observability [Wikipedia formal]
A system is said to be observable if, for any possible sequence of state and control vectors, the current state can be determined in finite time using only the outputs.
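
For linear time-invariant systems, the control-theory definition above reduces to a rank test (the Kalman rank condition): with an n x n state matrix A and an output matrix C, the system is observable when the matrix stacking C, CA, ..., CA^(n-1) has rank n. The sketch below is only an illustration of that test. The helper names and the two-state example (position and velocity) are assumptions made for this example, not part of the talk.

```python
from fractions import Fraction

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rank(m):
    """Rank via Gaussian elimination, using exact fractions to avoid rounding."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def is_observable(A, C):
    """Stack C, CA, ..., CA^(n-1) and check that the result has rank n."""
    n = len(A)
    block, stacked = C, []
    for _ in range(n):
        stacked.extend(block)
        block = matmul(block, A)
    return rank(stacked) == n

# Hypothetical two-state system: state = (position, velocity), d(position)/dt = velocity.
A = [[0, 1],
     [0, 0]]
position_sensor = [[1, 0]]  # output reports position only
velocity_sensor = [[0, 1]]  # output reports velocity only

position_only = is_observable(A, position_sensor)
velocity_only = is_observable(A, velocity_sensor)
```

With the position sensor, velocity can be inferred from how position changes, so the full state is recoverable. With only the velocity sensor, the starting position never shows up in the output, so the system is not observable.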

Slide 9

Slide 9 text

Observability [Wikipedia not formal]
One can determine the behavior of the entire system from the system's outputs.
If a system is not observable, this means that the current values of some of its states cannot be determined through output sensors.

Slide 10

Slide 10 text

Observability [Sam Stokes]
Observability is a combination of a property of the system--it is observable--and of our (and our tools') ability with regard to that property--we can observe it.

Slide 11

Slide 11 text

Observability [Me]
Observability is a combination of a property of the system--it is observable--and of our (and our tools') actions with regard to that property--we observe...
...that leads to a sufficiently contextually accurate understanding of the state(s) of the system to orient it within its operational context, decide on a course of action, and act on it to some end (further observation, remediation, or improvement).

Slide 12

Slide 12 text

ObservawhythehellshouldIcare

Slide 13

Slide 13 text

When you have observability
You know what your code/service/database/middleware/cloud/system looks like when it’s doing what it’s supposed to do
You know what it looks like when it’s not doing what it’s supposed to do
You ask questions and get answers about the behavior of everything in your stack
If you can’t get answers now, you can enable yourself to get answers later

Slide 14

Slide 14 text

When you don’t
You cannot distinguish between operational and non-operational behavior
You cannot distinguish between optimal and degraded performance
You cannot discover causes of problems inside your stack
You cannot discover causes of problems outside of your stack

Slide 15

Slide 15 text

Observawhatisitmadeof

Slide 16

Slide 16 text

Observability is not any of...
Logs, Metrics, Traces, Events, Time series, Exceptions, Alerts
Logging, Monitoring, Alerting, Dashboarding, Math-ing, Searching

Slide 17

Slide 17 text

Observability is all of...
Instrumentation: Something happens and we emit information about it
Transportation: The information is transported to an observer or observation tool
Observation: We make sense of it
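
The three stages can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not how any particular tool works: the function names are made up, and an in-process queue stands in for whatever actually carries the data (a log shipper, an agent, a network hop).

```python
import json
import queue
from collections import Counter

def instrument(transport, service, status):
    """Instrumentation: something happened, so emit information about it."""
    transport.put(json.dumps({"service": service, "status": status}))

def observe(transport):
    """Observation: drain whatever arrived and make sense of it (here, by counting)."""
    counts = Counter()
    while not transport.empty():
        event = json.loads(transport.get())
        counts[(event["service"], event["status"])] += 1
    return counts

# Transportation: the queue carries serialized events from emitter to observer.
transport = queue.Queue()
instrument(transport, "dogpics", 200)
instrument(transport, "dogpics", 500)
instrument(transport, "dogpics", 200)

summary = observe(transport)
```

Swapping the queue for a file, a socket, or a vendor API changes only the transportation stage. The emitting code and the sense-making code stay the same.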

Slide 18

Slide 18 text

Instrumentation
The scaffolding for generating information (telemetry and events) from your systems/services/databases/apps/etc.
Telemetry is the measured output, or results, of some events: usually numerical stats of state or a change in state or time in state--counters & gauges & timers
Events are… things that happen--[un]structured data, mixed text and numerics
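
To make the telemetry/event distinction concrete, here is a minimal in-memory sketch. The `Telemetry` class and its method names are hypothetical, invented for this example. Real instrumentation libraries differ in API and ship the data somewhere rather than holding it in dictionaries.

```python
import time
from contextlib import contextmanager

class Telemetry:
    """Toy instrumentation scaffold: counters, gauges, timers, and events."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timers = {}
        self.events = []

    def count(self, name, n=1):
        # Counter: a running total of how many times something happened.
        self.counters[name] = self.counters.get(name, 0) + n

    def gauge(self, name, value):
        # Gauge: the current value of some state; the latest reading wins.
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        # Timer: time spent in a block, recorded even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timers.setdefault(name, []).append(time.perf_counter() - start)

    def event(self, message, **fields):
        # Event: a thing that happened, as structured data mixing text and numbers.
        self.events.append({"time": time.time(), "message": message, **fields})

t = Telemetry()
t.count("logins")
t.count("logins")
t.gauge("queue_length", 7)
with t.timer("search"):
    sum(range(1000))
t.event("canary deployed", version="1.0.0")
```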

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Observation
Observation is... well, what it sounds like--search, group, count, aggregate, filter, derive, transform, calculate, pivot, visualize, compare
...and can happen in any variety of interfaces--term, loggregator, tsdb, monitor, dashboard, ad nauseam
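
A few of these verbs (filter, group, count, aggregate) can be shown against a handful of events. The sample data and field names below are invented for illustration. In practice the same operations would run in a log search tool, a time-series database, or a dashboard query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; the field names are assumptions for this sketch.
events = [
    {"endpoint": "/dogs/search", "status": 200, "duration_ms": 40},
    {"endpoint": "/dogs/search", "status": 500, "duration_ms": 900},
    {"endpoint": "/dogs/upload", "status": 200, "duration_ms": 120},
    {"endpoint": "/dogs/search", "status": 200, "duration_ms": 60},
]

def group_by(rows, key):
    """Group: bucket rows by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Filter: keep only server errors.
errors = [e for e in events if e["status"] >= 500]

# Group, count, aggregate: request count and mean latency per endpoint.
per_endpoint = {
    endpoint: {"count": len(rows), "mean_ms": mean(r["duration_ms"] for r in rows)}
    for endpoint, rows in group_by(events, "endpoint").items()
}
```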

Slide 21

Slide 21 text

There is no ONE way

Slide 22

Slide 22 text

There is no ONE tool

Slide 23

Slide 23 text

There is only YOUR way

Slide 24

Slide 24 text

There is only YOUR toolchain

Slide 25

Slide 25 text

Metrics Events Profiles Traces Errors Checks Users

Slide 26

Slide 26 text

Metrics: CPU stats, Mem stats, Logins / time, Queue length
Events: Jenufa login from CZ, ASG +instance, Canary deployed
Profiles: Code path timing, Data struct cache coherency
Traces: User login dependency tree, Timing per service by shopping cart id
Errors: Build failed, Kernel panic, Auth exception thrown
Checks: Ping failed, HTTP 200 OK, Event submitted was able to be read
Users: Slow web experience reports, Support request trends

Slide 27

Slide 27 text

Context is everything

Slide 28

Slide 28 text

Context is the means by which we orient
Which service? Which machines? Which geography? Which data centers? Which users? Which API endpoints?
What time? What duration? What severity level? What else happened then? What happened before? What happened after?
Which cluster? Which browser version? Which shard? Which OS version? Which SDK version? Which deploy number?

Slide 29

Slide 29 text

{
  "time": "2017-06-14T20:44:04+00:00",
  "location": "hash_blorp.go:68",
  "Message": "oops"
}

{
  "time": "2017-06-14T20:44:04+00:00",
  "Location": {
    "Region": "us-east-1",
    "Zone": "us-east-1a",
    "Host": "10.0.0.4",
    "Service": "dogpics",
    "Version": "1.0.0",
    "Endpoint": "/dogs/search",
    "Params": "good dogs"
  },
  "Message": "oops"
}

{
  "time": "2017-06-14T20:44:04+00:00",
  "Location": {
    "Region": "us-east-1",
    "Zone": "us-east-1a",
    "Host": "10.0.0.4",
    "Service": "dogpics",
    "Version": "1.0.0",
    "Endpoint": "/dogs/search",
    "Params": "good dogs"
  },
  "User": {
    "User_id": "some_guid",
    "Originating_Host": "94.100.180.199",
    "Remote_Host": "10.0.0.35",
    "User_Agent": "dogsview"
  },
  "Message": "oops"
}
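
One way to get from the bare event to the context-rich ones is to layer context on as it becomes known. The sketch below is an assumption about how that might look in code. `with_context` is a hypothetical helper, and the field values are borrowed from the slide.

```python
import json

def with_context(event, **layers):
    """Return a copy of the event with extra context layers merged in."""
    enriched = dict(event)
    enriched.update(layers)
    return enriched

# The bare event: we know something went wrong, but not where or for whom.
bare = {"time": "2017-06-14T20:44:04+00:00", "Message": "oops"}

# Add where it happened.
located = with_context(bare, Location={
    "Region": "us-east-1",
    "Service": "dogpics",
    "Version": "1.0.0",
    "Endpoint": "/dogs/search",
})

# Add who it happened to.
full = with_context(located, User={
    "User_id": "some_guid",
    "User_Agent": "dogsview",
})

# Serialize as one JSON object per line for whatever transport comes next.
line = json.dumps(full, sort_keys=True)
```

Each layer answers another of the orienting questions (which service, which version, which user) while the original message stays untouched.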

Slide 30

Slide 30 text

Build observable systems

Slide 31

Slide 31 text

“It’ll get you where you’re going” - Cheslock

Slide 32

Slide 32 text

Thanks!
[email protected] | @aneel
Try Honeycomb! honeycomb.io/signup

Slide 33

Slide 33 text

Reading material
- John Allspaw: An Open Letter To Monitoring/Metrics/Alerting Companies
- Sam Stokes: Build Observable Systems
- Cindy Sridharan: Logs and Metrics
- Mark McBride: Moar Context Better Events
- Ben Treynor @ SRECon14: Keys to SRE
- Simple Sensor Format
- Wikipedia: Observability