
Observability-Driven Development


observability and the development process, from gophercon iceland 2018

Charity Majors

June 01, 2018



Transcript

  1. Charity Majors @mipsytipsy — observability and the Development Process. hi. the title of this talk is "observability and the development process," and today I’d like to talk about how observability — a word associated with monitoring, with "over there in production-land" — is a mindset and a set of tools that can benefit us not just after, but before and during our software development process. ✳ i’m also a cofounder at honeycomb, which is focused on empowering engineering teams to understand their software in production. as a disclaimer — while i’ll be sharing some examples about how we dogfood to build honeycomb, the philosophies and techniques are generally applicable and should be transferable to your existing tools.
  2. i’m going to talk about something new today. i came

    here on purpose to do this. there’s something that’s been coming together in my brain for the past couple months, and i wanted to talk about it to a bunch of cutting edge engineers who would get it. the topic is ODD, observability driven development. I want to talk about the evolution i’m seeing underway from TDD to ODD, and why it matters, and how you do it. and what other trends it dovetails with and amplifies. it’s been a phenomenon in search of a name.
  3. 1 - write test 2 - write code that passes test. let’s review. TDD is a methodology that says you write a failing test, write the code that makes it pass, then refactor.
  4. 1 - write test 2 - write code that passes test / 1 - define reality spec 2 - write code to spec. over time, reality as defined by your test suite becomes richer and more all-encompassing. it contains things you forgot about long ago, things someone else knew and you never could have predicted. every failure you learn about, you wrap back into your test suites. this is awesome. TDD was a huge leap forward for us as an industry. it’s now table stakes, and thank goodness. but TDD has a couple of problems baked into the model, and for reasons we will explore, it is increasingly insufficient as a mental or development model.
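(for the gophers in the room, a minimal, hypothetical sketch of that loop — every name here is invented for illustration, and the whole thing compiles as a single test file:)

```go
// limiter_test.go — the TDD loop in miniature: write the failing test first,
// then just enough code to make it pass, then refactor.
package rate

import "testing"

// step 1: the test pins down the behavior we want before any code exists.
func TestLimiterAllowsTwoPerKey(t *testing.T) {
	l := NewLimiter(2)
	if !l.Allow("team-1") || !l.Allow("team-1") {
		t.Fatal("first two requests should be allowed")
	}
	if l.Allow("team-1") {
		t.Fatal("third request should be rejected")
	}
}

// step 2: the simplest implementation that makes the test green.
type Limiter struct {
	max  int
	seen map[string]int
}

func NewLimiter(max int) *Limiter {
	return &Limiter{max: max, seen: make(map[string]int)}
}

func (l *Limiter) Allow(key string) bool {
	l.seen[key]++
	return l.seen[key] <= l.max
}
```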
  5. TDD stops at your laptop’s edge. TDD stops where your laptop stops. it stops when you hit the network. that … sucks. even with a monolithic app it sucks, but with microservices? holy suck.
  6. monitoring : TDD :: observability : ODD. “what do those terms mean, charity??” TDD can only deal with known-unknowns. secondly, it’s fundamentally reactive and can only account for your known unknowns. for those of you who have been paying attention to the observability space, we talk a lot lately about this. let’s take a brief tour through these terms and why they matter.
  7. @grepory, 2016: “Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components, while the core functionality of monitoring systems has stagnated.” Monitoring(n). In greg’s talk he defined monitoring as “the action of observing and checking the behavior and outputs of a system and its components over time.” Let’s define some terms: monitoring, observability, metrics, etc. how many of you are ops on call? it’s no longer practically possible to curate and tend the paging alerts and flaps and false alarms of most moderately complex systems. it burns out your humans, and it doesn’t make your systems better.
  8. Observability(n): “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” — wikipedia … translate??!? Observability is a term taken from control theory, as the wikipedia page says. It means how much you can understand about the inner workings of your software from observing its outputs. I like to think of it as your ability to understand what is happening at any given time *without* attaching a debugger — gdb for a process, strace for a system, etc. how is this different from monitoring? Well, it’s a good idea to separate them in your mind, because there are a lot of really well established best practices for monitoring that are not true for observability, and i don’t want to muddy the water too much.
  9. Observability(n): Can you understand what’s happening inside your code and systems, simply by asking questions using your tools? Can you answer any new question you think of, or only the ones you prepared for? Having to ship new code every time you want to ask a new question … SUCKS.
  10. Monitoring: the system as black-box magic. Thresholds, alerts, watching the health of a system by checking for a long list of symptoms. Observability: the system is interrogatable and understandable at the right level of abstraction. Can you reason about the system by observing its outputs? observability is a term taken from control theory, and it’s gaining traction because it’s really about ultra-rich debugging, and it requires you to open the hood and tinker with the internals to make them report their state. Because of that, I associate it more with software engineers; in fact that’s who we’re building for. tho ops these days are also often software engineers, and vice versa. everyone here is working on platforms and APIs, so it’s different in a way that PARTICULARLY matters to you.
  11. You have an observable system when your team can quickly and reliably track down any new problem without anticipating it or shipping new code. i would describe it as a system where you and your team can regularly track down the root cause of unknown-unknown problems. Known-unknowns would be like, “oh shit, mysql is running out of connections. Argh, this must be the unicorn restart bug biting us again, can someone go clean it up by hand, or blacklist some app servers?” It doesn’t mean you know what to do about it, but it does mean you know you don’t have to spend hours or days trying to figure out what the problem is. In general, you only have to encounter a problem once for it to become a known unknown, if your eng culture is good and you’re good at retrospectives and sharing. We learned this the hard way at Parse. We had some of the best eng in the world, and yet, by the time we got acquired, we had built a system that was effectively undebuggable. We had a traditional monitoring stack, then we got our stuff into facebook’s scuba, and it changed our lives. We went from taking hours or days to debug any unique user problem, to taking minutes or seconds. I realized afterwards that this was the difference between monitoring and observability.
  12. We don’t *know* what the questions are; all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too. And yet, the tools we have to understand and support these systems were mostly designed and invented in the LAMP stack era. They’re great tools! Splunk, DataDog, etc — they’re terrific, they just keep getting better and better and richer and more mature. But in many critical ways they reflect assumptions that were true when they were designed. Assumptions like, “there will always be a server” or a host to collect and sort by. Or like, the idea that doing a deep dive on a single host to debug a problem is a really valuable thing to do. Or like the idea that all the bazillion things in /proc are the most meaningful thing you can explore in order to track down a hard problem. Doing even simple basic things like service-level awareness that’s agnostic to hosts can be oddly challenging. Other times the ‘tell’ is that it relies on schemas, or indexes, or expects you to jump around from tool to tool as you’re exploring your system with no way to connect the dots. Or that business intelligence data is completely segregated away from system and performance data. Or that it still accepts unstructured strings as events. Our tools are designed to answer known questions faster. They do a good job of that. But our new problems are mostly that of unknown-unknowns. Let me illustrate this with an example: “your photos are loading slowly.”
  13. Parse, 2015: LAMP stack => distributed systems, monitoring => observability, known unknowns => unknown unknowns. and now the systems that many of us are running resemble the national power grid. especially if you’re running a platform, because you’re inviting so much local chaos. or google, or facebook. we are gonna refer to these as the LAMP stack and microservices stacks, just for convenience, even tho not entirely fair or true.
  14. “Complexity is increasing” - Science By my precise calculations, you

    can see that, the complexity of infrastructure and storage options will be incomprehensible by any human on earth in mmmm six months, give or take. and you KNOW IT’S TRUE, I HAVE A GRAPH. We toss the word complexity around a lot, but what does it actually mean? What *IS* complexity of systems?
  15. monolith => microservices, “the database” => polyglot persistence, users => developers, single tenant => multi tenancy, app could reason about => def cannot reason about, test-driven development => o11y-driven development. Parallel trends: at parse we were doing microservices before they were called microservices. With microservices, you’re taking the functionality that was in your code base, and you’re splitting it up across the network. You have to hop boundaries of services, logical deploy units, etc. You can’t debug your code, you have to debug your *systems*. instead of having *a* database that you got incredibly familiar with and could smell the problem in, you probably have a bunch of storage types, and you have to debug them more naively, the same way you do your own code. instead of users, you have developers. if they’re writing their own code and uploading it to your systems, if they’re writing their own queries… then it’s a platform. chaos in your system means unknown-unknowns to debug. the more creativity and functionality you have given your users, the harder it will be to debug. every system has points of multitenancy, of shared resources. Those will be by far the hardest for you to debug, and the most important! because any one user has the potential to affect all your other customers. if the question is whether that customer or everyone else is more important, you know the answer. yet determining this can be extremely difficult. systems have reached a level of complexity that you just can’t keep in your head. and you shouldn’t try. this is why we have tools. platforms feel this first, but you’re lucky, you’re on the bleeding edge of a change everyone is subject to.
  16. Many catastrophic states exist at any given time. Your system is never entirely ‘up’. The first lesson of distributed systems and complex systems in general is that they are never truly “up”. You know that good feeling you get when you look at your dashboard and see green everywhere? It’s a big fucking lie. And it’s a lie that turns toxic with complex distributed systems, because you start to believe that if you haven’t graphed it, it doesn’t exist.
  17. Distributed systems are particularly hostile to being cloned or imitated

    (or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …) You can't spin up a copy of Facebook. You can't spin up a copy of the national power grid. Some things just aren't amenable to cloning. And that's fine. You simply can't usefully mimic the qualities of size and chaos that tease out the long, thin tail of bugs or behaviors you care about. Facebook doesn't try to spin up a copy of Facebook either. They invest in the tools that allow thousands and thousands of engineers to deploy safely to production every day and observe users interacting with the code they wrote. So does Netflix. So does everyone who is fortunate enough to outgrow the delusion that this is a tractable problem.
  18. Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. this is a black hole for engineering time. so what do you test in production, and what do you test in staging?
  19. but here’s where i’m going to make an inflammatory statement that i nevertheless stand completely behind: you must test in production. every single software engineer. no exceptions. you must test with real users, real patterns, real data, real concurrency, etc. Until you’ve done that, you haven’t really tested.
  20. Real data. Real users. Real traffic. Real scale. Real concurrency. Real network. Real deploys. Real unknown-unknowns. you have to talk to real services, real data, real users, real network. and you have to do it without impacting your customers or cotenants. you can either accept this and put the work into doing it well, or deny it and continue to do a shit job of it. but you have limited engineering cycles, you may as well spend them wisely. test in prod has gotten a bad rap. i blame this guy:
  21. I blame this guy: unfortunately, it’s wrong, it’s misleading, and

    it makes people waste energy in the wrong place. First of all, it’s a false dichotomy. I feel like there’s this false implication that testing in production somehow implies that you *don’t* test in other ways.
  22. testing in production doesn’t mean you can’t or don’t test

    the shit out of it before you ship. you know? it’s not like there’s a limit on the number of tests you can do. But you do have a limited amount of energy and time. Maybe there are ways we can group the type of testing that we do, and talk about where our energy SHOULD go. People love to tell you what not to do, it’s like our favorite thing to do in engineering. It’s harder to say what you should do. but what if …
  23. how they think we are / how we really are. but nobody likes the phrase “test in prod”. it’s scary, managers hate it. that’s fine! let’s use a friendlier phrase: observability-driven development. no one will ever know. :)
  24. unit tests integration tests functional tests basic failover test before

    prod: … the basics. the simple stuff. known-unknowns this is table stakes, this is boring, right?
  25. behavioral tests experiments load tests (!!) edge cases canaries rolling

    deploys multi-region test in prod: These are actually the only interesting problems out there. If you’re wanting to test a feature, maybe you want to just roll it out to 1% of brazilian users. not all failures are created equal
  26. test in staging? meh. Your system is never “up”, it exists in a partially degraded state at all times, you just don’t know about it, because your tools aren’t good enough, because you haven’t been investing in understanding what’s happening in production right now as much as you probably should be. if you have a wall of green dashboards, your tools aren’t good enough; i guarantee you don’t have 0 problems. This is another place that distributed systems provide useful context. DS is all about the unknown unknowns, about surrendering to the fundamental uncertainty that is computers.
  27. test before prod: unit tests, integration tests, functional tests — “What happens when …” (you know the answer). test in prod: behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region — “What happens when …” (you don’t). they’re not mutually exclusive. they are complementary. they are both absolutely necessary. we spend more time focusing on the known unknowns because we like to pretend we can exert some control over the universe. :) when you know the answer, you can add a test for it, and you should. when you don’t know the answer, you should experiment and learn what you can under controlled conditions. whether that’s a feature or an infra-wide change.
  28. Only production is production. You can ONLY verify the deploy for any env by deploying to that env. as anyone who’s ever deployed to “producktion” (with a k) knows, each deploy is testing a unique combination of the software you’re deploying, the deploy tools themselves, and the state of the target environment. You can test your ass off on a staging cluster, then click the button and realize it does something completely different for prod. The only way to make it safe is to do it often.
  29. 1. Every deploy is a *unique* exercise of your process+


    code+system 2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production. It's easy to get dragged down into bikeshedding about cloning environments and miss the real point: Only production is production, and every time you deploy there you are testing a unique combination of deploy code + software + environment. (Just ask anyone who's ever confidently deployed to "Staging", and then "Producktion" (sic).) Deploy code rots at least as fast as your other code. Deploy as often as you can!
  30. Why do people sink so much time into staging, when

    they can’t even tell if their own production environment is healthy or not? because their o11y is so shitty they literally have no idea what’s going on in production, usually You're shipping code every day and causing self-inflicted damage on the regular, and you can't tell what it's doing before, during, or after. It's not the breaking stuff that's the problem; you can break things safely. It's the second part— not knowing what it's doing—that's not OK.
  31. That energy is better used elsewhere: Production. You can catch

    80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q You have a limited amount of energy, and staging copies are a quicksand / blackhole for engineering energy to vanish down and never emerge again. There's a lot of value in testing... to a point. But if you can catch 80% to 90% of the bugs with 10% to 20% of the effort—and you can—the rest is more usefully poured into making your systems resilient, not preventing failure. caitie mccaffrey has a great PWL talk here about how the majority of errors are simple ones that can be caught by tests. you should do that! there’s no excuse for shipping really basic errors to prod. Most errors are simple ones
  32. feature flags (launch darkly) high cardinality tooling (honeycomb) canary canary

    canaries, shadow systems (goturbine, linkerd) capture/replay for databases (apiary, percona) Do it safely. These are actually the only interesting problems out there. If you’re wanting to test a feature, maybe you want to just roll it out to 1% of brazilian users. not all failures are created equal
  33. Observability driven development: using real production data to decide what

    to build, and then validating and verifying that you shipped what you meant to ship, and it had the impact you desired, using real production data in real time.
  34. wave1: “dear ops, plz learn to write code” wave2: “dear software engineers … your turn” DevOps. but what does that *mean*? testing in production, odd? we’ll get down and dirty with real examples, but first at a very high level: it’s fulfilling the promises of devops, and creating software *owners*. wave one of devops was all about “dear ops, learn to write code”. wave two is all about “ok software engineers: your turn”. software owners are people who have access to, and a basic understanding of, the full software lifecycle. from develop to debug to deploy. science shows us that teams that effectively develop a culture of software ownership have massively, categorically better results than teams that have walls and strict hierarchies or roles.
  35. DEV — The Software Process: ▸ Design documents ▸ Architecture review ▸ Test-driven development ▸ Integration tests ▸ Code review ▸ Continuous integration ▸ Continuous deployment ▸ (Wait for exception tracker to complain). these days, software development is usually super customer-focused. there’s a whole suite of stuff we’re super diligent about: ✳ and we celebrate shipping!! … and we forget about it.
  36. "Works on my machine" DEV "The only good diff is

    a red diff" OPS "What does it look like for the user?" - until an ops person comes knocking on our door, grumpy and complaining about something breaking. ✳ which can make devs respond like this ✳ - :) we all know folks, on either side, who talk like this. - what happened to that diligence? that curiosity? that "oh, i should check these edge cases"? - when we focus on the "us" vs "them" of this conflict, we get folks pointing fingers and hunkering down on their POV. - but what brings us together, in the end, is this contract to the users of our service ✳ — this is what unifies us. so no matter what, we’re in this together, and need to start thinking from the perspective of users in production, instead.
  37. OPS DEV ▸ How to build those features / fix

    those bugs ▸ How features and fixes are scoped ▸ How to verify correctness or completion ▸ How to roll out that feature or fix from the developers’ perspective, we can use data about users in production to really explore what our services are doing. and as a result, we can inform not only what features we build / bugs we fix, but also:
  38. OPS DEV During the development process, we've got lots of

    questions about our system that don't smell like "production monitoring" questions, but should totally rely on production data to answer them. Instead, they're about hypotheticals. Or specific customer segments. Or "what does 'normal' even mean for this system?" (I’d actually posit that any sort of performance work, you just can't do reliably without actually measuring behavior in the wild)
  39. ▸How’s our load? Is it spread reasonably evenly across our

    Kafka partitions? ▸Did latency increase in our API server? Is our new /batch endpoint performing well? ▸How did those recent memory optimizations affect our query- serving capacity? OPS DEV Here are some questions that look an awful lot like standard, production systems-y, ops-y questions… But since we're a SaaS platform, and because we sell to businesses who rely on us to understand their systems, we have a lot of different workloads coming in, each of which we have to be able to understand in isolation from the rest. Our nouns, our things-to-drill-down-by, are always customers.
  40. ▸How’s our load? Are high-volume customers spread reasonably evenly across

    our Kafka partitions? ▸Did latency increase in our API server? Which customers benefit most from our new /batch endpoint? ▸How did those recent memory optimizations affect our query- serving capacity for customers with string-heavy payloads? OPS DEV + … and once you start talking about customers, and their workloads, and how that might impact or affect what the code does… then devs start to care about the answers to these, too. Your nouns might be different. You might care about expensive Mongo queries, or high-value shopping carts, or the countries responsible for processing transactions. But the premise is the same — that the most painful problems to debug, the most annoying things to realize, the most important edge cases to understand, often originate from that one outlier doing something unexpected, while the code expected something else. and these are problems that ops people often find, but devs have to help fix.
  41. APM (for devs) / MONITORING (for ops). here are some examples of how folks currently answer these questions. some folks would label this side monitoring, or this side APM; some would label this side “for ops people” and this side “for developers”; i say it doesn’t matter. the goal here is observability. the goal is not a form factor but to be able to answer questions with data. and these labels invite folks to erect walls, to create “us” vs “them”. these tools want to create a divide, but — remember? — we’re all in this together. we’re all software owners, trying to provide a service to our customers. (can i sketch out more attributes of either side, showing that both sides provide info / encourage activity that developers will benefit from?)
  42. The First Wave of DevOps: teaching ops folks to code. The Second Wave of DevOps: teaching devs to own code in production. OPS + DEV. We’re here today because this wall that folks think of — between “works on my machine” and “it’s over the wall now and in production” — has to come down. We’re entering a world with containers and microservices and serverless systems, and there’s too much code in production for developers to not take ownership of what gets deployed. the first wave of devops was about teaching ops people to code. teaching them to automate their workflows, to do more development. the second wave is going to have to be about teaching devs to get comfortable owning their code in production. and observability — teaching and encouraging devs to look at prod, to poke at prod, to understand prod through data — is how we get there.
  43. which one is easier to map to the code i'm

    writing? tools matter. the data we feed into those tools matter. and our tools for understanding prod can’t just talk in terms of CPU and memory and disk space, not if we want devs—who think in terms of build IDs and codepaths and customers—to use them too.
  44. 09:00 (THESIS): observability should be a core part of the

    development process. As the software is being built — not as it’s being evaluated for reliability, or right before it’s deployed — we need to be feeding useful information into each step of that process. As much as testing or documentation, understanding what our code does in production teaches us how to build better software. All the time. Observability isn’t just for finding out when things are broken or going wrong — it’s for understanding your software. "what does normal look like" e.g. "oh, these errors are normal"—the logs always look like this; that raised line is a red herring.
  45. DEV — The Software Process: DEBUG & decide what to build, BUILD the darn thing, VERIFY that it works (on my machine), WATCH it for errors, VERIFY that it works (in production). ▸Design documents ▸Architecture review ▸Test-driven development ▸Integration tests ▸Code review ▸Continuous integration ▸Continuous deployment ▸(Wait for exception tracker to complain). let’s go back to that slide of all the things that developers do ✳ and break them down into some more manageable steps ✳ (PAUSE at VERIFY WFM) what informs each of these? is it just our fantastic developer intuition? do we have a PM feeding us instructions? … i mean, the answer to both of those might be yes. but i sure hope we have some real data coming in as well! let’s take a look at some real-world examples of what using data in each of these stages of the development process might look like.
  46. PRODUCTION SYSTEMS DEBUG 11:00 This is the most boring case

    for me, because there’s really no shortage of people and software telling developers what needs to be fixed: alerts, exceptions, people on Twitter; they all help identify work that needs to be done. Data comes in to play, though, in refining that directive — when you’re going from a high-level "elevated latency" alert to "X part of your system is misbehaving under Y circumstances."
  47. customer query (e.g.): COUNT of events where ip_address="52.204.108.25" DEBUG -

    this might be a classic example, but i’ll give it anyway, since, well, at least one part of it had a happy ending. - honeycomb ingests your events, then lets you query over them. we optimize for really fast analytics on our column store, but we also support folks asking to view the full events that make up a given query. - one of our very large customers was able to run a simple COUNT query just fine, but was experiencing timeouts when flipping to raw data mode—even when the two queries were theoretically drawing from the same set of rows. - we were able to jump into our data, figure out what was "normal" in aggregate, then zoom in on just the two queries that were under examination. - without going into detail, since my time is limited :), we were able to figure out where we were doing a whole bunch of extra work in this very specific case, and add to our benchmarks for the future. there are certain things that can only be debugged or diagnosed in production, at scale. and by having all of the information we needed, from a developer’s perspective, in prod—we could start high level, zoom in on it, then essentially replay the past.
  48. DEBUG - this is another example of the sort of

    high-level analysis we were able to do on this one customer’s data. we were noticing that a small number of the timestamps they sent us were splayed all over the place, and also causing degraded query performance—and, somehow, the bottom graph seems to show one particular availability zone resulting in… really weird latencies getting sent in. - we were able to go back to the customer with something super specific for them to look into
  49. BUILD THE DARN THING. 15:00 There's a difference between knowing that something should be changed or built and knowing how it should be. Understanding the potential impact of a change we're making—especially something that'll have a direct, obvious impact on users—lets us bring data into the decision-making process. By learning about what "normal" is (or at least—what "reality" is), we can figure out whether our fix is actually a fix or not.
  50. BUILD ? - when we were first starting out, we knew we needed to build real rate limiting into our API—our previous band-aidy solution involved an in-process cache, where each server kiiind of just approximated the rate limit individually. - while this worked well enough, it meant that as our traffic grew, there was an unwanted correlation between which server a single request hit, and which customers were actually being rate limited. we needed to spin up a shared cache and clean up this tech debt ✳ - but instead of just blindly making the change, we wanted to be confident it’d actually behave correctly. while we perfected the logic and heuristics, we could start capturing data in prod to simulate the logic, to see what the steady-state behavior would be, without impacting customers. - So alongside the logic in the API server that calculated a particular request's rate limit, we added a bit that tracked whether or not a given request would have been rate limited by the new algorithm.
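(here’s roughly what that “would have been rate limited” bit might look like — a hedged sketch, not honeycomb’s actual code; the Limiter interface, the X-Team-ID header, and the log line standing in for event fields are all assumptions:)

```go
package ingest

import (
	"log"
	"net/http"
)

// Limiter is the minimal interface both the old (per-process) and new
// (shared-cache) rate limiters satisfy in this sketch.
type Limiter interface {
	Allow(key string) bool
}

// Handler enforces the old limiter, evaluates the new one in shadow mode,
// and records both outcomes so they can be compared against real traffic.
func Handler(oldL, newL Limiter) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		team := r.Header.Get("X-Team-ID") // stand-in for however you identify the customer

		allowed := oldL.Allow(team)    // the decision we actually enforce today
		wouldAllow := newL.Allow(team) // the decision we only observe

		// In real code these become fields on the request's event;
		// a structured log line stands in for that here.
		log.Printf("team=%s did_hit_rate_limit=%t would_hit_rate_limit=%t",
			team, !allowed, !wouldAllow)

		if !allowed {
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusAccepted)
		// ... normal ingest path ...
	}
}
```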
  51. BUILD DID HIT RATE LIMIT WOULD HAVE HIT RATE LIMIT

    - And then we could visualize the glorious new hypothetical future - We could see that - as expected - the new rate limiting algorithm was more strict in a number of places, especially around large spikes. - but we now had enough information here to examine each case individually and assess whether the change was what we intended! - and the ability to actually identify which customers would have been rate limited… allowed us to work with each customer to help them understand the change in behavior.
  52. BUILD - 10:00 here’s another example. we wanted to introduce some basic compression into our string columns, and hypothesized that datasets with lots of unique strings (high cardinality) would benefit less from compression. - before we started building, we gathered data on the cardinality characteristics of string columns in production, today. - the conclusion we reached — most but not all string columns are low-cardinality, so most but not all columns will benefit from compression — let us set expectations before we jumped in and did the time-consuming implementation of our new compression scheme.
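(a sketch of the kind of cardinality check that informs a decision like this — hypothetical and in-memory, where the real thing ran against production column data:)

```go
package cardinality

// columnCardinality reports, for each string column, the ratio of distinct
// values to total values. Ratios near 0 suggest dictionary compression will
// help a lot; ratios near 1.0 suggest it won't.
func columnCardinality(columns map[string][]string) map[string]float64 {
	out := make(map[string]float64, len(columns))
	for name, values := range columns {
		if len(values) == 0 {
			continue
		}
		distinct := make(map[string]struct{}, len(values))
		for _, v := range values {
			distinct[v] = struct{}{}
		}
		out[name] = float64(len(distinct)) / float64(len(values))
	}
	return out
}
```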
  53. BUILD with debug statements in prod traffic. i love these stories because they’re like — debug statements in your prod traffic. By having a flow where it’s lightweight and natural for software engineers to add these diagnostic bits and suddenly be able to describe the execution of your logic in the wild, we can make more informed decisions and deliver better experiences to our customers.
  54. ON MY MACHINE VERIFY testing is great and all, but

    how do we select our test cases? a lot of this tends to be "intuition," or guessing at edge cases that matter. why not use prod data to determine which test cases are worthwhile?
  55. it’s pretty nice to be able to go take a

    look at your prod data to ask, hey… across all of the payloads hitting our API, what’s the distribution of the number of columns within each payload? what are the edge cases we should make sure to handle? note: this right here... is why pre-production testing is never going to be enough. the test cases are determined by humans, and even though we're using "data in the wild" to inform those, who knows what crazy things will happen next week, next month, next year?
  56. IN PROD VERIFY 
 - we love the concept of

    "testing in production." - testing on your machine is all well and good, but what do we do when we aren't quiiite sure about our change, or want to check its impact in some lightweight or temporary way?
  57. FEATURE FLAGS VERIFY 
 - 21:00 We love feature flags

    for letting us essentially test in production while we're still making sure the change is one we're happy with shipping. - Pairing feature flags with our observability tooling lets us get incredibly fine-grained visibility into our code's impact. - Being able to get these arbitrary feature flags into our dogfood cluster, means we can look at our normal top-level metrics, segmented by flags. - This becomes incredibly powerful when we do things like turn a feature flag on for a very small amount of traffic, while still retaining the ability to slice by hostname, customer, whatever.
  58. VERIFY (PROD) — FEATURE FLAGS

    // Dataset-keyed feature flags
    FlagColdStorageDataset    = BoolDatasetFlag{"cold-storage-dataset", false}
    FlagColdStorageQuery      = BoolDatasetFlag{"cold-storage-query", true} // note default true
    FlagHiresInternalHeatmaps = BoolDatasetFlag{"hires-internal-heatmaps", false}
    FlagTwoPassHeatmaps       = BoolDatasetFlag{"two-pass-heatmaps", false}
    FlagVarstringDictWrite    = BoolDatasetFlag{"varstring-compression-write", false}
    FlagVarstringDictRead     = BoolDatasetFlag{"varstring-compression-read", false}

    - The best part about this approach is the ability for your developers to define these ephemeral, specific segments of your data freely, to answer the sorts of... ephemeral questions that pop up during development. - At this point, we actually send all live feature flags and their values along with all payloads to our storage engine dataset, so that as we observe future experiments, we don't have to stop first and remember to add the feature flag to a metric name or dashboard -- they're all just right there, mapped to the flags that we as software engineers are already thinking about.
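(a sketch of what “send all live flags with every payload” can look like — the BoolDatasetFlag lookup here is a hypothetical stand-in for however the flags are actually resolved per dataset:)

```go
package flags

// BoolDatasetFlag mirrors the shape shown on the slide: a named boolean flag
// with a default, overridable per dataset. The override lookup is invented
// for illustration; the talk doesn't show how it's really stored.
type BoolDatasetFlag struct {
	Name    string
	Default bool
}

// overrides maps flag name -> dataset ID -> value.
var overrides = map[string]map[string]bool{}

func (f BoolDatasetFlag) Enabled(datasetID string) bool {
	if ds, ok := overrides[f.Name]; ok {
		if v, ok := ds[datasetID]; ok {
			return v
		}
	}
	return f.Default
}

// AddFlagsToEvent stamps every live flag's value onto the outgoing event, so
// later queries can group or filter by "flag.<name>" without anyone having
// to remember to add it to a metric name or dashboard first.
func AddFlagsToEvent(fields map[string]interface{}, datasetID string, live []BoolDatasetFlag) {
	for _, f := range live {
		fields["flag."+f.Name] = f.Enabled(datasetID)
	}
}
```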
  59. VERIFY (PROD) STORY TIME: we feature flag heavily. lots for UX reasons, but also anytime we make changes to our storage layer -- these are hard to test locally (both because of scale and diversity of types of data). About a year ago, we were making changes to our string storage format. It was intended to make things more flexible but we wanted to make sure that it didn’t impact performance significantly. - It was a twitchy change, and something that we wanted to roll out very carefully, one dataset at a time—this would give us greater isolation and ability to compare performance impact than if we’d done it one storage host at a time. - What feature flags and this observability mindset let us do—we could go from these top-level metrics around volume and cumulative speed of these writes
  60. VERIFY (PROD) - … to being able to inspect "COUNT

    and AVG(cumulative write latency) for each write for datasets in the experiment, vs metrics for datasets not in the experiment." - This graph’s purple line actually shows writes that are flagged in to the new code, and seems to somehow show latency going up for both groups when datasets are flagged in.
  61. VERIFY (PROD) - 26:00 … Turns out there was just a correlation between write speed and another factor (size of the write) that wasn’t shown on the previous graph. By poking around and adding that one in, we could pull those lines out, recognize that they’re just steady-state slower than the rest, and convince ourselves that the storage format change was stable after all. - (This example actually made it onto our blog, and you can find it at https://honeycomb.io/blog/2017/08/how-honeycomb-uses-honeycomb-part-5-the-correlations-are-not-what-they-seem/)
  62. IS IT STILL WORKING? LET’S WATCH 24:00 I almost left

    this stage out of the deck, thinking you all know how to do this - but I want to press on this, because it’s a great example of the ops/dev split I touched on earlier. - What’s the biggest source of chaos in systems? Humans. Software Engineers. Us, pushing code. :) Our tools have to be able to reflect this chaos - accurately. - The state of the art for this these days seems to be drawing a line to mark a deploy, or matching up timestamps against some other record — but anyone who’s spent enough time deploying their code knows that deploys aren't instantaneous
  63. WATCH - For example, we do rolling deploys of our

    storage nodes - and here you can see the progression of the deploy, the switchover, happen.
  64. WATCH - Here’s a more dramatic example. I was watching

    a deploy go out containing some changes that should have had no impact on performance—but saw the average latency for query reads go up a whole bunch. - I was able to pinpoint the build where that started, go figure out what else made it in (turns out another storage layer change landed :)), and give the engineer a heads up. - And, you can see where they reverted their change and latencies largely returned to normal. - If you think about the sorts of observability tools that developers tend to rely on the most, because they speak their language, you think of… what? Exception trackers and crash reporting tools. - Workflows to associate degenerate behavior with an outdated build are half the value of nice exception trackers — and if we just remember to fold some of those nouns, some of that vocabulary, into our current tools, then maybe it can start to be more natural to reach for a different set of tools. - https://ui-dogfood.honeycomb.io/dogfood/datasets/retriever-query/result/46UgNbvzkbu
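(one small, standard Go trick that makes this workflow possible: stamp a build ID into the binary at compile time and attach it to every event, so a regression can be pinned to the exact build. The variable name here is an assumption; the -ldflags -X mechanism is standard Go tooling:)

```go
package main

import "log"

// BuildID is stamped at compile time, e.g.:
//
//	go build -ldflags "-X main.BuildID=$(git rev-parse --short HEAD)"
//
// and attached as a field to every event the process emits, so a latency
// change can be correlated with the build that introduced it.
var BuildID = "dev"

func main() {
	log.Printf("api server starting, build_id=%s", BuildID)
	// ... attach BuildID to every outgoing event ...
}
```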
  65. ▸ Form hypotheses about what code will do in prod

    ▸ Add/tweak instrumentation as necessary ▸ Query data to (in)validate hypotheses ▸ Take action (and repeat as necessary) OPS DEV That's how we help bridge this "ops" vs "dev" gap, and that's how we devs start thinking about what happens after we ship:❇
  66. DEBUG VERIFY (WFM ) VERIFY (PROD) BUILD ASK NEW QUESTIONS

    WATCH SHIP BETTER SOFTWARE Instrumentation and observability shouldn’t just be checkboxes for the "end" of the dev cycle; they should be embedded into each stage of the development process and continually checked to keep us grounded with "what’s normal" / "what’s really happening." By capturing more, lightweight, transient information in prod — even before we flip a switch and ship real code — we’ll be better-informed, make better choices, and deliver better experiences for our users.
  67. Exceptions! Git commits! Customer complaints! IDEs! Alerts! Overloaded hosts! Metrics! Load balancers! End-to-end checks. Observability. OPS / DEV. Yep, the things that we deal with day-to-day may be different, depending upon which end of the spectrum we fall on… but in the end, we build bridges that bring us together. - things like e2e checks => the ideal way to say, "hey: all i care about is this core functionality of our product. does it work?" - transcends devs or ops—potentially use dev skills and ops techniques to make sure things keep working. - observability => asking new questions about our systems. Not just for ops.
  68. TAKING THE FIRST FEW STEPS ▸ Start at the edge

    with basic, common attributes (e.g. HTTP) - 36:00 Okay. So. How do we get there, from nothing, without getting lost in a giant rabbit hole? ✳ - Start at the edge and build it up: in our webapp, for a long time the least sensitive part of our system, we just stuck in a middleware that turned each HTTP request into an event for logging. - But that... was actually pretty great! We were able to toss it into dogfood there and immediately answer super-high-level questions like: "if I make this change, will anybody care? Is anybody hitting this route regularly?"
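(a minimal sketch of that kind of middleware — log.Println stands in for whatever event sink you actually use, and the fields are just the basic, common HTTP attributes:)

```go
package obsmw

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// statusRecorder captures the status code the downstream handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// EventPerRequest wraps a handler and emits one structured event per HTTP
// request: method, path, status, latency, and a couple of basics.
func EventPerRequest(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		event := map[string]interface{}{
			"method":      r.Method,
			"url":         r.URL.Path,
			"status":      rec.status,
			"duration_ms": float64(time.Since(start)) / float64(time.Millisecond),
			"remote_addr": r.RemoteAddr,
			"user_agent":  r.UserAgent(),
		}
		if b, err := json.Marshal(event); err == nil {
			log.Println(string(b)) // stand-in for your event sink
		}
	})
}
```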
  69. TAKING THE FIRST FEW STEPS ▸ Start at the edge

    with basic, common attributes (e.g. HTTP) ▸ Business-relevant or infrastructure-specific characteristics (e.g. customer ID, DB replica set) - On top of those standard HTTP attributes like URL, status code, latency, and build ID, we made sure to include our business-relevant characteristics (like customer id and dataset id) and any infrastructure-specific characteristics (like the kafka partition an API write was going to, for example, or storage nodes queried).
  70. TAKING THE FIRST FEW STEPS ▸ Start at the edge

    with basic, common attributes (e.g. HTTP) ▸ Business-relevant or infrastructure-specific characteristics (e.g. customer ID, DB replica set) ▸ Temporary additional fields for validating hypotheses - And this sets you up for the sort of ad-hoc, ephemeral queries that drive development forward. - Remember, many of those examples we ran through earlier -- the rate limiting example, or the feature flagged storage example -- relied on data that we had to add, often in parallel to the actual code being written. The simpler and smoother it is for engineers to add this sort of metadata on the fly, the more we’ll be able to use it to make sure we're building the right things.
  71. TAKING THE FIRST FEW STEPS ▸ Start at the edge

    with basic, common attributes (e.g. HTTP) ▸ Business-relevant or infrastructure-specific characteristics (e.g. customer ID, DB replica set) ▸ Temporary additional fields for validating hypotheses ▸ Prune stale fields (if necessary) - Some fields (e.g. timers) we'll just leave in place in case they're useful in the future. Other fields do eventually get pruned when they're no longer useful or are noisy, the same way feature flags do.
  72. SOME BEST PRACTICES ▸ Contextual, structured data. Some things that have helped us keep our data clean and easy to work with: contextual, structured data. In our API code, we attach a map to our request contexts, pick up attributes or tags along the way as code is executed, then send the payload off upon http response. This ensures collection of a ton of useful context alongside the metrics we care about visualizing, as well as providing a consistent mental model of request -> single unit of work, or an event.
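(a sketch of that request-context pattern — hypothetical helper names; the point is one mutable field map per request, flushed as a single event when the response goes out:)

```go
package instrument

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
)

type eventKey struct{}

// AddField records one attribute on the request's event, if one is attached.
// Anything in the codepath that has the request context can call it.
func AddField(ctx context.Context, key string, val interface{}) {
	if fields, ok := ctx.Value(eventKey{}).(map[string]interface{}); ok {
		fields[key] = val
	}
}

// Middleware attaches a mutable field map to the request context, lets the
// handler (and everything it calls) accumulate attributes, then sends the
// whole payload as one unit of work when the response goes out.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fields := map[string]interface{}{"url": r.URL.Path}
		ctx := context.WithValue(r.Context(), eventKey{}, fields)

		next.ServeHTTP(w, r.WithContext(ctx))

		// One request -> one event. json+log stand in for the real sink.
		if b, err := json.Marshal(fields); err == nil {
			log.Println(string(b))
		}
	})
}
```

deeper in the codepath, anything holding the request context just calls instrument.AddField(ctx, "team_id", teamID) and the attribute shows up on that request’s event.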
  73. SOME BEST PRACTICES ▸ Contextual, structured data ▸ Common set

    of nouns and consistent naming - Establishing a common set of nouns that we care about, and being consistent with our naming patterns (e.g. always `app_id`, not sometimes `appId` or `application-id`) will help make Future Us not hate Past Us. - Those common "business-relevant characteristics" I mentioned? Those map to a consistent set of metadata on all events we send (hostname, customer ID, dataset ID, build ID, etc), no matter which part of the system we’re looking at.
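(one cheap way to enforce that consistency — a shared set of field-name constants; the names here mirror fields from the example schema later in the deck:)

```go
package instrument

// One canonical spelling for the nouns every service shares, so Future Us
// never has to guess between app_id, appId, and application-id.
const (
	FieldServerHostname = "server_hostname"
	FieldTeamID         = "team_id"
	FieldDatasetID      = "dataset_id"
	FieldBuildID        = "build_id"
)
```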
  74. SOME BEST PRACTICES ▸ Contextual, structured data ▸ Common set

    of nouns and consistent naming ▸ Don't be dogmatic; let the use case dictate the ingest pattern - Ultimately, instrument services according to your need. Think about how you're likely to read the data and how important certain attributes will be relative to others.
  75. SOME BEST PRACTICES ▸ Contextual, structured data ▸ Common set of nouns and consistent naming ▸ Don't be dogmatic; let the use case dictate the ingest pattern ▸ e.g. instrumenting individual reads while batching writes. - For example, we have a service where we care a lot about read performance characteristics, so we capture lots of timers and information about the shape of the read, for every single read request. - On the other hand, we're generally interested in the health of our datastore's write path, but it's a high-throughput codepath that has been really highly optimized -- so it's a rare example of being a) very sensitive to the performance overhead of capturing events, and b) less interested in granular observability (especially because we already learned whatever we needed to at the API layer). As a result, we actually batch up writes per dataset (still a pretty high-cardinality field!) and do some pre-aggregation to make life saner for everyone.
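(a sketch of that batch-and-pre-aggregate pattern for the write path — hypothetical types, with the periodic flusher that would send one event per dataset left out for brevity:)

```go
package writestats

import "sync"

// writeAgg pre-aggregates per-dataset write metrics so the hot write path
// only bumps a few counters instead of emitting an event per write.
type writeAgg struct {
	mu    sync.Mutex
	stats map[string]*datasetStats
}

type datasetStats struct {
	Count      int64
	TotalBytes int64
	TotalMs    float64
}

func newWriteAgg() *writeAgg {
	return &writeAgg{stats: make(map[string]*datasetStats)}
}

// Record is called on the write path: cheap, and keyed by dataset ID
// (still a pretty high-cardinality field).
func (a *writeAgg) Record(datasetID string, bytes int64, durMs float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.stats[datasetID]
	if !ok {
		s = &datasetStats{}
		a.stats[datasetID] = s
	}
	s.Count++
	s.TotalBytes += bytes
	s.TotalMs += durMs
}

// Flush returns the aggregates collected so far and resets the counters;
// a background goroutine would turn each entry into one event per dataset.
func (a *writeAgg) Flush() map[string]*datasetStats {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.stats
	a.stats = make(map[string]*datasetStats)
	return out
}
```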
  76. first pass: - server_hostname - method - url - build_id

    - remote_addr - request_id - status - x_forwarded_for - error - event_time - team_id - payload_size - sample_rate then we added: - dropped - get_schema_dur_ms - protobuf_encoding_dur_ms - kafka_write_dur_ms - request_dur_ms - json_decoding_dur_ms +others a couple of days later, we added: - offset - kafka_topic - chosen_partition AN EXAMPLE SCHEMA EVOLUTION And, just because I always like to show folks what our experience is really like, here’s how the schema has evolved for us observing our own API:
  77. first pass: - server_hostname - method - url - build_id

    - remote_addr - request_id - status - x_forwarded_for - error - event_time - team_id - payload_size - sample_rate then we added: - dropped - get_schema_dur_ms - protobuf_encoding_dur_ms - kafka_write_dur_ms - request_dur_ms - json_decoding_dur_ms +others a couple of days later, we added: - offset - kafka_topic - chosen_partition after that: - memory_inuse - num_goroutines a week after that: - warning - drop_reason and on and on, adding 2-3 fields every couple of weeks: - user_agent - unknown_columns - dataset_partitions - dataset_id - dataset_name - api_version - create_marker_dur_ms - marker_id - nil_value_for_columns - batch - gzipped - batch_datapoint_lens - batch_num_datasets - batch_process_datapoints_dur_ms AN EXAMPLE SCHEMA EVOLUTION
  78. devs, your mission: ▸ Stop writing software based on intuition,

    start backing it up with data ▸ Teach observability tools to speak more than "Ops" ▸ ??? (← ask lots of questions and validate hypotheses) ▸ Profit! 39:00 ## In Conclusion - I don't care what tools you use, we should all be doing these things. ✳ - big or small, compliance or no compliance — the folks involved in shipping software should understand the behavior of our systems in production. ✳ - Software developers, we should own observability — because we have the most to gain! ✳ 
 - Observability should be a core part of how we understand what to build, how to build it, and who we’re building it for. ✳ - We have the power to use data, be better engineers, and ship better software
  79. Tearing down this wall is the work of a generation. but complex systems demand software owners, not writers or operators or nannies or dilettantes or absentee parents or … wave one of devops was all about “dear ops, learn to write code”. wave two is all about “ok software engineers: your turn”. software owners are people who have access to, and a basic understanding of, the full software lifecycle. from develop to debug to deploy. science shows us that teams that effectively develop a culture of software ownership have massively, categorically better results than teams that have walls and strict hierarchies or roles. writing code doesn’t make you an owner any more than donating sperm makes you a parent. Software demands owners, not operators: and not absentee parents either. It is genuinely terrifying how many senior software engineers do not understand the systems they have built -- are building! Some don't even seem to believe their systems can or should be comprehensible. They write the code that passes the tests and they hand it over the wall to the ops teams with a shudder and everybody agrees to pretend it is a black box forever <insert goat sacrifice> Dude, the reason I feel this strongly is because of my experience at Parse. We built a marvelous, rich platform. And we couldn't debug it. Some of the best engineers in the world, and we were spending all our time chasing down one-offs instead of building the platform. This undebuggability is a function of complexity and sheer number of possible root causes. Everything is an unknown unknown, so you never catch up using traditional tools. Platforms always feel this first, because you're inviting user chaos on to your backend. Hard to fence. Tearing down the fucking wall is the work of a generation, but the only answer is instrumentation and observability -- developing software that explains itself from the inside out, and testing that what you actually shipped is what you meant to ship, every time. Shepherding your code through its challenging fourth trimester, when baby code meets cold cruel world. You can't just drop that shit off at the fire station and hope it survives and doesn't kill anyone.
  80. the shift from monitoring to observability mirrors the shift from monoliths to microservices, known unknowns to unknown unknowns. the mental shift is from operating to owning, and from preventing failure to embracing failure and making it your friend. dear software engineers, it’s not so bad here. the dopamine hits are huge. don’t fear the reaper. don’t fear production. create a production system you don’t have to fear. give everybody the ability and encouragement to play with prod. get used to playing in the sandbox. know what normal looks like. know how to debug your own shit. know how to get your hands dirty and play around. know how to develop by first adding observability, then developing to spec. get used to asking lots of small questions about your systems, all the time.
  81. EXAMPLES of ODD EXAMPLES of ODD - parse golang rewrite,

    splitter, shadow - should we even build this feature? who would be impacted if we rolled out this change? - ship instrumentation all the time. get used to teasing and playing with the data