
Observability and the Glorious Future


Slides from the O'Reilly Infrastructure & Ops Superstream, based on an ADDO keynote from Liz Fong-Jones.

How do modern teams develop and ship code and debug it in production? We give an overview of the Honeycomb backend, then discuss some chaos engineering experiments and SLO violations, and how we use fine-tuned modern production tooling to increase engineering efficiency and scale our services by continually testing them. In production.

Includes speaker notes.

Charity Majors

January 12, 2022



Transcript

  1. V6-21 Charity Majors (slides by Liz Fong-Jones) CTO, Honeycomb @mipsytipsy

    at Infrastructure & Ops Superstream: Observability Observability And the Glorious Future w/ illustrations by @emilywithcurls! “You may experience a bit of cognitive dissonance during this talk, since you are probably familiar with Liz’s slide style, even-handedness, and general diplomatic approach. I have tried to play Liz on TV; it didn’t fool anybody. So anytime you see an annoying little pony pop up mouthing off: don’t blame Liz.”
  2. V6-21 Observability is evolving quickly. 2 “Your bugs are evolving

    faster” Everybody here has heard me fucking talk about observability and what it is. So instead, we’re going to walk you through the future we’re already living in at Honeycomb. O11y is the ability to understand our systems without deploying new instrumentation. The o11y space has a lot of product requirements that are evolving quickly: a large volume of data to analyze, and increasing demands from users on tooling.
  3. V6-21 3 INSTRUMENT QUERY OPERATIONAL RESILIENCE MANAGED TECH DEBT QUALITY

    CODE PREDICTABLE RELEASE USER INSIGHT Outcomes Actions DATA And the problem space is complex. Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid. We care about predictable releases, quality code, managing tech debt, operational resilience, and user insights. Observability isn’t frosting you put on the cake after you bake it. It’s about ensuring that your code is written correctly, performing well, and doing its job for each and every user. Code goes in from the IDE and comes out your o11y tool. How do you get developers to instrument their code? How do you store the metadata about the data, which may be multiple times the size of the data? And none of it matters if you can’t actually ask the right questions when you need to.
  4. V6-21 Practitioners need velocity, reliability, & scalability. 4 You DO

    NOT ACTUALLY KNOW if your code is working or not until you have observed it in production. A lot of people seem to feel like these are in tension with each other: product velocity vs. reliability or scalability.
  5. V6-21 A small but growing team builds Honeycomb. 5 At

    Honeycomb we’re a small engineering team, so we have to be very deliberate about where we invest our time, and have automation that speeds us up rather than slowing us down. We have about 100 people now, and 40 engineers, which is 4x as many as we had two years ago. We’re 6 years in now, and for the first 4 years we had 4-10 engineers. Sales used to beg me not to tell anyone how few engineers we had, whereas I always wanted to shout it from the rooftops. Can you BELIEVE the shit that we have built and how quickly we can move? I LOVED it when people would gawp and say they thought we had fifty engineers.
  6. V6-21 We deploy with confidence. 6 One of the things

    that has always helped us compete is that we don’t have to think about deploys. You merge some code, and it gets rolled out to dogfood, prod, etc. automatically. On top of that, we comfortably deploy on Fridays. Obviously. Why would we sacrifice 20% of our velocity? Worse yet, why would we let merges pile up for Monday? We deploy every weekday and avoid deploying on weekends.
  7. V6-21 7
  8. V6-21 When it comes to software, speed is safety. Like

    ice skating, or bicycling. Speed up, it gets easier. Slow down, it gets wobblier. Here’s what that looks like: this graph shows the number of distinct build_ids running in our systems per day. We ship 10-14x per day. This is what high agility looks like for a dozen engineers. Despite this, we almost never have emergencies that need a weekend deploy. Wait, I mean BECAUSE OF THIS.
  9. V6-21 All while traffic has surged 3-5x in a year.

    I would like to remind you that we are running the combined production loads of several hundred customers. Depending on how they instrumented their code, perhaps multiples of their traffic. We’ve been doing all this during the pandemic, while shit accelerates. And if you think this arrow is a bullshit excuse for a graph, we’ve got better ones.
  10. V6-21 Read workload, trailing year. Reads have 3-5x’d. This is

    a lot of scaling for a team to have to do on the fly, while shipping product constantly, and also laying the foundation for future product work by refactoring and paying down debt.
  11. V6-21 Our confidence recipe: 5:00 We talk a pretty big

    game. So how do we balance all these competing priorities? How do we know where to spend our incredibly precious, scarce hours?
  12. V6-21 Quantify reliability. 13 “Always up” isn’t a number, dude.

    And if you think you’re “always up,” your telemetry is terrible. It’s not just tech but cultural processes that reflect our values: prioritizing high agility on the product and maintaining reliability, and figuring out that sweet spot where we maintain both.
  13. V6-21 Identify potential areas of risk. So many teams never

    look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol. This requires continuous improvement, which means addressing the entropy that inevitably takes hold in our systems: proactively looking at what’s slowing us down from shipping, and investing our time in fixing it when it starts to have an impact. If you wait for it to page you before you examine your code, you’re counting on a quadruple bypass instead of eating your vegetables.
  14. V6-21 How broken is “too broken”? 18 How broken is

    “too broken”? How do we measure that? —-- (next: intro to SLOs) The system should survive LOTS of failures. Never alert on symptoms or disk space or CPUs.
  15. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Service

    Level Objectives (SLOs) Define and measure success! Popularized by Google, widely adopted now! At Honeycomb we use service level objectives, which represent a common language between engineering and business stakeholders. We define what success means according to the business and measure it with our system telemetry throughout the lifecycle of a customer. That’s how we know how well our services are doing, and it’s how we measure the impact of changes.
  16. V6-21 SLOs are common language. SLOs are the APIs between

    teams that allow you to budget and plan instead of reacting and arguing. Loose coupling FTW! They’re a tool we use as a team internally to talk about service health and reliability.
  17. V6-21 Think in terms of events in context. 21 P.S.

    if you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability. Rich context is the beating heart of observability. What events are flowing thru your system, and what’s all the metadata?
  18. V6-21 Is this event good or bad? 22 [event from

    above being sorted by a robot or machine into the good or bad piles]
  19. V6-21 Honeycomb's SLOs reflect user value. 23 And the strictness

    of those SLOs depends on the reliability that users expect from each service. SLOs serve no purpose unless they reflect actual customer pain and experience.
  20. V6-21 We make systems humane to run, 24 Honeycomb’s goal

    as a product is to help you run your systems humanely, without waking you up in the middle of the night or making you tear your hair out trying to figure out what’s wrong.
  21. V6-21 by ingesting telemetry, 25 The way we do that

    is by ingesting your systems’ telemetry data
  22. V6-21 enabling data exploration, 26 And then making it easy

    to explore that data by asking ad-hoc, novel questions. Not pre-aggregated queries, but anything you might think of.
  23. V6-21 and empowering engineers. 27 And then we make your

    queries run performantly enough so that you feel empowered as an engineer to understand what’s happening in your systems. Exploration requires sub-second results. Not an easy problem.
  24. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. What

    Honeycomb does • Ingests customer’s telemetry • Indexes on every column • Enables near-real-time querying on newly ingested data Data storage engine and analytics flow. Honeycomb is a data storage engine and analytics tool: we ingest our customers’ telemetry data and then enable fast querying on that data.
  25. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. SLOs

    are user flows Honeycomb’s SLOs • home page loads quickly (99.9%) • user-run queries are fast (99%) • customer data gets ingested fast (99.99%) SLOs are for service behavior that has customer impact. At Honeycomb we want to ensure things like: the in-app home page should load quickly with data, user-run queries should return results fast, and customer data we’re trying to ingest should be stored fast and successfully. These are the sorts of things that our product managers and customer support teams frequently talk to engineering about. However, if a customer runs a query of some crazy complexity, it can take up to 10 seconds, and it’s OK if one fails once in a while. But our top priority is ingest: we want to get it out of our customers’ RAM and into Honeycomb as quickly as possible.
  26. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Service-Level

    Objectives 30 30 • Example Service-Level Indicators: ◦ 99.9% of queries succeed within 10 seconds over a period of 30 days. ◦ 99.99% of events are processed without error in 5ms over 30 days. • 99.9% ≈ 43 minutes of violation in a month. • 99.99% ≈ 4.3 minutes of violation in a month. But services aren’t just 100% down or 100% up. DEGRADATION IS UR FRIEND. If a service is only 1% degraded, then we have 4300 minutes to investigate and fix the problem.
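
For reference, the arithmetic behind those budget numbers; a quick sketch in Go (just the math, not Honeycomb's SLO tooling):

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns how long a service may be in violation over the
// given window while still meeting the SLO target.
func errorBudget(target float64, window time.Duration) time.Duration {
	return time.Duration((1 - target) * float64(window))
}

func main() {
	window := 30 * 24 * time.Hour // a 30-day SLO window

	fmt.Println(errorBudget(0.999, window).Round(time.Minute))  // ~43m
	fmt.Println(errorBudget(0.9999, window).Round(time.Second)) // ~4m19s

	// If only 1% of requests are failing (degradation, not a full outage),
	// the 99.9% budget stretches ~100x, to roughly 4300 minutes (72h).
	fmt.Println(time.Duration(float64(errorBudget(0.999, window)) / 0.01).Round(time.Minute))
}
```
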
  27. V6-21 Is it safe to do this risky experiment? 33

    Too much is as bad as too little. We need to induce risk in order to rehearse, so that we can move faster.
  28. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 35

    35 Accelerate: State of DevOps 2021 You can have many small breaks, but not painful ones. Elite teams can afford to fail quickly
  29. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. What's

    our recipe? 36 36 How do we go about turning lines of code into a live service in prod, as quickly and reliably as possible?
  30. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Instrument

    as we code. 37 37 We practice observability-driven development. Before we even start implementing a feature we ask, “How is this going to behave in production?” and then we add instrumentation for that. Our instrumentation generates not just flat logs but rich structured events that we can query and dig down into the context.
  31. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Functional

    and visual testing. 38 38 We don’t stop there. We lean on our tests to give us confidence long before the code hits prod. You need both meaningful tests and rich instrumentation. Not clicking around, but using libraries and user stories, so we c
  32. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Design

    for feature flag deployment. 39 39 We intentionally factor our code to make it easy to use feature flags, which allows us to separate deploys from releases and manage the blast radius of changes. Roll out new features as no-ops from the user perspective. Then we can turn on a flag in a specific environment or for a canary subset of traffic, and then ramp it up to everybody. But we have a single build, the same code running across all environments.
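
A minimal sketch of what that separation looks like in code, assuming a hypothetical FlagClient interface, flag name, and handler names (not Honeycomb's actual flag SDK or code):

```go
package ingest

// FlagClient stands in for whatever feature-flag SDK is in use.
type FlagClient interface {
	// BoolVariation returns the flag value for a given customer, falling
	// back to the default if the flag service can't be reached.
	BoolVariation(flag, customerID string, def bool) bool
}

type Server struct {
	flags FlagClient
}

func (s *Server) handleEvent(customerID string, payload []byte) error {
	// The new path ships in the same build as the old one, but stays a
	// no-op until the flag is enabled for an environment or a canary
	// slice of traffic. Turning it off is instant; no redeploy needed.
	if s.flags.BoolVariation("new-ingest-path", customerID, false) {
		return s.handleEventV2(payload)
	}
	return s.handleEventV1(payload)
}

func (s *Server) handleEventV1(payload []byte) error { return nil } // existing path (stub)
func (s *Server) handleEventV2(payload []byte) error { return nil } // flagged path (stub)
```
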
  33. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Automated

    integration & human review. 40 40 That union of human and technology is what makes our team a socio-technical system. You need to pay attention to both, so I made sure our CI robot friend gets a high five. All of our builds complete within 10 min, so you aren’t going to get distracted and walk away. If your code reviewer asks for a change, you can get to it quickly. Tight feedback loops
  34. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Green

    button merge. 41 41 Once the CI is green and reviewers give the thumbs up, we merge. No “wait until after lunch” or “let’s do it tomorrow.” Why? Want to push and observe those changes while the context is still fresh in our heads. We merge and automatically deploy every day of the work week.
  35. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Auto-updates,

    rollbacks, & pins. 42 42 We’ll talk more about how our code auto-updates across environments, and the situations when we’ll do a rollback or pin a specific version We roll it out thru three environments: kibble, dogfood, then prod.
  36. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Observe

    behavior in prod. 43 43 No Friday Deploys Don’t Merge and Run! And finally we bring it full-circle. We observe the behavior of our change in production using the instrumentation we added at the beginning. Your job is not done until you close the loop and observe it in prod. We check right away and then a little bit later to see how it’s behaving with real traffic and real users. It’s not “no friday deploys”, it’s “don’t merge and run” ——- (next: three environments)
  37. V6-21 Repeatable infrastructure with code. All our infrastructure is code

    under version control. All changes are subject to peer review and go through a build process. It’s like gardening, you need to be proactive about pulling weeds. This way we never have changes in the wild.
  38. V6-21 If infra is code, we can use CI &

    flags! On top of that, we use cloud services to manage our Terraform state. We used to have people applying infrastructure changes from their local dev environments using their individual AWS credentials. With a central place to manage those changes, we can for example, limit our human AWS user permissions to be much safer. We use Terraform Cloud, and they’re kinda the experts on Terraform. We don’t have to spend a bunch of engineering resources standing up systems to manage our Terraform state for us. They already have a handle on it.
  39. V6-21 Ephemeral fleets & autoscaling. We can turn on or

    off AWS spot in our autoscaling groups, and feature flags allow us to say: hey, under certain circumstances, let’s stand up a special fleet. It’s pretty dope when you can use Terraform variables to control whether or not infra is up. We can automatically provision ephemeral fleets to catch up if we fall behind in our most important workloads.
  40. V6-21 Quarantine bad traffic. It is possible to both do

    some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools. What, like you were ever going to find those bugs in staging? If we have a misbehaving user, we can quarantine them to a subset of our infrastructure. We can set up a set of paths that get quarantined, so we can keep it from crashing the main fleets, or do more rigorous testing. That way we can observe how their behavior affects our systems, with CPU profiling or memory profiling, and prevent them from affecting other users. —- (to Shelby)
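
As a sketch of the idea (the lookup table and host names are hypothetical, not Honeycomb's actual routing code), the edge only needs a per-customer routing decision:

```go
package edge

// Router decides which ingest fleet serves a given customer.
type Router struct {
	quarantined map[string]bool // customer/dataset IDs flagged for isolation
}

func (r *Router) upstreamFor(customerID string) string {
	if r.quarantined[customerID] {
		// Misbehaving traffic lands here, where we can turn on CPU or
		// memory profiling and absorb crashes away from the main fleet.
		return "ingest-quarantine.internal:443"
	}
	return "ingest.internal:443"
}
```
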
  41. V6-21 Experiment using error budgets. You may be familiar with

    the four key DORA metrics and the research published in the Accelerate book. These metrics aren’t independent data points. you can actually create positive virtuous cycles when you improve even one of those metrics. And that’s how we did it at Honeycomb. If you have extra budget, stick some chaos in there
  42. V6-21 Always ensure safety. 50 Chaos engineering is _engineering_, not

    just pure chaos. And if you don’t have observability, you probably just have chaos.
  43. V6-21 51 We can use feature flags for an experiment

    on a subset of users, or internal users
  44. V6-21 Data persistence is tricky. That works really well for

    stateless stuff but not when each request is not independent or you have data sitting on disk.
  45. V6-21 Stateless request processing & Stateful data storage How do

    we handle a data-driven service in a way that allows us to become confident in that service? All frontend services are stateless, of course. But we also have a lot of Kafka, retriever, and MySQL. We deploy our infra changes incrementally to reduce the blast radius. We’re able to do that because we can deploy multiple times a day. There’s not a lot of manual overhead. So we can test the effects of changes to our infrastructure with much lower risk.
  46. V6-21 Let’s zoom in on the stateful part of that

    infra diagram. We deploy our infra changes incrementally to reduce the blast radius. We’re able to do that because we can deploy multiple times a day. There’s not a lot of manual overhead. So we can test the effects of changes to our infrastructure with much lower risk.
  47. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3]

    Data flows in to shepherd, and that constitutes a batch of events, on the left. What do we do with those? We split them apart and send them to the appropriate partition. If one partition is not writable, we can send to a different one. The middle tier here is Kafka. Within a given Kafka partition, events are preserved in order. This is important, because if you don’t have a deterministic ordering, it’s very hard to ensure data integrity: you won’t have an idempotent or reliable source of where this data is coming from and what you should expect to see. The indexing workers decompose events into one file per column per set of attributes. So “service name” comes in from multiple events, and then on each indexing worker, service name from multiple events becomes its own file that gets appended to in order, based on the order it’s read from Kafka. And then we finally tail it off to AWS S3.
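
To make the "one file per column, appended in partition order" idea concrete, here's a toy sketch; the Event type, file layout, and names are illustrative, not retriever's actual on-disk format:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

type Event map[string]any

// appendColumns appends every field of every event to a per-column file,
// in the order events were read from the Kafka partition. Replaying the
// partition in the same order rebuilds identical files.
func appendColumns(dir string, events []Event) error {
	for _, ev := range events { // preserve partition (arrival) order
		for col, val := range ev {
			f, err := os.OpenFile(filepath.Join(dir, col+".col"),
				os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
			if err != nil {
				return err
			}
			fmt.Fprintln(f, val)
			f.Close()
		}
	}
	return nil
}

func main() {
	events := []Event{
		{"service.name": "shepherd", "duration_ms": 12},
		{"service.name": "retriever", "duration_ms": 340},
	}
	if err := appendColumns(os.TempDir(), events); err != nil {
		fmt.Println(err)
	}
}
```
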
  48. V6-21 Infrequent changes. What are the risks? Well, kafka doesn’t

    change much. We update it maybe every couple of months. It also runs on very stable machines that don’t change very often. Unlike shepherd, which we run on spot instances and which is constantly churning, we make sure Kafka is on stable machines. They rarely get restarted without us.
  49. V6-21 Data integrity and consistency. We have to make sure

    we can survive the disappearance of any individual node, while also not having our nodes churn too often.
  50. V6-21 Delicate failover dances There’s a very delicate failover dance

    that has to happen whenever we lose a stateful node, whether that is kafka, zookeeper, or retriever.
  51. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3]

    So what happens if we lose a Kafka broker? What’s supposed to happen is that all brokers have replicas on other brokers.
  52. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3]

    When there’s a new Kafka node available, it receives all of the partitions that the old Kafka node was responsible for, and may or may not get promoted to leader in its own due time.
  53. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3]

    If we lose an indexing worker: well, we don’t run a single indexer per partition, we run two. The other thing that’s supposed to happen is
  54. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers, one being rebuilt by an indexing replay → S3]

    We are supposed to be able to replay and restore that indexing worker, either from a peer (the original design, about 5 years ago) or from filesystem snapshots. Either way you have a stale set of data that you’re replaying from a backup, and now the ordering of that partition queue becomes really important, right? Because you KNOW where you snapshotted, you can replay that partition forward. If your snapshot is no more than an hour old, then you only have to replay the most recent hour. Great! So, how can we test this?
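
A sketch of that restore-then-replay shape, assuming the snapshot records the last partition offset it covers; Consumer and Snapshot here are hypothetical interfaces, not a particular Kafka client API:

```go
package retriever

type Snapshot struct {
	Offset int64 // last partition offset already applied to the index
}

type Consumer interface {
	Seek(partition int32, offset int64) error
	Next() (offset int64, event []byte, err error) // returns an error once caught up
}

// restore replays everything after the snapshot. Deterministic per-partition
// ordering is what makes this safe: replaying from Offset+1 reproduces
// exactly the suffix the snapshot is missing.
func restore(c Consumer, partition int32, snap Snapshot, apply func([]byte) error) error {
	if err := c.Seek(partition, snap.Offset+1); err != nil {
		return err
	}
	for {
		_, ev, err := c.Next()
		if err != nil {
			return err // e.g. a "caught up" sentinel ends the replay
		}
		if err := apply(ev); err != nil {
			return err
		}
	}
}
```
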
  55. V6-21 Experimenting in prod This is how we continuously test

    our kafkas and retrievers to make sure they’re doing what we expect.
  56. V6-21 Restart one server & service at a time. 64

    The goal is to test, not to destroy. First of all, we’re testing, not destroying. One server from one service at a time. These are calculated risks, so calculate.
  57. V6-21 At 3pm, not at 3am. 65 Don’t be an

    asshole. You want to help people practice when things go wrong, and you want to practice under peak conditions!
  58. V6-21 Monitor for changes using SLIs. 67 Monitoring isn’t a

    bad word, it just isn’t observability. SLOs are a modern form of monitoring. Let’s monitor our SLIs: did we impact them?
  59. V6-21 Debug with observability. 68 When something does go wrong,

    it’s probably something you didn’t anticipate (duh) which means you rely on instrumentation and rich events to explore and ask new questions.
  60. V6-21 Test the telemetry too! 69 It’s not enough to

    just test the node. What if you replace a kafka node, but the node continues reporting that it’s healthy? Even if it got successfully replaced, this can inhibit your ability to debug We think it’s important to use chaos engineering not to just test our systems but also our ability to observe our systems.
  61. V6-21 Verify fixes by repeating. 70 If something broke and

    you fixed it, don’t assume it’s fixed til you try
  62. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3]

    Let’s talk about this hypothesis again. What if you lose a Kafka node, and the new one doesn’t come back up?
  63. V6-21 [Diagram: event batch → partition queues (Kafka) → indexing workers with field indexes → S3] FORESHADOWING

    Turns out that testing your automatic Kafka balancing is SUPER important. We caught all kinds of interesting things that have happened inside the Kafka rebalancer simply by killing nodes and seeing whether they come back successfully and start serving traffic again. We need to know this, because if there’s a major outage and we aren’t able to reshuffle the data on demand, this can be a serious emergency. It can manifest as disks filling up rapidly, if you have five nodes consuming the data normally handled by six. And if you’re doing this during daytime peak, and you’re also trying to catch up a brand new Kafka broker at the same time, that can overload the system.
  64. V6-21 [Diagram: two alerting workers and a ZooKeeper cluster] Yes, it is

    2022 and people are still running ZooKeeper. People like us. Let’s talk about another category of failure we’ve found through testing!! So Honeycomb lets our customers send themselves alerts on any defined query if something is wrong, according to the criteria they gave us. We want you to get exactly one alert, not duplicates, not zero. So how do we do this? We elect a leader to run the alerts for a given minute using ZooKeeper. ZooKeeper is redundancy, right?! Let’s kill one and find out!
  65. V6-21 [Diagram: two alerting workers and a ZooKeeper cluster] Annnnd the alerts

    didn’t run. Why? Well, both alerting workers were configured to try to talk to index zero only. We killed a node twice, no problem. The third time, we killed index zero.
  66. V6-21 [Diagram: two alerting workers and a ZooKeeper cluster] I replaced index

    node zero, and the alerting workers didn’t run. So we discovered at 3pm, not 3am, a bug that would eventually have bitten us in the ass and made customers unhappy. The mitigation, of course, was just to make sure that our ZooKeeper client is talking to all of the ZooKeeper nodes.
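
That fix is a one-liner in spirit. A minimal sketch, assuming the go-zookeeper/zk client and made-up host names (not the actual alerting code):

```go
package alerting

import (
	"time"

	"github.com/go-zookeeper/zk"
)

// connectZK hands the client the whole ensemble instead of a single member,
// so losing any one node (including "index zero") doesn't take leader
// election down with it.
func connectZK() (*zk.Conn, error) {
	// Before the fix, this list effectively contained only
	// zookeeper-0.internal, which worked right up until the day the node
	// behind index zero was the one being replaced.
	servers := []string{
		"zookeeper-0.internal:2181",
		"zookeeper-1.internal:2181",
		"zookeeper-2.internal:2181",
	}
	conn, _, err := zk.Connect(servers, 10*time.Second)
	return conn, err
}
```
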
  67. V6-21 [Diagram: three partition queues (Kafka), each feeding a pair of indexing workers that write field indexes to S3]

    Our previous design for retrievers was that if one went down, the other would rsync off its buddy to recover. But what if you lose two indexing workers at the same time from the same partition? Eh, that’ll never happen. So as we’re cycling our retriever fleet, or in the middle of moving them to a new class of instances, wouldn’t it be nice if it didn’t feel like stepping very, very carefully through a crowded minefield to make sure you never hit two of the same partition? What if, instead of having to worry about your peers all the time, you could just replay off the AWS snapshot? Makes your bootstrap choice a lot more reliable. The more workers we have over time, the scarier that was going to become. So yeah, we’re able to do things now that we can restore workers on demand. And we continuously
  68. V6-21 78 Continuously verify to stop regression. Every monday at

    3 pm, we kill the oldest retriever node. Every Tuesday at 3 pm, we kill the oldest Kafka node. That way we can verify continuously that our node replacement systems are working properly and that we are adequately provisioned to handle losing a node at peak. How often do you think we get paged about this at night, now?
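
The job itself can be very small. A sketch with hypothetical helpers standing in for the real autoscaling-group calls; the Monday/Tuesday 3pm scheduling lives in whatever runs the job:

```go
package chaos

import "time"

type Instance struct {
	ID       string
	Launched time.Time
}

// killOldest terminates the longest-running instance of a service: the same
// thing an unplanned failure would eventually do, but at 3pm with everyone
// watching the SLIs.
func killOldest(instances []Instance, terminate func(id string) error) error {
	if len(instances) == 0 {
		return nil
	}
	oldest := instances[0]
	for _, in := range instances[1:] {
		if in.Launched.Before(oldest.Launched) {
			oldest = in
		}
	}
	return terminate(oldest.ID)
}
```
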
  69. V6-21 Save money with flexibility. 79 Finally, we want user

    queries to return fast, but we’re not as strict about this. So we want 99% of user queries to return results in less than 10s —-- (next: back to graph)
  70. V6-21 ARM64 hosts & Spot instances What happens when you

    have lots of confidence in your system’s ability to continuously repair and flex? You get to deploy lots of fun things to help make your life easier and make your service performant and cost less. Out of this entire diagram, our entire forward tier has been converted into spot instances: preemptible AWS instances. Because they recover from being restarted very easily, we can take advantage of that 60% cost savings. Secondly, three of these services are running on Graviton2-class instances, knowing that if there were a problem, we could easily revert.
  71. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Non-trivial

    savings. Production Shepherd EC2 cost, grouped by instance type But having rolled it out, it saved us 40% off our bill. Having the ability to take that leftover error budget and turn it into innovation or turn it into cost savings is how you justify being able to set that error budget and experiment with the error budget. Use it for the good of your service!
  72. V6-21 • Ingest service crash • Kafka instability • Query performance degradation and what we learned from each. Three case studies of failure

    Three things that went catastrophically wrong, where we were at risk of violating one of our SLOs.
  73. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 1)

    Shepherd: ingest API service Shepherd is the gateway to all ingest • highest-traffic service • stateless service • cares about throughput first, latency close second • used compressed JSON • gRPC was needed. For Graviton2, we chose to try things out on shepherd because it’s the highest-traffic service, but it’s also relatively straightforward: it’s stateless and it only scales on CPU. As a service, it’s optimized for throughput first and then latency. We care about getting that data and sucking it out of customers and onto our disks very fast. We were previously using a compressed JSON payload, transmitted over HTTPS. However, there is a newer standard called OpenTelemetry, a vendor-neutral mechanism for collecting data out of a service, including tracing data and metrics. It supports a gRPC-based protocol over HTTP/2. Our customers were asking for this, and we knew it would be better and more effective for them in the long run. So we decided to make sure we can ingest not just our old HTTP JSON protocol, but also the newer gRPC protocol. So we said, okay, let’s go ahead and turn on a gRPC listener. Okay, it works fine!
  74. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 85

    85 Honeycomb Ingest Outage • In November, we were working on OTLP and gRPC ingest support • Let a commit deploy that attempted to bind to a privileged port • Stopped the deploy in time, but scale-ups were trying to use the new build • Latency shot up, took more than 10 minutes to remediate, blew our SLO. Except it was binding on a privileged port and crashing on startup. We managed to stop the deploy in time, thanks to a previous outage we had where we pushed out a bunch of binaries that didn’t build, so we had some health checks in place that would stop it from rolling out any further. That’s the good news. The bad news is, the new workers that were starting up were getting the new binary, and those workers were failing to serve traffic. And not only that, but because they weren’t serving traffic, the CPU usage was zero. So the AWS autoscaler was like: hey, let’s take that service and turn it down, you aren’t using it. So latency facing our end users went really high, and it took us more than 10 minutes to remediate, which blew our SLO error budget.
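
The failure itself is mundane, which is part of why it slipped through. A minimal reproduction of the crash mode, with an illustrative port (not the actual shepherd code):

```go
package main

import (
	"log"
	"net"
)

func main() {
	// An unprivileged process binding below port 1024 gets "permission
	// denied" on Linux, so the new build dies at startup and never passes
	// its health checks.
	ln, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatalf("grpc listener: %v", err)
	}
	defer ln.Close()
	log.Printf("listening on %s", ln.Addr())
}
```
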
  75. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 86

    86 Now what? • We could freeze deploys (oh no, don’t do this!) • Delay the launch? We considered this... • Get creative! The SRE book says freeze deploys. Dear god, no, don’t do this. More and more product changes will just pile up, and the risk increases. Code ages like fine milk. So we recommend changing the nature of your work from product features to reliability work, but using your normal deploy pipeline. It’s changing the nature of the work you’re doing, but it’s not stopping all work, and it’s not setting traps for you later by just blissfully pounding out features and hoping someday they work in production. Should we delay the launch? That sucks; it was a really important launch for a partner of ours, and we knew our users wanted it.
  76. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Reduce

    Risk 87 87 So we decided instead to apply infrastructure feature flagging. We decided to send the experimental path, the HTTP/2 gRPC traffic, to a dedicated set of workers. That way, we keep the 99.9% of users using JSON perfectly happy, because we are tee-ing the traffic for them at the load balancer level. This is how we ensured we could reliably serve as well as experiment. There were some risks, right? We had to ensure we were continuously integrating both branches together, and we had to make sure we had a mechanism for turning it down over time, but those are manageable compared to the cost of either not doing the launch or freezing releases.
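
Written out as Go for illustration (in practice this was a load balancer rule, and the host names here are hypothetical), the split is just a routing decision on the protocol:

```go
package edge

import (
	"net/http"
	"strings"
)

// pickUpstream sends OTLP/gRPC ingest to a dedicated experimental pool,
// while the existing JSON path keeps its proven workers.
func pickUpstream(r *http.Request) string {
	if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
		return "shepherd-grpc.internal:9090" // experimental gRPC workers
	}
	return "shepherd.internal:443" // existing JSON ingest fleet
}
```
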
  77. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 2)

    Kafka: data bus Kafka provides durability • Decoupling components provides safety. • But introduces new dependencies. • And things that can go wrong. So that was one example of making decisions based on our error budget. Let’s talk about a second outage Kafka is our persistence layer. Once shepherd has handed it off to kafka, the shepherd can go away, and it won’t drop data. It’s sitting in a queue waiting for a retriever indexer to pick it up. Decoupling the components provides safety. It means we can restart either producers or consumers more or less on demand, and they’ll be able to catch up and replay and get back to where you were. Kafka has a cost, though: complexity.
  78. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Our

    month of Kafka pain Read more: go.hny.co/kafka-lessons Longtime Confluent Kafka users First to use Kafka on Graviton2 at scale Changed multiple variables at once • move to tiered storage • i3en → c6gn • AWS Nitro Between shepherd and retriever sits Kafka; it allows us to decouple those two services and replay event streams. We were having scalability issues with our Kafka and needed to improve reliability by consolidating. Instead of having 30 Kafka nodes with very, very large SSDs, we realized that because we are only replaying the most recent hour or two of data (unless something goes catastrophically wrong), we don’t need to keep all of it on local SSD. Not only that, but with 30 individual Kafka brokers, if any one of them went bad you would be in the middle of reshuffling nodes, and then if you lost another one it would just be sitting idle, because you can’t do a Kafka rebalance while another rebalance is in process. So we tried tiered storage, which would let us shrink from 30 to 6 Kafka nodes. And the disks on those Kafka brokers might be a little larger, but not 5x larger; instead we’re sending that extra data off to AWS S3. Then Liz, loving ARM64 so much, was like: why are we even using these monolithic nodes and local disks, isn’t EBS good enough? Can’t we use the highest compute power nodes and the highest performance disks? So we are now doing three changes at the same time. We were actually testing Kafka on Graviton2 before even Confluent did, probably the first to use it for production workloads. We changed too many variables at once: we wanted to move to tiered storage to reduce the number of instances, but also tried the arch switch from i3en to c6gn+EBS at the same time, and we also introduced AWS Nitro (hypervisor). That was a mistake. We published a blog post on this experience as well as a full incident report; I highly recommend that you go read it to better understand the decisions we made and what we learned.
  79. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Unexpected

    constraints Read more: go.hny.co/kafka-lessons We thrashed multiple dimensions. We tickled hypervisor bugs. We tickled EBS bugs. Burning our people out wasn't worth it. And it exploded on us. We thought we were going to be right-sizing CPU and disk; instead we blew out the network dimensions. We blew out the IOPS dimensions. Technically, we did not blow our SLO through any of this. Except there is another hidden SLO: that you do not page a Honeycomb team more often than twice a week. Every engineer should have to work an incident out of hours no more than once every six months. They’re on call every month or two, so you should have no more than one or two of those shifts that have an incident. We had to call a halt to the experiment. We were changing too many dimensions at once, chasing extra performance, and it was not worth it.
  80. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Existing

    incident response practices • Escalate when you need a break / hand-off • Remind (or enforce) time off work to make up for off-hours incident response Official Honeycomb policy • Incident responders are encouraged to expense meals for themselves and family during an incident Take care of your people. We have pretty good incident response practices: we have blameless retrospectives, we had people handing off work to each other saying “you know what, I’m too tired, I can’t work on this incident any more,” and we had people taking breaks afterwards. Being an adult means taking care of each other and taking care of yourself. Please expense your meals during an incident. Incidents happen; we had existing practices that helped a lot. The meal policy was one of those things that just made perfect sense once somebody articulated it. One of our values is that we hire adults, and adults have responsibilities outside of work; you won’t build a sustainable, healthy sociotechnical system if you don’t account for that. In general it’s good to document and make official policy out of things that are often unspoken agreements or assumptions.
  81. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Ensure

    people don’t feel rushed. Complexity multiplies • if a software program change takes t hours, • a software system change takes 3t hours • a software product change also takes 3t hours • a software system product change = 9t hours Maintain tight feedback loops, but not everything has an immediate impact. Optimize for safety. Source: Code Complete, 2nd Ed. We rushed a little in doing this. We didn’t blow our technical SLO, but we did blow our people SLO. Hours are an imperfect measurement of complexity, but it’s a useful heuristic to keep in mind: basically, complexity multiplies. Tight feedback loops help us isolate variables, but some things just require observation over time. Isolating variables also makes it easier for people to update their mental models as changes go out.
  82. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Retriever

    is performance-critical • It calls to Lambda for parallel compute • Lambda use exploded. • Could we address performance & cost? • Maybe. 3) Retriever: query service Its job is to ingest data, index it, and make it available for serving. (See Jess and Ian’s talk at Strange Loop.) Retriever fans out to potentially tens of thousands or millions of individual column store files that are stored in AWS S3. So we adopted AWS Lambda and serverless to use massively parallel compute on demand to read through those files on S3.
  83. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 94

    94 Because we had seen really great results with Graviton2 for EC2 instances, we thought, maybe we should try that for Lambda too! So we deployed to 1%, then 50%. Then we noticed things were twice as slow at the 99th percentile. Which means we were not getting cost savings, because AWS Lambda bills by the millisecond, and we were delivering inferior results.
  84. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 95

    95 This is another example of how we were able to use our error budget to perform this experiment, and the controls to roll it back. And you can see that when we turned it off, it just turned off. Liz updated the flag at 6:48 pm, and at 6:48 pm you can see that orange line go to zero.
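
One way flags give you that instant off switch is a deterministic percentage gate. A sketch assuming a simple hash-based rollout (not Honeycomb's actual flag implementation):

```go
package query

import "hash/fnv"

// useArmLambda routes a stable slice of traffic to the experimental arm64
// Lambda. The same request ID always hashes the same way, and setting
// rolloutPercent to 0 stops the experiment immediately, as in the graph
// above.
func useArmLambda(requestID string, rolloutPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(requestID))
	return h.Sum32()%100 < rolloutPercent
}
```
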
  85. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. 96

    96 Making progress carefully After that we decided we were not going to do experiments 50% at a time. We had already burned through that error budget so we started doing more rigorous performance profiling to identify the bottleneck. So we turn it on for a little bit and then we turn it back off. That way we get safety, stability, and the data we need to safely experiment.
  86. V6-21 Fast and reliable: pick both! Go faster, safely. 55:00

    Chaos engineering is something to do once you’ve taken care of the
  87. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Takeaways

    98 98 • Design for reliability through full lifecycle. • Feature flags can keep us within SLO, most of the time. • But even when they can't, find other ways to mitigate risk. • Discovering & spreading out risk improves customer experiences. • Black swans happen; SLOs are a guideline, not a rule. If you’re running your continuous delivery pipelines throughout the day, then stopping them becomes the anomaly, not starting them. So by designing our delivery pipeline for reliability through the full lifecycle, we’ve ensured that we’re mostly able to meet our SLOs. Feature flags can keep us within SLOs most of the time by managing the blast radius. Even when software flags can’t, there are other infrastructure-level things you can do, such as running special workers to segregate traffic that is especially risky. By discovering risk at 3pm, not 3am, you ensure the customer experience is much more resilient, because you’ve actually tested the things that could go bump in the middle of the night. But if you do have a black swan event, it’s a guideline, not a rule. You don’t HAVE to say we’re switching everyone over to entirely reliability work. If you have something like the massive Facebook DNS outage or a BGP outage, it’s okay to hit reset on your error budget and say, you know what, that’s probably not going to happen again. SLOs are for managing predictable-ish risks.
  88. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Examples

    of hidden risks • Operational complexity • Existing tech debt • Vendor code and architecture • Unexpected dependencies • SSL certificates • DNS Discover early and often through testing. Acknowledge hidden risks
  89. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Make

    experimentation routine! 100 100 If you do this continuously, all the time, a conversation like this becomes no longer preposterous. This actually would be chaos engineering, not just chaos. We have the ability to measure and watch our SLOs, we have the ability to limit the blast radius, and the observability to carefully inspect the result. That’s what makes it reasonable to say “hey, let’s try denial-of-servicing our own workers, let’s try restarting the workers every 20 seconds and see what happens. Worst case, hit control-c on the script and it stops.”
  90. V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Takeaways

    101 101 • We are part of sociotechnical systems: customers, engineers, stakeholders • Outages and failed experiments are unscheduled learning opportunities • Nothing happens without discussions between different people and teams • Testing in production is fun AND good for customers • Where should you start? DELIVERY TIME DELIVERY TIME DELIVERY TIME SLOs are an opportunity to have these conversations, find opportunities to move faster, talk about the tradeoffs between stability and speed, and ask whether there are creative things we can do to say yes to both.