Observability and the Glorious Future (with Liz Fong-Jones)

V6-21 Charity Majors (slides by Liz Fong-Jones), CTO, Honeycomb, @mipsytipsy
at Infrastructure & Ops Superstream: Observability. Observability and the Glorious Future, w/ illustrations by @emilywithcurls!

Observability is evolving quickly. “Your bugs are evolving faster.”

And the problem space is complex. [Diagram labels: data; actions (instrument, query); outcomes (operational resilience, managed tech debt, quality code, predictable release, user insight).] Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.

Practitioners need velocity, reliability, & scalability. You DO NOT ACTUALLY KNOW if your code is working or not until you have observed it in production.

A small but growing team builds Honeycomb.

We deploy with confidence.

When it comes to software, speed is safety. Like ice skating, or bicycling. Speed up, gets easier. Slow down, gets wobblier.

All while traffic has surged 3-5x in a year.

Write workload, trailing year

Read workload, trailing year

Our confidence recipe:

Quantify reliability. “Always up” isn’t a number, dude. And if you think you’re “always up,” your telemetry is terrible.

Identify potential areas of risk. So many teams never look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol.

Design experiments to probe risk. Outages are just experiments you didn’t think of yet :D

Prioritize addressing risks.

Measuring reliability:

How broken is “too broken”?

© 2021 Hound Technology, Inc. All Rights Reserved.

Service Level Objectives (SLOs): Define and measure success! Popularized by Google, widely adopted now!

SLOs are common language. SLOs are the APIs between teams that allow you to budget and plan instead of reacting and arguing. Loose coupling FTW!

Think in terms of events in context. P.S. if you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability. Rich context is the beating heart of observability.

Is this event good or bad?
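
One way to make that question concrete is a small function that looks at a single structured event and answers good, bad, or not applicable. The sketch below is illustrative only: the field names (`request.type`, `status_code`, `duration_ms`) and the 10-second threshold are assumptions made for this example, not Honeycomb's actual schema or SLI definition.

```python
from typing import Any, Mapping, Optional


def query_sli(event: Mapping[str, Any], max_duration_ms: float = 10_000.0) -> Optional[bool]:
    """Classify one event: True = good, False = bad, None = not counted.

    All field names here are hypothetical placeholders.
    """
    # Only query events are eligible for this particular SLI.
    if event.get("request.type") != "query":
        return None
    # A missing status or duration is treated as a failure, not a success.
    succeeded = event.get("status_code", 500) < 500
    fast_enough = event.get("duration_ms", float("inf")) <= max_duration_ms
    return succeeded and fast_enough


events = [
    {"request.type": "query", "status_code": 200, "duration_ms": 850},
    {"request.type": "query", "status_code": 200, "duration_ms": 12_500},
    {"request.type": "ingest", "status_code": 200, "duration_ms": 3},
]
results = [query_sli(e) for e in events]
print(results)  # [True, False, None]
```

Returning None for ineligible events keeps them out of both the numerator and the denominator, so unrelated traffic cannot dilute or inflate the SLI.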

Honeycomb's SLOs reflect user value.

We make systems humane to run,

by ingesting telemetry,

enabling data exploration,

and empowering engineers.

What Honeycomb does
• Ingests customer’s telemetry
• Indexes on every column
• Enables near-real-time querying on newly ingested data
Data storage engine and analytics flow

SLOs are user flows
Honeycomb’s SLOs
• home page loads quickly (99.9%)
• user-run queries are fast (99%)
• customer data gets ingested fast (99.99%)

Service-Level Objectives
• Example Service-Level Indicators:
  ◦ 99.9% of queries succeed within 10 seconds over a period of 30 days.
  ◦ 99.99% of events are processed without error in 5ms over 30 days.
• 99.9% ≈ 43 minutes of violation in a month.
• 99.99% ≈ 4.3 minutes of violation in a month.
but services aren't just 100% down or 100% up. DEGRADATION IS UR FRIEND
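
The "minutes of violation" figures follow from simple arithmetic on a 30-day window. A minimal sketch, assuming a time-based reading of the target (the function name is made up for this example; the event-based SLIs above would count bad events rather than minutes):

```python
def allowed_violation_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of total violation a time-based SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - target) * window_minutes


print(round(allowed_violation_minutes(0.999), 1))   # 43.2
print(round(allowed_violation_minutes(0.9999), 2))  # 4.32
```

Because services degrade rather than flip between fully up and fully down, a partial outage spends this budget more slowly: ten minutes at a 10% failure rate costs about as much as one minute of total failure.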

Data-driven decisions and tradeoffs.

Should we invest in more reliability?

Is it safe to do this risky experiment?
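
Both questions can be answered from the same number: how much of the error budget is left. The sketch below is one possible way to compute it from event counts. The function names and the 25% cutoff are assumptions for illustration, not a documented Honeycomb policy.

```python
def budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, nothing spent
    allowed_bad = (1.0 - target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def safe_to_experiment(remaining: float, min_remaining: float = 0.25) -> bool:
    """Hypothetical rule: only run risky experiments with budget to spare."""
    return remaining >= min_remaining


remaining = budget_remaining(good=999_500, total=1_000_000, target=0.999)
print(f"{remaining:.0%} of the budget left")  # 50% of the budget left
print(safe_to_experiment(remaining))          # True
```

A nearly untouched budget suggests room to move faster or experiment; an exhausted or negative one is the signal to invest in reliability first.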

How to stay within SLO: simple answers, then more complicated answers

Accelerate: State of DevOps 2021

What's our recipe?

Instrument as we code.

Functional and visual testing.

Design for feature flag deployment.
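
To show what designing for flags can look like in code, here is a minimal percentage-rollout sketch. It is not the flagging system Honeycomb uses; the flag name, the hashing scheme, and `handle_request` are all hypothetical.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    # First 8 bytes of the hash give a stable bucket in steps of 0.01%.
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
    return bucket < rollout_percent


def handle_request(user_id: str) -> str:
    # New code ships dark and is switched on for a small slice of users first.
    if flag_enabled("new-ingest-path", user_id, rollout_percent=5.0):
        return "new code path"
    return "old code path"
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while giving different flags independent rollout populations. Setting the percentage to 0 acts as an instant off switch without a deploy.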

Automated integration & human review.

Green button merge.

Auto-updates, rollbacks, & pins.

Observe behavior in prod. No Friday Deploys Don’t Merge and Run!

Repeatable infrastructure with code.

If infra is code, we can use CI & flags!

Ephemeral fleets & autoscaling.

Quarantine bad traffic. It is possible to both do some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools. What, like you were ever going to find those bugs in staging?

Validating our expectations

Experiment using error budgets.

Always ensure safety.

Data persistence is tricky.

Stateless request processing. Stateful data storage.

[Diagram: an event batch fans out as single events into three partition queues; three indexing workers, each with field indexes; S3.]

Infrequent changes.

Data integrity and consistency.

Delicate failover dances

[Diagram: the same partition queues, indexing workers, field indexes, and S3, with one indexing worker shown as an indexing replay.]

Experimenting in prod

Restart one server & service at a time. The goal is to test, not to destroy.
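
A one-at-a-time restart can be sketched as a loop that refuses to move on until the previous host is healthy again. This is an illustrative outline under stated assumptions: `restart` and `is_healthy` are hypothetical callables supplied by the caller, standing in for whatever tooling actually performs the restart and the health check.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    attempts: int = 5,
    interval_s: float = 1.0,
) -> List[str]:
    """Restart hosts one at a time; stop at the first host that fails to recover."""
    restarted: List[str] = []
    for host in hosts:
        restart(host)
        for _ in range(attempts):
            if is_healthy(host):
                break
            time.sleep(interval_s)
        else:
            # Halt here so the experiment never takes down more than one host.
            raise RuntimeError(f"{host} did not recover; restarted so far: {restarted}")
        restarted.append(host)
    return restarted


log: List[str] = []
done = rolling_restart(["a", "b", "c"], restart=log.append,
                       is_healthy=lambda host: True, interval_s=0)
print(done)  # ['a', 'b', 'c']
```

Stopping on the first failure is the point of the design: the blast radius stays at one host, and the remaining hosts keep serving while people investigate.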

At 3pm, not at 3am.

"Bugs are shallow with more eyes."

Monitor for changes using SLIs. Monitoring isn’t a bad word, it just isn’t observability. SLOs are a modern form of monitoring.
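
One common way to watch an SLI during an experiment is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below shows the general technique, not Honeycomb's alerting implementation; the function names are made up for this example.

```python
def burn_rate(bad: int, total: int, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - target)


def hours_until_exhausted(rate: float, window_hours: float, budget_left: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the current rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_left * window_hours / rate


rate = burn_rate(bad=20, total=1_000, target=0.99)
print(round(rate, 2))                                # 2.0
print(round(hours_until_exhausted(rate, 30 * 24)))   # 360
```

A burn rate of 1 spends the budget exactly over the SLO window. A rate of 2 spends a 30-day budget in 15 days, which makes a sudden jump during an experiment an early signal to stop.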

Debug with observability.

Test the telemetry too!

Verify fixes by repeating.

[Diagram: partition queues, indexing workers, field indexes, and S3, labeled FORESHADOWING.]

[Diagram: alerting workers and a ZooKeeper cluster.] Yes, it is 2022 and people are still running zookeeper. People like us.

De-risk with design & automation.

[Diagram: three partition queues, now with six indexing workers, each with field indexes; S3.]

Continuously verify to stop regression.

Save money with flexibility.

ARM64 hosts. Spot instances.

Not every experiment succeeds. But you can mitigate the risks.

Three case studies of failure
• Ingest service crash
• Kafka instability
• Query performance degradation
and what we learned from each.

1) Shepherd: ingest API service
Shepherd is the gateway to all ingest
• highest-traffic service
• stateless service
• cares about throughput first, latency close second
• used compressed JSON
• gRPC was needed.

Honeycomb Ingest Outage
• In November, we were working on OTLP and gRPC ingest support
• Let a commit deploy that attempted to bind to a privileged port
• Stopped the deploy in time, but scale-ups were trying to use the new build
• Latency shot up, took more than 10 minutes to remediate, blew our SLO

Now what?
• We could freeze deploys (oh no, don’t do this!)
• Delay the launch? We considered this...
• Get creative!

2) Kafka: data bus
Kafka provides durability
• Decoupling components provides safety.
• But introduces new dependencies.
• And things that can go wrong.

Our month of Kafka pain
Longtime Confluent Kafka users
First to use Kafka on Graviton2 at scale
Changed multiple variables at once
• move to tiered storage
• i3en → c6gn
• AWS Nitro
Read more: go.hny.co/kafka-lessons

Unexpected constraints
We thrashed multiple dimensions.
We tickled hypervisor bugs.
We tickled EBS bugs.
Burning our people out wasn't worth it.
Read more: go.hny.co/kafka-lessons

Take care of your people
Existing incident response practices
• Escalate when you need a break / hand-off
• Remind (or enforce) time off work to make up for off-hours incident response
Official Honeycomb policy
• Incident responders are encouraged to expense meals for themselves and family during an incident

Optimize for safety
Ensure people don’t feel rushed.
Complexity multiplies
• if a software program change takes t hours,
• software system change takes 3t hours
• software product change also takes 3t hours
• software system product change = 9t hours
Maintain tight feedback loops, but not everything has an immediate impact.
Source: Code Complete, 2nd Ed.

3) Retriever: query service
Retriever is performance-critical
• It calls to Lambda for parallel compute
• Lambda use exploded.
• Could we address performance & cost?
• Maybe.

Making progress carefully

Fast and reliable: pick both! Go faster, safely.

Takeaways
• Design for reliability through full lifecycle.
• Feature flags can keep us within SLO, most of the time.
• But even when they can't, find other ways to mitigate risk.
• Discovering & spreading out risk improves customer experiences.
• Black swans happen; SLOs are a guideline, not a rule.

Acknowledge hidden risks
Examples of hidden risks
• Operational complexity
• Existing tech debt
• Vendor code and architecture
• Unexpected dependencies
• SSL certificates
• DNS
Discover early and often through testing.

Takeaways
• We are part of sociotechnical systems: customers, engineers, stakeholders
• Outages and failed experiments are unscheduled learning opportunities
• Nothing happens without discussions between different people and teams
• Testing in production is fun AND good for customers
• Where should you start? DELIVERY TIME

Understand & control production. Go faster on stable infra. Manage risk and iterate.

Find out more
Read our blog! hny.co/blog
We're hiring! hny.co/careers

www.honeycomb.io
