Observability and the Glorious Future (with Liz Fong-Jones)

Slide 1

Slide 1 text

V6-21 Observability And the Glorious Future. Charity Majors (slides by Liz Fong-Jones), CTO, Honeycomb, @mipsytipsy, at Infrastructure & Ops Superstream: Observability. With illustrations by @emilywithcurls!

Slide 2

Slide 2 text

Observability is evolving quickly. “Your bugs are evolving faster”

Slide 3

Slide 3 text

And the problem space is complex. Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.
[Diagram labels: Data; Actions: Instrument, Query; Outcomes: Operational resilience, Managed tech debt, Quality code, Predictable release, User insight.]

Slide 4

Slide 4 text

Practitioners need velocity, reliability, & scalability. You DO NOT ACTUALLY KNOW if your code is working or not until you have observed it in production.

Slide 5

Slide 5 text

A small but growing team builds Honeycomb.

Slide 6

Slide 6 text

We deploy with confidence.

Slide 7

Slide 7 text

Slide 8

Slide 8 text

When it comes to software, speed is safety. Like ice skating, or bicycling. Speed up, gets easier. Slow down, gets wobblier.

Slide 9

Slide 9 text

All while traffic has surged 3-5x in a year.

Slide 10

Slide 10 text

Write workload, trailing year

Slide 11

Slide 11 text

Read workload, trailing year

Slide 12

Slide 12 text

Our confidence recipe:

Slide 13

Slide 13 text

Quantify reliability. “Always up” isn’t a number, dude. And if you think you’re “always up,” your telemetry is terrible.

Slide 14

Slide 14 text

Identify potential areas of risk. So many teams never look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol.

Slide 15

Slide 15 text

Design experiments to probe risk. Outages are just experiments you didn’t think of yet :D

Slide 16

Slide 16 text

Prioritize addressing risks.

Slide 17

Slide 17 text

Measuring reliability:

Slide 18

Slide 18 text

How broken is “too broken”?

Slide 19

Slide 19 text

Slide 20

Slide 20 text

SLOs are common language. SLOs are the APIs between teams that allow you to budget and plan instead of reacting and arguing. Loose coupling FTW!

Slide 21

Slide 21 text

Think in terms of events in context. P.S. if you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability. Rich context is the beating heart of observability.

Slide 22

Slide 22 text

Is this event good or bad?
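
One way to read this question: an SLI is a function that labels each structured event as good, bad, or not applicable. The sketch below is only an illustration. The field names (`request.path`, `status_code`, `duration_ms`) and the `/query` path are hypothetical, and the 10 second threshold is borrowed from the example query target on the Service-Level Objectives slide. It is not Honeycomb's actual SLI definition.

```python
from typing import Optional


def classify_event(event: dict, threshold_ms: float = 10_000) -> Optional[bool]:
    """Return True (good), False (bad), or None (event not covered by this SLI)."""
    # Hypothetical eligibility rule: only query requests count toward this SLI.
    if event.get("request.path") != "/query":
        return None
    # Missing fields are treated as failures so gaps in telemetry are not hidden.
    succeeded = event.get("status_code", 500) < 500
    fast_enough = event.get("duration_ms", float("inf")) <= threshold_ms
    return succeeded and fast_enough


def compliance(events: list) -> float:
    """Fraction of eligible events that were good (1.0 if none were eligible)."""
    verdicts = [v for v in map(classify_event, events) if v is not None]
    return sum(verdicts) / len(verdicts) if verdicts else 1.0
```

Because the verdict is computed per event, the same wide event that counts against the SLO can later be queried by any of its other fields when debugging.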

Slide 23

Slide 23 text

Honeycomb's SLOs reflect user value.

Slide 24

Slide 24 text

We make systems humane to run,

Slide 25

Slide 25 text

by ingesting telemetry,

Slide 26

Slide 26 text

enabling data exploration,

Slide 27

Slide 27 text

and empowering engineers.

Slide 28

Slide 28 text

Data storage engine and analytics flow
What Honeycomb does
● Ingests customers’ telemetry
● Indexes on every column
● Enables near-real-time querying on newly ingested data
© 2021 Hound Technology, Inc. All Rights Reserved.

Slide 29

Slide 29 text

SLOs are user flows
Honeycomb’s SLOs
● home page loads quickly (99.9%)
● user-run queries are fast (99%)
● customer data gets ingested fast (99.99%)

Slide 30

Slide 30 text

Service-Level Objectives
● Example Service-Level Indicators:
  ○ 99.9% of queries succeed within 10 seconds over a period of 30 days.
  ○ 99.99% of events are processed without error in 5 ms over 30 days.
● 99.9% ≈ 43 minutes of violation in a month.
● 99.99% ≈ 4.3 minutes of violation in a month.
But services aren't just 100% down or 100% up. DEGRADATION IS UR FRIEND
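
The minute figures follow directly from the targets: a 30-day window has 43,200 minutes, and the budget is whatever share of it the target leaves over. A minimal sketch of that arithmetic follows. The function names are made up for illustration, and the partial-degradation helper assumes evenly distributed traffic, which the slides do not state.

```python
def allowed_violation_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of total violation the target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


def minutes_until_exhausted(target_percent: float, bad_fraction: float,
                            window_days: int = 30) -> float:
    """How long a partial degradation can last before the budget is spent.

    bad_fraction is the share of requests failing, e.g. 0.1 for 10%.
    Assumes traffic is spread evenly over time (an illustrative assumption).
    """
    if not 0.0 < bad_fraction <= 1.0:
        raise ValueError("bad_fraction must be in (0, 1]")
    return allowed_violation_minutes(target_percent, window_days) / bad_fraction


print(round(allowed_violation_minutes(99.9), 1))
print(round(allowed_violation_minutes(99.99), 2))
print(round(minutes_until_exhausted(99.9, 0.1)))
```

This is why degradation is a friend: under these assumptions, a failure that hits 10% of requests spends the 99.9% budget about ten times more slowly than a full outage, roughly 432 minutes instead of 43.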

Slide 31

Slide 31 text

Data-driven decisions and tradeoffs.

Slide 32

Slide 32 text

Should we invest in more reliability?

Slide 33

Slide 33 text

Is it safe to do this risky experiment?

Slide 34

Slide 34 text

How to stay within SLO. Simple answers, then more complicated answers.

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Repeatable infrastructure with code.

Slide 45

Slide 45 text

If infra is code, we can use CI & flags!
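
As a rough illustration of what a flag on infrastructure can look like, here is a minimal sketch of a percentage rollout keyed on a host or tenant identifier. The function name, the flag name, and the hashing scheme are all assumptions made for this example. The slides do not describe which flagging system Honeycomb uses or how it buckets.

```python
import hashlib


def flag_enabled(flag: str, unit_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether a flag is on for one host or tenant.

    Hashing flag + unit_id gives each unit a stable bucket in [0, 100), so
    raising rollout_percent only ever adds units and never flips one back off.
    """
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


# Hypothetical usage: ramp a new code path across 20 hosts at 10%.
hosts = [f"host-{i}" for i in range(20)]
canaries = [h for h in hosts if flag_enabled("new-ingest-path", h, 10)]
```

Setting the percentage back to 0 turns the change off everywhere without a deploy, which is what makes a flag useful as a first mitigation.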

Slide 46

Slide 46 text

Ephemeral fleets & autoscaling.

Slide 47

Slide 47 text

Quarantine bad traffic. It is possible to both do some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools. What, like you were ever going to find those bugs in staging?

Slide 48

Slide 48 text

Validating our expectations

Slide 49

Slide 49 text

Experiment using error budgets.
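
A minimal sketch of how an error budget can gate an experiment: compute how much budget is left, estimate what the experiment might cost, and go ahead only if a reserve remains afterwards. The function names and the 25% reserve are illustrative assumptions, not a policy described in the talk.

```python
def budget_remaining(total_events: int, bad_events: int, target_percent: float) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be between 0 and 100, exclusive")
    allowed_bad = total_events * (100.0 - target_percent) / 100.0
    if allowed_bad == 0:
        raise ValueError("no events in the window, so there is no budget to measure")
    return 1.0 - bad_events / allowed_bad


def safe_to_experiment(remaining: float, expected_cost: float, reserve: float = 0.25) -> bool:
    """Proceed only if the experiment's worst case still leaves the reserve intact.

    expected_cost and reserve are fractions of the whole budget; the default
    reserve of 0.25 is a hypothetical choice for this example.
    """
    return remaining - expected_cost >= reserve
```

With 1,000,000 events at a 99.9% target and 250 bad events so far, about 75% of the budget remains, so an experiment expected to cost 20% would pass this check, while the same experiment with only 30% remaining would not.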

Slide 50

Slide 50 text

Always ensure safety.

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Data persistence is tricky.

Slide 53

Slide 53 text

Stateless request processing. Stateful data storage.

Slide 54

Slide 54 text

Slide 55

Slide 55 text

[Diagram: an event batch of single events; three partition queues, each holding single events; three indexing workers, each with field indexes; S3.]

Slide 56

Slide 56 text

Infrequent changes.

Slide 57

Slide 57 text

Data integrity and consistency.

Slide 58

Slide 58 text

Delicate failover dances

Slide 59

Slide 59 text

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Slide 62

Slide 62 text

[Diagram: an event batch of single events; three partition queues, each holding single events; two indexing workers and one indexing replay, each with field indexes; S3.]

Slide 63

Slide 63 text

Experimenting in prod

Slide 64

Slide 64 text

Restart one server & service at a time. The goal is to test, not to destroy.

Slide 65

Slide 65 text

At 3pm, not at 3am.

Slide 66

Slide 66 text

"Bugs are shallow with more eyes."

Slide 67

Slide 67 text

Monitor for changes using SLIs. Monitoring isn’t a bad word, it just isn’t observability. SLOs are a modern form of monitoring.

Slide 68

Slide 68 text

Debug with observability.

Slide 69

Slide 69 text

Test the telemetry too!

Slide 70

Slide 70 text

Verify fixes by repeating.

Slide 71

Slide 71 text

Slide 72

Slide 72 text

Slide 73

Slide 73 text

[Diagram: two alerting workers and a ZooKeeper cluster.] Yes, it is 2022 and people are still running ZooKeeper. People like us.

Slide 74

Slide 74 text

[Diagram: two alerting workers and a ZooKeeper cluster.]

Slide 75

Slide 75 text

[Diagram: two alerting workers and a ZooKeeper cluster.]

Slide 76

Slide 76 text

De-risk with design & automation.

Slide 77

Slide 77 text

[Diagram: three partition queues, each holding single events; six indexing workers, each with field indexes; S3.]

Slide 78

Slide 78 text

Continuously verify to stop regression.

Slide 79

Slide 79 text

Save money with flexibility.

Slide 80

Slide 80 text

ARM64 hosts. Spot instances.

Slide 81

Slide 81 text

Slide 82

Slide 82 text

Not every experiment succeeds. But you can mitigate the risks.

Slide 83

Slide 83 text

Three case studies of failure
● Ingest service crash
● Kafka instability
● Query performance degradation
and what we learned from each.

Slide 84

Slide 84 text

1) Shepherd: ingest API service
Shepherd is the gateway to all ingest
● highest-traffic service
● stateless service
● cares about throughput first, latency close second
● used compressed JSON
● gRPC was needed.

Slide 85

Slide 85 text

Honeycomb Ingest Outage
● In November, we were working on OTLP and gRPC ingest support
● Let a commit deploy that attempted to bind to a privileged port
● Stopped the deploy in time, but scale-ups were trying to use the new build
● Latency shot up, took more than 10 minutes to remediate, blew our SLO

Slide 86

Slide 86 text

Slide 87

Slide 87 text

Slide 88

Slide 88 text

2) Kafka: data bus
Kafka provides durability
● Decoupling components provides safety.
● But introduces new dependencies.
● And things that can go wrong.

Slide 89

Slide 89 text

Our month of Kafka pain
Longtime Confluent Kafka users
First to use Kafka on Graviton2 at scale
Changed multiple variables at once
● move to tiered storage
● i3en → c6gn
● AWS Nitro
Read more: go.hny.co/kafka-lessons

Slide 90

Slide 90 text

Unexpected constraints
We thrashed multiple dimensions. We tickled hypervisor bugs. We tickled EBS bugs. Burning our people out wasn't worth it.
Read more: go.hny.co/kafka-lessons

Slide 91

Slide 91 text

Take care of your people
Existing incident response practices
● Escalate when you need a break / hand-off
● Remind (or enforce) time off work to make up for off-hours incident response
Official Honeycomb policy
● Incident responders are encouraged to expense meals for themselves and family during an incident

Slide 92

Slide 92 text

Optimize for safety
Ensure people don’t feel rushed.
Complexity multiplies
● if a software program change takes t hours,
● software system change takes 3t hours
● software product change also takes 3t hours
● software system product change = 9t hours
Maintain tight feedback loops, but not everything has an immediate impact.
Source: Code Complete, 2nd Ed.
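
The multipliers on this slide compose: being part of a system and being a product each triple the effort, so a change that is both costs nine times the baseline. A tiny sketch of that rule of thumb follows. The function name and parameters are made up for illustration, and the factors are the slide's estimates rather than measurements.

```python
def estimated_hours(t: float, is_system: bool = False, is_product: bool = False) -> float:
    """Scale a baseline estimate t by the slide's 3x system and 3x product factors."""
    factor = (3 if is_system else 1) * (3 if is_product else 1)
    return t * factor


# A 2-hour program change becomes 6 hours as a system change or as a product
# change, and 18 hours when it is both.
baseline = 2
estimates = {
    "program": estimated_hours(baseline),
    "system": estimated_hours(baseline, is_system=True),
    "product": estimated_hours(baseline, is_product=True),
    "system product": estimated_hours(baseline, is_system=True, is_product=True),
}
```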

Slide 93

Slide 93 text

3) Retriever: query service
Retriever is performance-critical
● It calls to Lambda for parallel compute
● Lambda use exploded.
● Could we address performance & cost?
● Maybe.

Slide 94

Slide 94 text

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Slide 97

Slide 97 text

Fast and reliable: pick both! Go faster, safely.

Slide 98

Slide 98 text

Takeaways
● Design for reliability through full lifecycle.
● Feature flags can keep us within SLO, most of the time.
● But even when they can't, find other ways to mitigate risk.
● Discovering & spreading out risk improves customer experiences.
● Black swans happen; SLOs are a guideline, not a rule.

Slide 99

Slide 99 text

Acknowledge hidden risks
Examples of hidden risks
● Operational complexity
● Existing tech debt
● Vendor code and architecture
● Unexpected dependencies
● SSL certificates
● DNS
Discover early and often through testing.

Slide 100

Slide 100 text

Slide 101

Slide 101 text

Takeaways
● We are part of sociotechnical systems: customers, engineers, stakeholders
● Outages and failed experiments are unscheduled learning opportunities
● Nothing happens without discussions between different people and teams
● Testing in production is fun AND good for customers
● Where should you start? DELIVERY TIME

Slide 102

Slide 102 text

Understand & control production. Go faster on stable infra. Manage risk and iterate.