Observability and the Glorious Future (with Liz Fong-Jones)

Charity Majors

November 17, 2022

Transcript

  1. Observability and the Glorious Future. Charity Majors (slides by Liz Fong-Jones), CTO, Honeycomb, @mipsytipsy, at Infrastructure & Ops Superstream: Observability. With illustrations by @emilywithcurls!
  2. Observability is evolving quickly. “Your bugs are evolving faster”
  3. And the problem space is complex. [Diagram labels: Actions, Data, Outcomes; Instrument, Query; Operational resilience, Managed tech debt, Quality code, Predictable release, User insight.] Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.
  4. Practitioners need velocity, reliability, & scalability. You DO NOT ACTUALLY KNOW if your code is working or not until you have observed it in production.
  5. A small but growing team builds Honeycomb.
  6. We deploy with confidence.
  7. (no text)
  8. When it comes to software, speed is safety. Like ice skating, or bicycling. Speed up, gets easier. Slow down, gets wobblier.
  9. All while traffic has surged 3-5x in a year.
  10. Write workload, trailing year
  11. Read workload, trailing year
  12. Our confidence recipe:
  13. Quantify reliability. “Always up” isn’t a number, dude. And if you think you’re “always up,” your telemetry is terrible.
  14. Identify potential areas of risk. So many teams never look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol.
  15. Design experiments to probe risk. Outages are just experiments you didn’t think of yet :D
  16. Prioritize addressing risks.
  17. Measuring reliability:
  18. How broken is “too broken”?
  19. Service Level Objectives (SLOs): Define and measure success! Popularized by Google, widely adopted now! © 2021 Hound Technology, Inc. All Rights Reserved.
  20. SLOs are common language. SLOs are the APIs between teams that allow you to budget and plan instead of reacting and arguing. Loose coupling FTW!
  21. Think in terms of events in context. P.S. If you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability. Rich context is the beating heart of observability.
  22. Is this event good or bad?
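One way to make the good-or-bad question concrete is to write it as a predicate over a structured event and then take the ratio of good events. The sketch below is only an illustration: the `Event` fields, the `is_good` rule, and the 5xx cutoff are assumptions made for the example, and the 10-second threshold is borrowed from the example query indicator on slide 30. It is not Honeycomb's implementation.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A hypothetical structured event with two of its many possible fields."""
    status_code: int
    duration_ms: float


def is_good(event: Event, max_duration_ms: float = 10_000.0) -> bool:
    """Assumed rule: good means no server error and finished within the threshold."""
    return event.status_code < 500 and event.duration_ms <= max_duration_ms


def sli(events: list[Event]) -> float:
    """Fraction of events judged good; an empty window counts as fully good."""
    if not events:
        return 1.0
    return sum(is_good(e) for e in events) / len(events)


sample = [
    Event(200, 120.0),     # fast success: good
    Event(200, 12_000.0),  # success but slower than 10 s: bad
    Event(500, 50.0),      # fast but failed: bad
    Event(200, 900.0),     # good
]
print(f"SLI over sample window: {sli(sample):.2%}")
```

Keeping the rule as a plain function over one event means the same definition can be reused for alerting, dashboards, and ad hoc debugging queries.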

  23. Honeycomb's SLOs reflect user value.
  24. We make systems humane to run,
  25. by ingesting telemetry,
  26. enabling data exploration,
  27. and empowering engineers.
  28. What Honeycomb does: data storage engine and analytics flow
    • Ingests customers’ telemetry
    • Indexes on every column
    • Enables near-real-time querying on newly ingested data
  29. SLOs are user flows. Honeycomb’s SLOs:
    • home page loads quickly (99.9%)
    • user-run queries are fast (99%)
    • customer data gets ingested fast (99.99%)
  30. Service-Level Objectives
    • Example Service-Level Indicators:
      ◦ 99.9% of queries succeed within 10 seconds over a period of 30 days.
      ◦ 99.99% of events are processed without error in 5 ms over 30 days.
    • 99.9% ≈ 43 minutes of violation in a month.
    • 99.99% ≈ 4.3 minutes of violation in a month.
    But services aren't just 100% down or 100% up. DEGRADATION IS UR FRIEND
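The "minutes of violation" figures on slide 30 follow from simple arithmetic: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and the budget is the fraction of that window the target leaves uncovered. A minimal sketch, where the function name `allowed_violation_minutes` is made up for this example:

```python
def allowed_violation_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of violation that an availability-style target permits per window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes for 30 days
    return (1.0 - target) * window_minutes


for target in (0.999, 0.9999):
    minutes = allowed_violation_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of violation")
```

This treats the service as either fully up or fully down, which is exactly the simplification the slide warns about. An event-based SLO spends the same budget on partial degradation.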
  31. Data-driven decisions and tradeoffs.
  32. Should we invest in more reliability?
  33. Is it safe to do this risky experiment?
  34. How to stay within SLO: simple answers, then more complicated answers.
  35. Accelerate: State of DevOps 2021
  36. What's our recipe?
  37. Instrument as we code.
  38. Functional and visual testing.
  39. Design for feature flag deployment.
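The deck does not show how Honeycomb's flags are implemented. As a rough illustration of what designing for feature flag deployment can mean in code, here is a sketch of a deterministic percentage rollout. The function name `flag_enabled`, the flag name, and the hashing scheme are all assumptions for this example and do not describe any particular flag product.

```python
import hashlib


def flag_enabled(flag: str, unit_id: str, rollout_percent: float) -> bool:
    """Hash (flag, unit) into one of 10,000 buckets and enable the lowest ones.

    The same unit always gets the same answer for a given flag, so a user does
    not flip between old and new code paths from request to request.
    """
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100


def handle_request(user_id: str) -> str:
    # New code ships dark and is turned on for a slice of traffic.
    if flag_enabled("new-ingest-path", user_id, rollout_percent=10.0):
        return "new path"
    return "old path"


enabled = sum(flag_enabled("new-ingest-path", f"user-{i}", 10.0) for i in range(10_000))
print(f"{enabled} of 10000 hypothetical users see the new path")
```

Because the deploy and the release are separated, lowering `rollout_percent` to zero acts as a rollback without shipping a new build.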
  40. Automated integration & human review.
  41. Green button merge.
  42. Auto-updates, rollbacks, & pins.
  43. Observe behavior in prod. No Friday Deploys Don’t Merge and Run!
  44. Repeatable infrastructure with code.
  45. If infra is code, we can use CI & flags!
  46. Ephemeral fleets & autoscaling.
  47. Quarantine bad traffic. It is possible to both do some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools. What, like you were ever going to find those bugs in staging?
  48. Validating our expectations
  49. Experiment using error budgets.
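A simple way to read "experiment using error budgets" is as a gate: estimate how many bad events an experiment might cause and check whether that still fits in what the SLO allows for the window. The sketch below assumes an event-based SLO. The function names and the 25% safety reserve are illustrative choices, not values from the talk.

```python
def allowed_bad_events(total_events: int, target: float) -> float:
    """Bad events the SLO tolerates in the window, e.g. 0.1% of events at 99.9%."""
    return (1.0 - target) * total_events


def budget_remaining(total_events: int, bad_events: int, target: float) -> float:
    """Unspent fraction of the error budget; negative means the SLO is blown."""
    allowed = allowed_bad_events(total_events, target)
    if allowed <= 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - bad_events / allowed


def fits_in_budget(total_events: int, bad_events: int, target: float,
                   expected_extra_bad: int, reserve: float = 0.25) -> bool:
    """True if the experiment's expected damage still leaves `reserve` unspent."""
    allowed = allowed_bad_events(total_events, target)
    return bad_events + expected_extra_bad <= (1.0 - reserve) * allowed


# 1M events in the window at 99.9% allows roughly 1,000 bad events.
print(budget_remaining(1_000_000, 200, 0.999))
print(fits_in_budget(1_000_000, 200, 0.999, expected_extra_bad=300))
print(fits_in_budget(1_000_000, 600, 0.999, expected_extra_bad=300))
```

Holding back a reserve reflects the point that unplanned outages also draw on the same budget.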

  50. Always ensure safety.
  51. (no text)
  52. Data persistence is tricky.
  53. Stateless request processing; stateful data storage
  54. (no text)
  55. [Diagram labels: Event batch (3 single events); 3 partition queues (3 single events each); 3 indexing workers (3 field indexes each); S3]
  56. Infrequent changes.
  57. Data integrity and consistency.
  58. Delicate failover dances
  59. [Diagram: same labels as slide 55]
  60. [Diagram: same labels as slide 55]
  61. [Diagram: same labels as slide 55]
  62. [Diagram: same labels as slide 55, with one “Indexing worker” replaced by “Indexing replay”]
  63. Experimenting in prod
  64. Restart one server & service at a time. The goal is to test, not to destroy.
  65. At 3pm, not at 3am.
  66. "Bugs are shallow with more eyes."
  67. Monitor for changes using SLIs. Monitoring isn’t a bad word, it just isn’t observability. SLOs are a modern form of monitoring.
  68. Debug with observability.
  69. Test the telemetry too!
  70. Verify fixes by repeating.

  71. [Diagram: same labels as slide 55]
  72. [Diagram: same labels as slide 55] FORESHADOWING
  73. [Diagram: two alerting workers, ZooKeeper cluster] Yes, it is 2022 and people are still running ZooKeeper. People like us.
  74. [Diagram: two alerting workers, ZooKeeper cluster]
  75. [Diagram: two alerting workers, ZooKeeper cluster]
  76. De-risk with design & automation.
  77. [Diagram labels: 3 partition queues (3 single events each); 6 indexing workers (3 field indexes each); S3]
  78. Continuously verify to stop regression.
  79. Save money with flexibility.
  80. ARM64 hosts. Spot instances.
  81. Non-trivial savings. [Chart: Production Shepherd EC2 cost, grouped by instance type]
  82. Not every experiment succeeds. But you can mitigate the risks.
  83. Three case studies of failure, and what we learned from each:
    • Ingest service crash
    • Kafka instability
    • Query performance degradation
  84. 1) Shepherd: ingest API service. Shepherd is the gateway to all ingest:
    • highest-traffic service
    • stateless service
    • cares about throughput first, latency close second
    • used compressed JSON
    • gRPC was needed.
  85. Honeycomb Ingest Outage
    • In November, we were working on OTLP and gRPC ingest support
    • Let a commit deploy that attempted to bind to a privileged port
    • Stopped the deploy in time, but scale-ups were trying to use the new build
    • Latency shot up, took more than 10 minutes to remediate, blew our SLO
  86. Now what?
    • We could freeze deploys (oh no, don’t do this!)
    • Delay the launch? We considered this...
    • Get creative!
  87. Reduce Risk
  88. 2) Kafka: data bus. Kafka provides durability.
    • Decoupling components provides safety.
    • But introduces new dependencies.
    • And things that can go wrong.
  89. Our month of Kafka pain (read more: go.hny.co/kafka-lessons)
    Longtime Confluent Kafka users. First to use Kafka on Graviton2 at scale. Changed multiple variables at once:
    • move to tiered storage
    • i3en → c6gn
    • AWS Nitro
  90. Unexpected constraints (read more: go.hny.co/kafka-lessons)
    We thrashed multiple dimensions. We tickled hypervisor bugs. We tickled EBS bugs. Burning our people out wasn't worth it.
  91. Take care of your people
    Existing incident response practices:
    • Escalate when you need a break / hand-off
    • Remind (or enforce) time off work to make up for off-hours incident response
    Official Honeycomb policy:
    • Incident responders are encouraged to expense meals for themselves and family during an incident
  92. Optimize for safety
    Ensure people don’t feel rushed. Complexity multiplies:
    • if a software program change takes t hours,
    • software system change takes 3t hours
    • software product change also takes 3t hours
    • software system product change = 9t hours
    Maintain tight feedback loops, but not everything has an immediate impact. Source: Code Complete, 2nd Ed.
  93. 3) Retriever: query service. Retriever is performance-critical.
    • It calls to Lambda for parallel compute
    • Lambda use exploded.
    • Could we address performance & cost?
    • Maybe.
  94. (no text)
  95. (no text)
  96. Making progress carefully
  97. Fast and reliable: pick both! Go faster, safely.
  98. Takeaways
    • Design for reliability through full lifecycle.
    • Feature flags can keep us within SLO, most of the time.
    • But even when they can't, find other ways to mitigate risk.
    • Discovering & spreading out risk improves customer experiences.
    • Black swans happen; SLOs are a guideline, not a rule.
  99. Acknowledge hidden risks
    Examples of hidden risks:
    • Operational complexity
    • Existing tech debt
    • Vendor code and architecture
    • Unexpected dependencies
    • SSL certificates
    • DNS
    Discover early and often through testing.
  100. Make experimentation routine!
  101. Takeaways
    • We are part of sociotechnical systems: customers, engineers, stakeholders
    • Outages and failed experiments are unscheduled learning opportunities
    • Nothing happens without discussions between different people and teams
    • Testing in production is fun AND good for customers
    • Where should you start? DELIVERY TIME
  102. Understand & control production. Go faster on stable infra. Manage risk and iterate.
  103. Find out more: Read our blog! hny.co/blog We're hiring! hny.co/careers
  104. www.honeycomb.io