Observability and the Glorious Future (with Liz Fong-Jones)


Charity Majors

November 17, 2022

Transcript

  1. V6-21 Charity Majors (slides by Liz Fong-Jones), CTO, Honeycomb, @mipsytipsy, at Infrastructure & Ops Superstream: Observability. Observability and the Glorious Future, w/ illustrations by @emilywithcurls!
  2. [Diagram labels: INSTRUMENT, QUERY, OPERATIONAL RESILIENCE, MANAGED TECH DEBT, QUALITY CODE, PREDICTABLE RELEASE, USER INSIGHT; Outcomes, Actions, DATA] And the problem space is complex. Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.
  3. Practitioners need velocity, reliability, & scalability. You DO NOT ACTUALLY KNOW if your code is working or not until you have observed it in production.
  4. When it comes to software, speed is safety. Like ice skating, or bicycling. Speed up, gets easier. Slow down, gets wobblier.
  5. Quantify reliability. “Always up” isn’t a number, dude. And if you think you’re “always up,” your telemetry is terrible.
  6. Identify potential areas of risk. So many teams never look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol.
  7. © 2021 Hound Technology, Inc. All Rights Reserved. Service Level Objectives (SLOs): Define and measure success! Popularized by Google, widely adopted now!
  8. SLOs are common language. SLOs are the APIs between teams that allow you to budget and plan instead of reacting and arguing. Loose coupling FTW!
  9. Think in terms of events in context. P.S. If you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability. Rich context is the beating heart of observability.
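To make “arbitrarily-wide structured event” concrete, here is a minimal sketch in Python: one event per unit of work, carrying as many fields as the code has context for, emitted as a single JSON object. Every field name and value below is hypothetical and chosen for illustration; none of them come from the talk or from any particular vendor's schema.

```python
import json
import time


def build_event(request, response, duration_ms):
    """Collect everything known about one request into a single flat event."""
    event = {
        "timestamp": time.time(),
        "service.name": "example-api",  # hypothetical service name
        "duration_ms": duration_ms,
        "http.method": request["method"],
        "http.path": request["path"],
        "http.status_code": response["status"],
        "user.id": request.get("user_id"),
        "build.id": request.get("build_id"),
    }
    # Width is the point: merge in any extra context the code has on hand.
    event.update(request.get("extra_context", {}))
    return event


def emit(event):
    """Serialize the event as one JSON line, ready to ship to a backend."""
    return json.dumps(event, sort_keys=True)


example_event = build_event(
    request={
        "method": "POST",
        "path": "/checkout",
        "user_id": "user-123",
        "build_id": "build-456",
        "extra_context": {"feature_flag.new_parser": True, "batch.size": 50},
    },
    response={"status": 200},
    duration_ms=12.5,
)
print(emit(example_event))
```

Because the event is one flat record rather than several separate log lines, any field can later be used to filter or group without re-joining data.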
  10. What Honeycomb does
    • Ingests customers’ telemetry
    • Indexes on every column
    • Enables near-real-time querying on newly ingested data
    Data storage engine and analytics flow
  11. SLOs are user flows. Honeycomb’s SLOs:
    • home page loads quickly (99.9%)
    • user-run queries are fast (99%)
    • customer data gets ingested fast (99.99%)
  12. Service-Level Objectives
    • Example Service-Level Indicators:
      ◦ 99.9% of queries succeed within 10 seconds over a period of 30 days.
      ◦ 99.99% of events are processed without error in 5 ms over 30 days.
    • 99.9% ≈ 43 minutes of violation in a month.
    • 99.99% ≈ 4.3 minutes of violation in a month.
    But services aren't just 100% down or 100% up. DEGRADATION IS UR FRIEND
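The 43-minute and 4.3-minute figures follow from simple arithmetic on a 30-day window (30 × 24 × 60 = 43,200 minutes). A minimal sketch in Python; the function name `error_budget_minutes` is made up for illustration and does not come from the talk:

```python
def error_budget_minutes(target_percent, window_days=30):
    """Minutes of allowed violation for an SLO target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


for target in (99.9, 99.99):
    budget = error_budget_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of violation")
```

As the slide notes, real services are rarely fully down or fully up, so in practice the budget tends to be spent by partial degradation rather than by one contiguous outage.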
  13. Observe behavior in prod. No Friday Deploys. Don’t Merge and Run!
  14. Quarantine bad traffic. It is possible to both do some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools. What, like you were ever going to find those bugs in staging?
  15-18. [Architecture diagram, repeated across four slides: event batch, single events, partition queues, indexing workers, field indexes, S3]
  19. [Same architecture diagram, with one indexing worker labeled “Indexing replay”]
  20. Restart one server & service at a time. The goal is to test, not to destroy.
  21. Monitor for changes using SLIs. Monitoring isn’t a bad word, it just isn’t observability. SLOs are a modern form of monitoring.
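One way to read “monitor for changes using SLIs” is to compute the share of good events and compare it with the target. The sketch below assumes the query SLI quoted on slide 12 (a query is good if it succeeds within 10 seconds, with a 99.9% target); the `QueryEvent` type and both function names are hypothetical, and this is not a description of how Honeycomb evaluates its SLOs.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    succeeded: bool
    duration_s: float


def sli_percent(events, max_duration_s=10.0):
    """Percentage of events that are good: succeeded within the threshold."""
    if not events:
        return 100.0  # design choice: no traffic means nothing was violated
    good = sum(1 for e in events if e.succeeded and e.duration_s <= max_duration_s)
    return 100.0 * good / len(events)


def meets_slo(events, target_percent=99.9, max_duration_s=10.0):
    """True if the measured SLI is at or above the SLO target."""
    return sli_percent(events, max_duration_s) >= target_percent


# 998 fast successes, one slow query, one failure.
sample = [QueryEvent(True, 0.4)] * 998 + [QueryEvent(True, 12.0), QueryEvent(False, 0.2)]
print(f"SLI: {sli_percent(sample):.2f}%, meets SLO: {meets_slo(sample)}")
```

Counting good events rather than checking whether the service is "up" is what lets partial degradation show up in the number.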
  22. [Architecture diagram: event batch, single events, partition queues, indexing workers, field indexes, S3]
  23. [Same architecture diagram, with the caption] FORESHADOWING
  24. [Diagram: two alerting workers, Zookeeper cluster] Yes, it is 2022 and people are still running ZooKeeper. People like us.
  25. [Architecture diagram: three partition queues of single events, six indexing workers each with field indexes, S3]
  26. Non-trivial savings. [Chart: production Shepherd EC2 cost, grouped by instance type]
  27. Three case studies of failure, and what we learned from each.
    • Ingest service crash
    • Kafka instability
    • Query performance degradation
  28. 1) Shepherd: ingest API service. Shepherd is the gateway to all ingest:
    • highest-traffic service
    • stateless service
    • cares about throughput first, latency close second
    • used compressed JSON
    • gRPC was needed.
  29. Honeycomb Ingest Outage
    • In November, we were working on OTLP and gRPC ingest support
    • Let a commit deploy that attempted to bind to a privileged port
    • Stopped the deploy in time, but scale-ups were trying to use the new build
    • Latency shot up, took more than 10 minutes to remediate, blew our SLO
  30. Now what?
    • We could freeze deploys (oh no, don’t do this!)
    • Delay the launch? We considered this...
    • Get creative!
  31. 2) Kafka: data bus. Kafka provides durability.
    • Decoupling components provides safety.
    • But introduces new dependencies.
    • And things that can go wrong.
  32. Our month of Kafka pain (Read more: go.hny.co/kafka-lessons)
    Longtime Confluent Kafka users. First to use Kafka on Graviton2 at scale. Changed multiple variables at once:
    • move to tiered storage
    • i3en → c6gn
    • AWS Nitro
  33. Unexpected constraints (Read more: go.hny.co/kafka-lessons) We thrashed multiple dimensions. We tickled hypervisor bugs. We tickled EBS bugs. Burning our people out wasn't worth it.
  34. Existing incident response practices
    • Escalate when you need a break / hand-off
    • Remind (or enforce) time off work to make up for off-hours incident response
    Official Honeycomb policy
    • Incident responders are encouraged to expense meals for themselves and family during an incident
    Take care of your people
  35. Ensure people don’t feel rushed. Complexity multiplies:
    • if a software program change takes t hours,
    • software system change takes 3t hours
    • software product change also takes 3t hours
    • software system product change = 9t hours
    Maintain tight feedback loops, but not everything has an immediate impact. Optimize for safety. Source: Code Complete, 2nd Ed.
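The multipliers on this slide compose: making a change part of a system triples the effort, making it part of a product triples it again, and doing both gives 3 × 3 = 9. A small worked sketch in Python; the function name and flags are illustrative only, and the factor of 3 is taken directly from the slide:

```python
def estimated_hours(t, is_system=False, is_product=False):
    """Scale a base estimate of t hours by the slide's 3x multipliers."""
    hours = t
    if is_system:
        hours *= 3  # program -> system
    if is_product:
        hours *= 3  # program -> product
    return hours


base = 2  # hours for a plain program change (example value)
print(estimated_hours(base))                                    # program
print(estimated_hours(base, is_system=True))                    # system
print(estimated_hours(base, is_product=True))                   # product
print(estimated_hours(base, is_system=True, is_product=True))   # system product
```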
  36. 3) Retriever: query service. Retriever is performance-critical.
    • It calls to Lambda for parallel compute
    • Lambda use exploded.
    • Could we address performance & cost?
    • Maybe.
  37. Takeaways
    • Design for reliability through full lifecycle.
    • Feature flags can keep us within SLO, most of the time.
    • But even when they can't, find other ways to mitigate risk.
    • Discovering & spreading out risk improves customer experiences.
    • Black swans happen; SLOs are a guideline, not a rule.
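The talk does not show how Honeycomb's feature flags are implemented. As a generic illustration of how a flag can limit the blast radius of a risky change, here is a minimal sketch of a percentage-based rollout in Python: each unit (user, team, dataset) hashes to a stable bucket from 0 to 99, and the flag is on only for buckets below the rollout percentage. The flag name and unit IDs are hypothetical.

```python
import hashlib


def bucket(flag_name, unit_id):
    """Map a (flag, unit) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, unit_id, rollout_percent):
    """True if this unit falls inside the current rollout percentage."""
    return bucket(flag_name, unit_id) < rollout_percent


units = [f"team-{i}" for i in range(1000)]
for percent in (0, 5, 50, 100):
    enabled = sum(is_enabled("new-ingest-path", u, percent) for u in units)
    print(f"rollout {percent}%: enabled for {enabled} of {len(units)} units")
```

Because the bucket is derived from a hash rather than a random draw, a unit that is enabled at 5% stays enabled as the percentage grows, and setting the percentage back to 0 turns the change off for everyone without a deploy.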
  38. Examples of hidden risks
    • Operational complexity
    • Existing tech debt
    • Vendor code and architecture
    • Unexpected dependencies
    • SSL certificates
    • DNS
    Discover early and often through testing. Acknowledge hidden risks.
  39. Takeaways
    • We are part of sociotechnical systems: customers, engineers, stakeholders
    • Outages and failed experiments are unscheduled learning opportunities
    • Nothing happens without discussions between different people and teams
    • Testing in production is fun AND good for customers
    • Where should you start?
    [Illustration label, repeated three times: DELIVERY TIME]