Observability and the Glorious Future (with Liz Fong-Jones)

Charity Majors

November 17, 2022

Transcript

  1. V6-21
    Charity Majors (slides by Liz Fong-Jones)
    CTO, Honeycomb
    @mipsytipsy at Infrastructure & Ops Superstream: Observability
    Observability
    And the Glorious Future
    w/ illustrations by @emilywithcurls!

  2. V6-21
    Observability is evolving quickly.
    “Your bugs are
    evolving faster”

  3. V6-21
    And the problem space is complex.
    [Diagram labels: Instrument, Query, Data, Actions, Outcomes, Operational resilience, Managed tech debt, Quality code, Predictable release, User insight]
    Anyone who tells you that you can
    just “buy their tool” and get a
    high-performing engineering team
    is selling you something stupid.

  4. V6-21
    Practitioners need velocity, reliability, & scalability.
    You DO NOT ACTUALLY KNOW if
    your code is working or not until
    you have observed it in production

  5. V6-21
    A small but growing team builds Honeycomb.

  6. V6-21
    We deploy with confidence.

  7. V6-21
    When it comes to software, speed
    is safety. Like ice skating, or
    bicycling.
    Speed up, gets easier. Slow
    down, gets wobblier.

  8. V6-21
    All while traffic has surged 3-5x in a year.

  9. V6-21
    Write workload, trailing year

  10. V6-21
    Read workload, trailing year

  11. V6-21
    Our confidence recipe:

  12. V6-21
    Quantify reliability.
    “Always up” isn’t a number, dude.
    And if you think you’re “always
    up,” your telemetry is terrible.

  13. V6-21
    Identify potential areas of risk.
    So many teams never look at their
    instrumentation until something is
    paging them. That is why they
    suffer. They only respond to heart
    attacks instead of eating
    vegetables and minding their god
    damn cholesterol.

  14. V6-21
    Design experiments to probe risk.
    Outages are just experiments you
    didn’t think of yet :D

  15. V6-21
    Prioritize addressing risks.

  16. V6-21
    Measuring reliability:

  17. V6-21
    How broken is “too broken”?

  18. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Service Level Objectives (SLOs)
    Define and measure success!
    Popularized by Google, widely adopted now!

  19. V6-21
    SLOs are common language.
    SLOs are the APIs between teams
    that allow you to budget and plan
    instead of reacting and arguing.
    Loose coupling FTW!

  20. V6-21
    Think in terms of events in context.
    P.S. if you aren’t thinking in terms
    of (and capturing, and querying)
    arbitrarily-wide structured events,
    you are not doing observability.
    Rich context is the beating heart
    of observability.
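
To make "arbitrarily-wide structured events" concrete, here is a minimal sketch of building one wide event per unit of work. Everything named here (`build_event`, the field names, the sample values) is a hypothetical illustration, not Honeycomb's schema or SDK.

```python
import json
import time


def build_event(request: dict, started: float, status: int) -> dict:
    """Return one wide event describing a single unit of work."""
    event = {
        "timestamp": started,
        "duration_ms": round((time.time() - started) * 1000, 3),
        "http.status": status,
    }
    # Attach every piece of request context to the same event so it can be
    # queried later; the field names are illustrative assumptions.
    for key, value in request.items():
        event[f"request.{key}"] = value
    return event


event = build_event(
    {"path": "/query", "user_id": "u-123", "build_id": "abc123", "flags": ["new-index"]},
    started=time.time(),
    status=200,
)
print(json.dumps(event, sort_keys=True))
```

The point of the shape is that new context becomes one more field on the same event rather than a new metric or log line.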

  21. V6-21
    Is this event good or bad?
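
One way to answer "good or bad?" per event is a small classifier whose verdicts roll up into an SLI. This is a sketch under stated assumptions: the field names (`kind`, `status`, `duration_ms`) are hypothetical, and the 10-second threshold is borrowed from the example query SLO later in the deck.

```python
from typing import Iterable, Optional


def is_good(event: dict, max_ms: float = 10_000) -> Optional[bool]:
    """Return True or False for qualifying events, None for events out of scope."""
    if event.get("kind") != "query":
        return None  # not part of this SLI
    return event["status"] < 500 and event["duration_ms"] <= max_ms


def compliance(events: Iterable[dict]) -> float:
    """Fraction of qualifying events that were good."""
    verdicts = [v for v in map(is_good, events) if v is not None]
    return sum(verdicts) / len(verdicts) if verdicts else 1.0


events = [
    {"kind": "query", "status": 200, "duration_ms": 120.0},     # good
    {"kind": "query", "status": 200, "duration_ms": 15_000.0},  # too slow
    {"kind": "query", "status": 500, "duration_ms": 40.0},      # server error
    {"kind": "ingest", "status": 200, "duration_ms": 3.0},      # out of scope
]
print(compliance(events))
```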

  22. V6-21
    Honeycomb's SLOs reflect user value.

  23. V6-21
    We make systems humane to run,

  24. V6-21
    by ingesting telemetry,

  25. V6-21
    enabling data exploration,

  26. V6-21
    and empowering engineers.

  27. V6-21
    What Honeycomb does
    ● Ingests customer’s telemetry
    ● Indexes on every column
    ● Enables near-real-time querying
    on newly ingested data
    Data storage engine and analytics flow

  28. V6-21
    SLOs are user flows
    Honeycomb’s SLOs
    ● home page loads quickly (99.9%)
    ● user-run queries are fast (99%)
    ● customer data gets ingested fast (99.99%)

  29. V6-21
    Service-Level Objectives
    ● Example Service-Level Indicators:
    ○ 99.9% of queries succeed within 10 seconds over a period of 30 days.
    ○ 99.99% of events are processed without error in 5ms over 30 days.
    ● 99.9% ≈ 43 minutes of violation in a month.
    ● 99.99% ≈ 4.3 minutes of violation in a month.
    but services aren't just 100% down or 100% up.
    DEGRADATION IS UR FRIEND
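
The minute figures above follow from simple arithmetic on a 30-day window. The sketch below reproduces them and also expresses the budget in events, which is one way to account for partial degradation. The function names are illustrative assumptions.

```python
def allowed_violation_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full violation a time-based SLO tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


def allowed_bad_events(target: float, total_events: int) -> int:
    """Bad events an event-based SLO tolerates out of total_events."""
    return round((1.0 - target) * total_events)


for target in (0.999, 0.9999):
    minutes = allowed_violation_minutes(target)
    bad = allowed_bad_events(target, 1_000_000)
    print(f"{target:.2%}: {minutes:.2f} minutes, or {bad} bad events per million")
```

Counting events instead of minutes means a service that fails a small share of requests for a long time spends budget at the same rate as one that fails everything briefly.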

  30. V6-21
    Data-driven decisions and tradeoffs.

  31. V6-21
    Should we invest in more reliability?

  32. V6-21
    Is it safe to do this risky experiment?

  33. V6-21
    How to stay within SLO
    Simple answers, then more complicated answers

  34. V6-21
    Accelerate: State of DevOps 2021

  35. V6-21
    What's our recipe?

  36. V6-21
    Instrument as we code.

  37. V6-21
    Functional and visual testing.

  38. V6-21
    Design for feature flag deployment.
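
A minimal sketch of what designing for feature flags can look like in code: a deterministic percentage rollout that ships the new path dark and enables it for a small slice of traffic. `flag_enabled`, the flag name, and the 5% rollout are hypothetical; a real team would more likely use a feature-flag service.

```python
import hashlib


def flag_enabled(flag: str, unit_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket unit_id into [0, 100) and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
    return bucket < rollout_percent


def handle_request(team_id: str) -> str:
    # The new code path ships alongside the old one and is chosen per team.
    if flag_enabled("new-query-path", team_id, rollout_percent=5.0):
        return "new"
    return "old"


print(handle_request("team-42"))
```

Because the bucket is derived from a hash, the same team always gets the same answer for a given flag, and turning the flag off is a configuration change rather than a deploy.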

  39. V6-21
    Automated integration & human review.

  40. V6-21
    Green button merge.

  41. V6-21
    Auto-updates, rollbacks, & pins.

  42. V6-21
    Observe behavior in prod.
    No Friday Deploys
    Don’t Merge and Run!

  43. V6-21
    Repeatable infrastructure with code.

  44. V6-21
    If infra is code, we can use CI & flags!

  45. V6-21
    Ephemeral fleets & autoscaling.

  46. V6-21
    Quarantine bad traffic.
    It is possible to both do some crazy ass shit in
    production and protect your users from any
    noticeable effects. You just need the right tools.
    What, like you were ever going to find those bugs in
    staging?

  47. V6-21
    Validating our expectations

  48. V6-21
    Experiment using error budgets.
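
A sketch of using the error budget as a gate for experiments: compute how much budget is left and only proceed when a reserve remains. The function names and the 25% reserve are assumptions for illustration, not a policy stated in the deck.

```python
def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget left in the window (negative when overspent)."""
    allowed = (1.0 - target) * total_events
    if allowed == 0:
        return 0.0
    return 1.0 - bad_events / allowed


def safe_to_experiment(target: float, total_events: int, bad_events: int,
                       reserve: float = 0.25) -> bool:
    """Allow a risky experiment only while more than `reserve` of the budget is left."""
    return budget_remaining(target, total_events, bad_events) > reserve


# 99.9% target over one million events with 250 bad events so far.
print(budget_remaining(0.999, 1_000_000, 250))
print(safe_to_experiment(0.999, 1_000_000, 250))
```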

  49. V6-21
    Always ensure safety.

  50. V6-21
    Data persistence is tricky.

  51. V6-21
    Stateless request processing
    Stateful data storage

  52. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]

  53. V6-21
    Infrequent changes.

  54. V6-21
    Data integrity and consistency.

  55. V6-21
    Delicate failover dances

  56. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]

  57. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]

  58. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]

  59. V6-21
    [Diagram: an event batch; single events in three partition queues; two indexing workers and one "indexing replay", each with three field indexes; S3]

  60. V6-21
    Experimenting in prod

  61. V6-21
    Restart one server & service at a time.
    The goal is to test, not
    to destroy.

  62. V6-21
    At 3pm, not at 3am.

  63. V6-21
    "Bugs are shallow with more eyes."

  64. V6-21
    Monitor for changes using SLIs.
    Monitoring isn’t a bad word, it just
    isn’t observability.
    SLOs are a modern form of
    monitoring.

  65. V6-21
    Debug with observability.

  66. V6-21
    Test the telemetry too!

  67. V6-21
    Verify fixes by repeating.

  68. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]

  69. V6-21
    [Diagram: an event batch; single events in three partition queues; three indexing workers, each with three field indexes; S3]
    FORESHADOWING

  70. V6-21
    Alerting worker
    Alerting worker
    ZooKeeper cluster
    Yes, it is 2022 and people
    are still running ZooKeeper.
    People like us.

  71. V6-21
    Alerting worker
    Alerting worker
    ZooKeeper cluster

  72. V6-21
    Alerting worker
    Alerting worker
    ZooKeeper cluster

  73. V6-21
    De-risk with design & automation.

  74. V6-21
    [Diagram: single events in three partition queues; six indexing workers, each with three field indexes; S3]

  75. V6-21
    Continuously verify to stop regression.

  76. V6-21
    Save money with flexibility.

  77. V6-21
    ARM64 hosts
    Spot instances

  78. V6-21
    Non-trivial savings.
    Production Shepherd EC2 cost, grouped by instance type

  79. V6-21
    Not every experiment succeeds.
    But you can mitigate the risks.

  80. V6-21
    Three case studies of failure
    ● Ingest service crash
    ● Kafka instability
    ● Query performance degradation
    and what we learned from each.

  81. V6-21
    1) Shepherd: ingest API service
    Shepherd is the gateway to all ingest
    ● highest-traffic service
    ● stateless service
    ● cares about throughput first, latency a close second
    ● used compressed JSON
    ● gRPC was needed.

  82. V6-21
    Honeycomb Ingest Outage
    ● In November, we were working on OTLP and gRPC ingest support
    ● Let a commit deploy that attempted to bind to a privileged port
    ● Stopped the deploy in time, but scale-ups were trying to use the new build
    ● Latency shot up, took more than 10 minutes to remediate, blew our SLO
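
One guard against this class of failure is a startup or CI check that rejects privileged listen ports before a build can roll out. This is a hedged sketch, not what Honeycomb did: the function names are hypothetical, and 4317 (the conventional OTLP/gRPC port) is used only as an example value.

```python
import socket

# On most Unix-like systems, binding ports below 1024 needs elevated
# privileges by default.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged(port: int) -> bool:
    return 0 < port < PRIVILEGED_PORT_LIMIT


def validate_listen_port(port: int) -> None:
    """Fail fast (for example in CI) instead of failing during a scale-up."""
    if is_privileged(port):
        raise ValueError(f"port {port} is privileged; choose a port >= {PRIVILEGED_PORT_LIMIT}")


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind and immediately release the port, as a canary for startup."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # includes PermissionError for privileged ports
            return False
    return True


validate_listen_port(4317)  # passes silently
```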

  83. V6-21
    Now what?
    ● We could freeze deploys (oh no, don’t do this!)
    ● Delay the launch? We considered this...
    ● Get creative!

  84. V6-21
    Reduce Risk

  85. V6-21
    2) Kafka: data bus
    Kafka provides durability
    ● Decoupling components provides safety.
    ● But introduces new dependencies.
    ● And things that can go wrong.

  86. V6-21
    Our month of Kafka pain
    Read more: go.hny.co/kafka-lessons
    Longtime Confluent Kafka users
    First to use Kafka on Graviton2 at scale
    Changed multiple variables at once
    ● move to tiered storage
    ● i3en → c6gn
    ● AWS Nitro

  87. V6-21
    Unexpected constraints
    Read more: go.hny.co/kafka-lessons
    We thrashed multiple dimensions.
    We tickled hypervisor bugs.
    We tickled EBS bugs.
    Burning our people out wasn't worth it.

  88. V6-21
    Take care of your people
    Existing incident response practices
    ● Escalate when you need a break / hand-off
    ● Remind (or enforce) time off work to make up for off-hours incident response
    Official Honeycomb policy
    ● Incident responders are encouraged to expense meals for themselves and family during an incident

  89. V6-21
    Ensure people don’t feel rushed.
    Complexity multiplies
    ● if a software program change takes t hours,
    ● a software system change takes 3t hours
    ● a software product change also takes 3t hours
    ● a software system product change takes 9t hours
    Maintain tight feedback loops, but not
    everything has an immediate impact.
    Optimize for safety
    Source: Code Complete, 2nd Ed.
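
The multipliers above compose, which is where the 9t comes from. A tiny worked sketch, with a hypothetical function name and an example of t = 2 hours:

```python
def change_hours(t: float, system: bool = False, product: bool = False) -> float:
    """Estimated hours for a change, using the 3x multipliers from the slide."""
    factor = (3 if system else 1) * (3 if product else 1)
    return t * factor


t = 2.0
print(change_hours(t))                             # program: t
print(change_hours(t, system=True))                # system: 3t
print(change_hours(t, product=True))               # product: 3t
print(change_hours(t, system=True, product=True))  # system product: 9t
```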

  90. V6-21
    3) Retriever: query service
    Retriever is performance-critical
    ● It calls to Lambda for parallel compute
    ● Lambda use exploded.
    ● Could we address performance & cost?
    ● Maybe.

  91. V6-21

  92. V6-21

  93. V6-21
    Making progress carefully

  94. V6-21
    Fast and reliable: pick both!
    Go faster, safely.

  95. V6-21
    Takeaways
    ● Design for reliability through full lifecycle.
    ● Feature flags can keep us within SLO, most of the time.
    ● But even when they can't, find other ways to mitigate risk.
    ● Discovering & spreading out risk improves customer experiences.
    ● Black swans happen; SLOs are a guideline, not a rule.

  96. V6-21
    Acknowledge hidden risks
    Examples of hidden risks
    ● Operational complexity
    ● Existing tech debt
    ● Vendor code and architecture
    ● Unexpected dependencies
    ● SSL certificates
    ● DNS
    Discover early and often through testing.

  97. V6-21
    Make experimentation routine!

  98. V6-21
    Takeaways
    ● We are part of sociotechnical systems: customers, engineers, stakeholders
    ● Outages and failed experiments are unscheduled learning opportunities
    ● Nothing happens without discussions between different people and teams
    ● Testing in production is fun AND good for customers
    ● Where should you start? DELIVERY TIME DELIVERY TIME DELIVERY TIME

  99. V6-21
    Understand & control production.
    Go faster on stable infra.
    Manage risk and iterate.

  100. V6-21
    Find out more
    Read our blog! hny.co/blog
    We're hiring! hny.co/careers

  101. V6-21
    www.honeycomb.io
