Observability and the Glorious Future (with Liz Fong-Jones)

Charity Majors

November 17, 2022

Transcript

  1. V6-21
    Charity Majors (slides by Liz Fong-Jones)
    CTO, Honeycomb
    @mipsytipsy at Infrastructure & Ops Superstream: Observability
    Observability
    And the Glorious Future
    w/ illustrations by @emilywithcurls!

  2.
    Observability is evolving quickly.
    “Your bugs are evolving faster”

  3.
    [Diagram labels: INSTRUMENT, QUERY, DATA, Actions, Outcomes, OPERATIONAL RESILIENCE, MANAGED TECH DEBT, QUALITY CODE, PREDICTABLE RELEASE, USER INSIGHT]
    And the problem space is complex.
    Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.

  4.
    Practitioners need velocity, reliability, & scalability.
    You DO NOT ACTUALLY KNOW if your code is working or not until you have observed it in production.

  5.
    A small but growing team builds Honeycomb.

  6.
    We deploy with confidence.

  7.

  8.
    When it comes to software, speed is safety. Like ice skating, or bicycling.
    Speed up, gets easier. Slow down, gets wobblier.

  9.
    All while traffic has surged 3-5x in a year.

  10.
    Write workload, trailing year

  11.
    Read workload, trailing year

  12.
    Our confidence recipe:

  13.
    Quantify reliability.
    “Always up” isn’t a number, dude. And if you think you’re “always up,” your telemetry is terrible.

  14.
    Identify potential areas of risk.
    So many teams never look at their instrumentation until something is paging them. That is why they suffer. They only respond to heart attacks instead of eating vegetables and minding their god damn cholesterol.

  15.
    Design experiments to probe risk.
    Outages are just experiments you didn’t think of yet :D

  16.
    Prioritize addressing risks.

  17.
    Measuring reliability:

  18.
    How broken is “too broken”?

  19.
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Service Level Objectives (SLOs)
    Define and measure success!
    Popularized by Google, widely adopted now!

  20.
    SLOs are common language.
    SLOs are the APIs between teams that allow you to budget and plan instead of reacting and arguing.
    Loose coupling FTW!

  21.
    Think in terms of events in context.
    P.S. if you aren’t thinking in terms of (and capturing, and querying) arbitrarily-wide structured events, you are not doing observability.
    Rich context is the beating heart
    of observability.
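To make "arbitrarily-wide structured event" concrete, here is a minimal sketch of one event per unit of work, with every piece of context attached as its own column. The field names (`endpoint`, `user_id`, `build_id`, `rows_examined`, and so on) and the `build_event` helper are assumptions made up for illustration; they are not Honeycomb's schema or SDK.

```python
import json


def build_event(request, response, started_at, finished_at):
    """Fold everything known about one unit of work into one flat, wide event."""
    event = {
        "timestamp": started_at,
        "duration_ms": round((finished_at - started_at) * 1000, 3),
        "status_code": response["status_code"],
    }
    # Arbitrary extra context rides along as more columns; width is the point.
    event.update({f"request.{k}": v for k, v in request.items()})
    event.update(
        {f"response.{k}": v for k, v in response.items() if k != "status_code"}
    )
    return event


# Hypothetical request and response context for a single query.
request = {"endpoint": "/query", "user_id": "u-123", "build_id": "abc123", "dataset": "prod"}
response = {"status_code": 200, "rows_examined": 48211, "cache_hit": False}

event = build_event(request, response, started_at=100.0, finished_at=100.25)
print(json.dumps(event, sort_keys=True))
```

Because the event is a flat map, adding a new dimension later (a feature flag state, a customer tier) is one more key rather than a new metric or log format.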

  22.
    Is this event good or bad?

  23.
    Honeycomb's SLOs reflect user value.

  24.
    We make systems humane to run,

  25.
    by ingesting telemetry,

  26.
    enabling data exploration,

  27.
    and empowering engineers.

  28.
    What Honeycomb does
    ● Ingests customers’ telemetry
    ● Indexes on every column
    ● Enables near-real-time querying on newly ingested data
    Data storage engine and analytics flow

  29.
    SLOs are user flows
    Honeycomb’s SLOs
    ● home page loads quickly (99.9%)
    ● user-run queries are fast (99%)
    ● customer data gets ingested fast (99.99%)

  30.
    Service-Level Objectives
    ● Example Service-Level Indicators:
    ○ 99.9% of queries succeed within 10 seconds over a period of 30 days.
    ○ 99.99% of events are processed without error in 5ms over 30 days.
    ● 99.9% ≈ 43 minutes of violation in a month.
    ● 99.99% ≈ 4.3 minutes of violation in a month.
    but services aren't just 100% down or 100% up.
    DEGRADATION IS UR FRIEND
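The 43-minute and 4.3-minute figures come from multiplying the allowed failure fraction by the length of the window. A minimal sketch of that arithmetic, assuming a 30-day month; the function names are illustrative only. The second helper shows why degradation is friendlier than a full outage: if only a fraction of traffic is failing, the same budget lasts proportionally longer (assuming evenly spread traffic).

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of total outage allowed by an SLO target over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def minutes_until_exhausted(budget_minutes, bad_fraction):
    """How long a partial degradation can last before the budget is spent.

    bad_fraction is the share of requests failing (1.0 means fully down).
    """
    if not 0.0 < bad_fraction <= 1.0:
        raise ValueError("bad_fraction must be in (0, 1]")
    return budget_minutes / bad_fraction


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
# A degradation hitting 10% of requests takes ten times longer to burn the budget.
print(round(minutes_until_exhausted(error_budget_minutes(0.999), 0.10)))  # 432
```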

  31.
    Data-driven decisions and tradeoffs.

  32.
    Should we invest in more reliability?

  33.
    Is it safe to do this risky experiment?

  34.
    How to stay within SLO
    Simple answers, then more complicated answers

  35.
    Accelerate: State of DevOps 2021

  36.
    What's our recipe?

  37.
    Instrument as we code.

  38.
    Functional and visual testing.

  39.
    Design for feature flag deployment.
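The slide does not show how the flags are implemented. As one common way to design for flag-based deployment, here is a minimal sketch of a deterministic percentage rollout: new code ships dark and is turned on for a growing share of traffic. The `flag_enabled` helper, the flag name, and the hashing scheme are all assumptions for illustration, not Honeycomb's flag system.

```python
import hashlib


def flag_enabled(flag_name, unit_id, rollout_percent):
    """Return True if this unit (user, team, host) falls inside the rollout.

    Hashing flag name plus unit id gives each unit a stable bucket in 0..99,
    so a unit that is enabled at 10% stays enabled at 50%.
    """
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


def handle_ingest(payload, team_id, rollout_percent):
    # "new-ingest-path" is a hypothetical flag name.
    if flag_enabled("new-ingest-path", team_id, rollout_percent):
        return ("new", payload)
    return ("old", payload)


print(handle_ingest({"n": 1}, "team-42", rollout_percent=0))    # always the old path
print(handle_ingest({"n": 1}, "team-42", rollout_percent=100))  # always the new path
```

Setting the percentage back to 0 is the rollback, which takes effect without a deploy.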

  40.
    Automated integration & human review.

  41.
    Green button merge.

  42.
    Auto-updates, rollbacks, & pins.

  43.
    Observe behavior in prod.
    No Friday Deploys
    Don’t Merge and Run!

  44.
    Repeatable infrastructure with code.

  45.
    If infra is code, we can use CI & flags!

  46.
    Ephemeral fleets & autoscaling.

  47.
    Quarantine bad traffic.
    It is possible to both do some crazy ass shit in production and protect your users from any noticeable effects. You just need the right tools.
    What, like you were ever going to find those bugs in staging?

  48.
    Validating our expectations

  49.
    Experiment using error budgets.
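One way to read "experiment using error budgets" is as a gate: run the risky experiment only if its expected cost still leaves some budget in reserve. The sketch below assumes an event-based SLO; the `BudgetStatus` type, the `safe_to_experiment` helper, and the 25% reserve are illustrative assumptions, not a policy stated in the talk.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    total_events: int   # eligible events seen so far in the SLO window
    bad_events: int     # events that failed the SLI in the same window
    slo_target: float   # for example 0.999

    @property
    def allowed_bad(self) -> float:
        """Bad events the SLO tolerates for this much traffic."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the error budget not yet spent (negative if overspent)."""
        if self.allowed_bad == 0:
            return 0.0
        return 1.0 - self.bad_events / self.allowed_bad


def safe_to_experiment(status, expected_bad_events, reserve=0.25):
    """True if the experiment's expected cost still leaves `reserve` unspent."""
    projected = status.bad_events + expected_bad_events
    return projected <= (1.0 - reserve) * status.allowed_bad


status = BudgetStatus(total_events=1_000_000, bad_events=200, slo_target=0.999)
print(round(status.remaining_fraction, 2))                   # about 0.8 of the budget left
print(safe_to_experiment(status, expected_bad_events=300))   # fits within the reserve rule
print(safe_to_experiment(status, expected_bad_events=600))   # would eat into the reserve
```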

  50.
    Always ensure safety.

  51.

  52.
    Data persistence is tricky.

  53.
    Stateless request processing
    Stateful data storage

  54.

  55.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]

  56.
    Infrequent changes.

  57.
    Data integrity and consistency.

  58.
    Delicate failover dances

  59.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]

  60.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]

  61.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]

  62.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); two indexing workers and one indexing replay (field indexes); S3]

  63.
    Experimenting in prod

  64.
    Restart one server & service at a time.
    The goal is to test, not to destroy.

  65.
    At 3pm, not at 3am.

  66.
    "Bugs are shallow with more eyes."

  67.
    Monitor for changes using SLIs.
    Monitoring isn’t a bad word, it just isn’t observability.
    SLOs are a modern form of
    monitoring.
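A rough sketch of what monitoring with an SLI can look like: classify each event as good or bad, then compare the good fraction against the target. The 10-second threshold and 99.9% target mirror the query example given earlier in the deck; treating any status code below 500 as success, and the field names `status_code` and `duration_ms`, are assumptions for illustration.

```python
def is_good(event, max_duration_ms=10_000):
    """SLI predicate: the request succeeded and finished within the threshold."""
    return event["status_code"] < 500 and event["duration_ms"] <= max_duration_ms


def sli_compliance(events, max_duration_ms=10_000):
    """Fraction of events that were good, or None when there was no traffic."""
    events = list(events)
    if not events:
        return None
    good = sum(1 for e in events if is_good(e, max_duration_ms))
    return good / len(events)


def meets_slo(events, target=0.999, max_duration_ms=10_000):
    """No traffic counts as compliant; otherwise compare against the target."""
    compliance = sli_compliance(events, max_duration_ms)
    return compliance is None or compliance >= target


# Hypothetical window: 999 fast successes and one query that took 12 seconds.
window = [{"status_code": 200, "duration_ms": 120.0} for _ in range(999)]
window.append({"status_code": 200, "duration_ms": 12_000.0})

print(sli_compliance(window))
print(meets_slo(window))
```

The SLI tells you that something changed; finding out why is the job of the wide events behind it.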

  68.
    Debug with observability.

  69.
    Test the telemetry too!

  70.
    Verify fixes by repeating.

  71.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]

  72.
    [Architecture diagram. Labels: event batch (single events); three partition queues (single events); three indexing workers (field indexes); S3]
    FORESHADOWING

  73.
    [Diagram labels: two alerting workers; Zookeeper cluster]
    Yes, it is 2022 and people are still running zookeeper. People like us.

  74.
    [Diagram labels: two alerting workers; Zookeeper cluster]

  75.
    [Diagram labels: two alerting workers; Zookeeper cluster]

  76.
    De-risk with design & automation.

  77.
    [Architecture diagram. Labels: three partition queues (single events); six indexing workers (field indexes); S3]

  78.
    Continuously verify to stop regression.

  79.
    Save money with flexibility.

  80.
    ARM64 hosts
    Spot instances

  81.
    Non-trivial savings.
    Production Shepherd EC2 cost, grouped by instance type

  82.
    Not every experiment succeeds.
    But you can mitigate the risks.

  83.
    Three case studies of failure
    ● Ingest service crash
    ● Kafka instability
    ● Query performance degradation
    and what we learned from each.

  84.
    1) Shepherd: ingest API service
    Shepherd is the gateway to all ingest
    ● highest-traffic service
    ● stateless service
    ● cares about throughput first, latency close second
    ● used compressed JSON
    ● gRPC was needed.

  85.
    Honeycomb Ingest Outage
    ● In November, we were working on OTLP and gRPC ingest support
    ● Let a commit deploy that attempted to bind to a privileged port
    ● Stopped the deploy in time, but scale-ups were trying to use the new build
    ● Latency shot up, took more than 10 minutes to remediate, blew our SLO

  86.
    Now what?
    ● We could freeze deploys (oh no, don’t do this!)
    ● Delay the launch? We considered this...
    ● Get creative!

  87.
    Reduce Risk

  88.
    2) Kafka: data bus
    Kafka provides durability
    ● Decoupling components provides safety.
    ● But introduces new dependencies.
    ● And things that can go wrong.

  89.
    Our month of Kafka pain
    Read more: go.hny.co/kafka-lessons
    Longtime Confluent Kafka users
    First to use Kafka on Graviton2 at scale
    Changed multiple variables at once
    ● move to tiered storage
    ● i3en → c6gn
    ● AWS Nitro

  90.
    Unexpected constraints
    Read more: go.hny.co/kafka-lessons
    We thrashed multiple dimensions.
    We tickled hypervisor bugs.
    We tickled EBS bugs.
    Burning our people out wasn't worth it.

  91.
    Take care of your people
    Existing incident response practices
    ● Escalate when you need a break / hand-off
    ● Remind (or enforce) time off work to make up for off-hours incident response
    Official Honeycomb policy
    ● Incident responders are encouraged to expense meals for themselves and family during an incident

  92.
    Optimize for safety
    Ensure people don’t feel rushed.
    Complexity multiplies
    ● if a software program change takes t hours,
    ● software system change takes 3t hours
    ● software product change also takes 3t hours
    ● software system product change = 9t hours
    Maintain tight feedback loops, but not everything has an immediate impact.
    Source: Code Complete, 2nd Ed.

  93.
    3) Retriever: query service
    Retriever is performance-critical
    ● It calls to Lambda for parallel compute
    ● Lambda use exploded.
    ● Could we address performance & cost?
    ● Maybe.

  94.

  95.

  96.
    Making progress carefully

  97.
    Fast and reliable: pick both!
    Go faster, safely.

  98.
    Takeaways
    ● Design for reliability through full lifecycle.
    ● Feature flags can keep us within SLO, most of the time.
    ● But even when they can't, find other ways to mitigate risk.
    ● Discovering & spreading out risk improves customer experiences.
    ● Black swans happen; SLOs are a guideline, not a rule.

  99.
    Acknowledge hidden risks
    Examples of hidden risks
    ● Operational complexity
    ● Existing tech debt
    ● Vendor code and architecture
    ● Unexpected dependencies
    ● SSL certificates
    ● DNS
    Discover early and often through testing.

  100.
    Make experimentation routine!

  101.
    Takeaways
    ● We are part of sociotechnical systems: customers, engineers, stakeholders
    ● Outages and failed experiments are unscheduled learning opportunities
    ● Nothing happens without discussions between different people and teams
    ● Testing in production is fun AND good for customers
    ● Where should you start? DELIVERY TIME DELIVERY TIME DELIVERY TIME

  102.
    Understand & control production.
    Go faster on stable infra.
    Manage risk and iterate.

  103.
    Find out more
    Read our blog! hny.co/blog
    We're hiring! hny.co/careers

  104.
    www.honeycomb.io