
Observability and the Glorious Future

Slides from the O'Reilly Infrastructure & Ops Superstream, based on an ADDO keynote by Liz Fong-Jones.

How do modern teams develop and ship code and debug it in production? We give an overview of the Honeycomb backend, then discuss some chaos engineering experiments and SLO violations, and how we use fine-tuned modern production tooling to increase engineering efficiency and scale our services by continually testing them. In production.

Includes speaker notes.

Charity Majors

January 12, 2022



Transcript

  1. V6-21
    Charity Majors (slides by Liz Fong-Jones)


    CTO, Honeycomb


    @mipsytipsy at Infrastructure & Ops Superstream: Observability
    Observability


    And the Glorious Future
    w/ illustrations by @emilywithcurls!
“You may experience a bit of cognitive dissonance during this talk, since you are probably familiar with Liz’s slide style, even-handedness, and general diplomatic approach. I have tried to play Liz on TV; it didn’t fool anybody. So anytime you see an annoying little pony pop up mouthing off: don’t blame Liz.”


  2. V6-21
    Observability is evolving quickly.
    2
    “Your bugs are
    evolving faster”
Everybody here has heard me fucking talk about observability and what it is. So instead, we’re going to walk you through the future we’re already living in at Honeycomb.


    o11y is the ability to understand our systems without deploying new instrumentation


    The o11y space has a lot of product requirements that are evolving quickly


    - large volume of data to analyze


- users are placing increasing demands on tooling


  3. V6-21
    3
[Slide diagram: actions (instrument, query, data) driving outcomes (operational resilience, managed tech debt, quality code, predictable releases, user insight)]
And the problem space is complex. Anyone who tells you that you can just “buy their tool” and get a high-performing engineering team is selling you something stupid.
We care about predictable releases, quality code, managing tech debt, operational resilience, and user insights.


    Observability isn’t frosting you put on the cake after you bake it. It’s about ensuring that your code is written correctly, performing well, doing its job for each and every user


    Code goes in from the IDE, and comes out your o11y tool


How do you get developers to instrument their code? How do you store the metadata about the data, which may be several times the size of the data itself? And none of it matters if you can’t actually ask the right questions when you need to.


  4. V6-21
    Practitioners need velocity, reliability, & scalability.
    4
    You DO NOT ACTUALLY KNOW
    if your code is working or not until
    you have observed it in production
    A lot of people seem to feel like these are in tension with each other.


    Product velocity vs reliability or scalability


  5. V6-21
    A small but growing team builds Honeycomb.
    5
    At Honeycomb we’re a small engineering team, so we have to be very deliberate about where we invest our time, and have automation that speeds us up rather than slowing us down


We have about 100 people now, and 40 engineers, which is 4x as many as we had two years ago. We’re 6 years in now, and for the first 4 years we had 4-10 engineers. Sales used to beg me not to tell anyone how few engineers we had, whereas I always wanted to shout it from the rooftops. Can you BELIEVE the shit that we have built and how quickly we can move? I LOVED it when people would gawp and say they thought we had fifty engineers.


  6. V6-21
    We deploy with confidence.
    6
    One of the things that has always helped us compete is that we don’t have to think about deploys. You merge some code, it gets rolled out to dogfood, prod etc automatically


    On top of that, we comfortably deploy on Fridays. Obviously. Why would we sacrifice 20% of our velocity? Worse yet, why would we let merges pile up for monday?


    We deploy every weekday and avoid deploying on weekends


  7. V6-21
    7
    One of the things that has always helped us compete is that we don’t have to think about deploys. You merge some code, it gets rolled out to dogfood, prod etc automatically


    On top of that, we comfortably deploy on Fridays. Obviously. Why would we sacrifice 20% of our velocity? Worse yet, why would we let merges pile up for monday?


    We deploy every weekday and avoid deploying on weekends


  8. V6-21
    When it comes to software, speed
    is safety. Like ice skating, or
    bicycling.


    Speed up, gets easier. Slow
    down, gets wobblier.
    Here’s what that looks like


    This graph shows the number of distinct build_ids running in our systems per day


    We ship between 10-14x per day


    This is what high agility looks like for a dozen engineers


    Despite this, we almost never have emergencies that need a weekend deploy. Wait, I mean BECAUSE OF THIS.


  9. V6-21
    All while traffic has surged 3-5x in a year.
I would like to remind you that we are running the combined production loads of several hundred customers. Depending on how they instrumented their code, perhaps a multiple of their traffic.


    We’ve been doing all this during the pandemic, while shit accelerates.


And if you think this arrow is a bullshit excuse for a graph, we’ve got better ones.


  10. V6-21
    Write workload, trailing year
    Writes have tripled


  11. V6-21
    Read workload, trailing year
    Reads have 3-5x’d


    This is a lot of scaling for a team to have to do on the fly, while shipping product constantly, and also laying the foundation for future product work by refactoring and paying down debt.


  12. V6-21
    Our confidence recipe:
    5:00


    We talk a pretty big game. So how do we balance all these competing priorities? How do we know where to spend our incredibly precious, scarce hours?


  13. V6-21
    Quantify reliability.
    13
    “Always up” isn’t a number, dude.
    And if you think you’re “always
    up,” your telemetry is terrible.
    Not just tech but cultural processes that reflected our values


    Prioritizing high agility on the product and maintaining reliability, and figuring out that sweet spot where we maintain both


  14. V6-21
    Identify potential areas of risk.
    So many teams never look at their
    instrumentation until something is
    paging them. That is why they
    suffer. They only respond to heart
    attacks instead of eating
    vegetables and minding their god
    damn cholesterol.


    This requires continuous improvement, which means addressing the entropy that inevitably takes hold in our systems


    Proactively looking at what’s slowing us down from shipping and investing our time in fixing that when it starts to have an impact


If you wait for it to page you before you examine your code, it’s like counting on a quadruple bypass instead of eating your vegetables.


  15. V6-21
    Design experiments to probe risk.
    Outages are just experiments you
    didn’t think of yet :D


  16. V6-21
    Prioritize addressing risks.
    Engineers run and own their own code.


  17. V6-21
    Measuring reliability:
    7:00


  18. V6-21
    How broken is “too broken”?
    18
    How broken is “too broken”? How do we measure that?


    —-- (next: intro to SLOs)


The system should survive LOTS of failures. Never alert on symptoms like disk space or CPU.


  19. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Service Level Objectives (SLOs)
    Define and measure success!


    Popularized by Google, widely adopted now!
    at Honeycomb we use service level objectives


    which represent a common language between engineering and business stakeholders


    we define what success means according to the business


    and measure it with our system telemetry throughout the lifecycle of a customer


    that’s how we know how well our services are doing


    and it’s how we measure the impact of changes


  20. V6-21
    SLOs are common language.
    SLOs are the APIs between
    teams that allow you to budget
    and plan instead of reacting and
    arguing. Loose coupling FTW!
    They’re a tool we use as a team internally to talk about service health and reliability.


    SLOs are the API between teams. They allow you to budget and prepare instead of just reacting and arguing


  21. V6-21
    Think in terms of events in context.
    21
    P.S. if you aren’t thinking in terms
    of (and capturing, and querying)
    arbitrarily-wide structured events,
    you are not doing observability.
    Rich context is the beating heart
    of observability.
    What events are flowing thru your system, and what’s all the metadata?


  22. V6-21
    Is this event good or bad?
    22
    [event from above being sorted by a robot or machine into the good or bad piles]


  23. V6-21
    Honeycomb's SLOs reflect user value.
    23
    And the strictness of those SLOs depends on the reliability that users expect from each service.


    SLOs serve no purpose unless they reflect actual customer pain and experience.


  24. V6-21
    We make systems humane to run,
    24
    Honeycomb’s goal as a product is to help you run your systems humanely


Without waking you up in the middle of the night to tear your hair out trying to figure out what’s wrong.


  25. V6-21
    by ingesting telemetry,
    25
    The way we do that is by ingesting your systems’ telemetry data


  26. V6-21
    enabling data exploration,
    26
    And then making it easy to explore that data


    By asking ad-hoc, novel questions


    Not pre-aggregated queries but anything you might think of


  27. V6-21
    and empowering engineers.
    27
    And then we make your queries run performantly enough so that you feel empowered as an engineer to understand what’s happening in your systems. Exploration requires sub-second
    results.


    Not an easy problem.


  28. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    What Honeycomb does


    ● Ingests customer’s telemetry


    ● Indexes on every column


    ● Enables near-real-time querying

    on newly ingested data
    Data storage engine and analytics flow
    Honeycomb is a data storage engine and analytics tool


we ingest our customers’ telemetry data and then we enable fast querying on that data


  29. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    SLOs are user flows
    Honeycomb’s SLOs


    ● home page loads quickly (99.9%)


    ● user-run queries are fast (99%)


    ● customer data gets ingested fast (99.99%)
    SLOs are for service behavior that has customer impact


    at Honeycomb we want to ensure things like


    the in-app home page should load quickly with data


    that user-run queries should return results fast


    and that customer data we’re trying to ingest should be stored fast and successfully


    these are the sorts of things that our product managers and customer support teams frequently talk to engineering about


    However, if a customer runs a query of some crazy complexity… it can take up to 10 sec


    It’s ok if one fails once in a while. But our top priority is ingest. We want to get it out of our customers’ RAM and into honeycomb as quickly as possible.


  30. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Service-Level Objectives
    30
    30
    ● Example Service-Level Indicators:


    ○ 99.9% of queries succeed within 10 seconds over a period of 30 days.


    ○ 99.99% of events are processed without error in 5ms over 30 days.

    ● 99.9% ≈ 43 minutes of violation in a month.


    ● 99.99% ≈ 4.3 minutes of violation in a month.


    but services aren't just 100% down or 100% up.
    DEGRADATION IS UR FRIEND
    Fortunately, services are rarely 100% up or down. If services are degraded by 1%, then we have 4300 minutes to investigate and fix the problem
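Roughly the arithmetic behind those numbers, as a sketch in Go (assuming a 30-day window; the helper is illustrative, not Honeycomb's tooling):

```go
package main

import (
	"fmt"
	"time"
)

// budgetMinutes returns how many minutes of total outage a target allows
// over the window, and how long a partial degradation can last before the
// budget is exhausted.
func budgetMinutes(target, degradedFraction float64, window time.Duration) (full, degraded float64) {
	budget := 1 - target               // e.g. 0.001 for a 99.9% target
	full = budget * window.Minutes()   // minutes of 100% outage allowed
	degraded = full / degradedFraction // minutes allowed at a partial failure rate
	return full, degraded
}

func main() {
	window := 30 * 24 * time.Hour
	full, degraded := budgetMinutes(0.999, 0.01, window)
	fmt.Printf("99.9%% over 30d: %.0f min of outage, or %.0f min at 1%% errors\n", full, degraded)
	// prints roughly 43 minutes of total outage, ~4300 minutes at a 1% error rate
}
```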


  31. V6-21
    Data-driven decisions and tradeoffs.
    31
    Charity: making failure budgets explicit.


  32. V6-21
    Should we invest in more reliability?
    32


  33. V6-21
    Is it safe to do this risky experiment?
    33
Too much reliability is as bad as too little. We need to induce risk to rehearse, so we can move faster.


  34. V6-21
    How to stay within SLO
    Simple answers, then more complicated answers
    20:00


  35. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 35
    35
    Accelerate: State of DevOps 2021
    You can have many small breaks, but not painful ones.


    Elite teams can afford to fail quickly


  36. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    What's our recipe?
    36
    36
    How do we go about turning lines of code into a live service in prod, as quickly and reliably as possible?



  37. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Instrument as we code.
    37
    37
    We practice observability-driven development.


    Before we even start implementing a feature we ask,


    “How is this going to behave in production?”


    and then we add instrumentation for that.



    Our instrumentation generates not just flat logs but rich structured events that we can query and dig down into the context.
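As a quick sketch of what that can look like (using OpenTelemetry purely as an illustration; the attribute names and helper are made up, not Honeycomb's actual code):

```go
package shepherd

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleBatch is instrumented as it is written: before the feature ships we
// decide what we will need to see in production and attach it as wide,
// structured attributes on the span, rather than as flat log lines.
func handleBatch(ctx context.Context, teamID string, events []map[string]interface{}) error {
	ctx, span := otel.Tracer("shepherd").Start(ctx, "handleBatch")
	defer span.End()

	span.SetAttributes(
		attribute.String("app.team_id", teamID),
		attribute.Int("app.batch_size", len(events)),
	)

	err := process(ctx, events) // hypothetical downstream work
	if err != nil {
		span.RecordError(err)
	}
	return err
}

func process(ctx context.Context, events []map[string]interface{}) error { return nil }
```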


  38. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Functional and visual testing.
    38
    38
    We don’t stop there. We lean on our tests to give us confidence long before the code hits prod.


    You need both meaningful tests and rich instrumentation.


Not clicking around, but using libraries and user stories.


  39. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Design for feature flag deployment.
    39
    39
    We intentionally factor our code to make it easy to use feature flags,


    which allows us to separate deploys from releases and manage the blast radius of changes


    Roll out new features as no-ops from the user perspective


    Then we can turn on a flag in a specific environment or for a canary subset of traffic, and then ramp it up to everybody


    But we have a single build, the same code running across all environments.
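A minimal sketch of how deploys get separated from releases behind a flag (the flag client interface and flag name here are hypothetical, not Honeycomb's actual setup):

```go
package ingest

// Event is a simplified stand-in for a single telemetry event.
type Event map[string]interface{}

// flagClient is a stand-in for whatever feature-flag SDK is in use.
type flagClient interface {
	BoolVariation(flag, key string, def bool) bool
}

type Handler struct {
	Flags flagClient
}

// HandleBatch ships dark: the new code path is deployed but stays a no-op
// until the flag is ramped up per environment, then for a canary slice of
// traffic, then for everyone. Same build everywhere; only the flag changes.
func (h *Handler) HandleBatch(teamID string, events []Event) error {
	if h.Flags.BoolVariation("ingest-new-codepath", teamID, false) {
		return h.handleBatchNew(teamID, events)
	}
	return h.handleBatchOld(teamID, events)
}

func (h *Handler) handleBatchOld(teamID string, events []Event) error { /* existing path */ return nil }
func (h *Handler) handleBatchNew(teamID string, events []Event) error { /* flagged path */ return nil }
```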


  40. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Automated integration & human review.
    40
    40
    That union of human and technology is what makes our team a socio-technical system.


    You need to pay attention to both, so I made sure our CI robot friend gets a high five.


    All of our builds complete within 10 min, so you aren’t going to get distracted and walk away. If your code reviewer asks for a change, you can get to it quickly. Tight feedback loops


  41. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Green button merge.
    41
    41
    Once the CI is green and reviewers give the thumbs up, we merge.


    No “wait until after lunch” or “let’s do it tomorrow.”


    Why? Want to push and observe those changes while the context is still fresh in our heads.


    We merge and automatically deploy every day of the work week.


  42. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Auto-updates, rollbacks, & pins.
    42
    42
    We’ll talk more about how our code auto-updates across environments, and the situations when we’ll do a rollback or pin a specific version


    We roll it out thru three environments: kibble, dogfood, then prod.


  43. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Observe behavior in prod.
    43
    43
    No Friday Deploys


    Don’t Merge and Run!
    And finally we bring it full-circle. We observe the behavior of our change in production using the instrumentation we added at the beginning.


    Your job is not done until you close the loop and observe it in prod.


    We check right away and then a little bit later to see how it’s behaving with real traffic and real users.


    It’s not “no friday deploys”, it’s “don’t merge and run”


    ——- (next: three environments)


  44. V6-21
    Repeatable infrastructure with code.
    All our infrastructure is code under version control. All changes are subject to peer review and go through a build process.


    It’s like gardening, you need to be proactive about pulling weeds. This way we never have changes in the wild.


  45. V6-21
    If infra is code, we can use CI & flags!
    On top of that, we use cloud services to manage our Terraform state.


    We used to have people applying infrastructure changes from their local dev environments


    using their individual AWS credentials.


    With a central place to manage those changes, we can for example, limit our human AWS user permissions to be much safer.


    We use Terraform Cloud, and they’re kinda the experts on Terraform.


    We don’t have to spend a bunch of engineering resources standing up systems to manage our Terraform state for us.


    They already have a handle on it.


  46. V6-21
    Ephemeral fleets & autoscaling.
    We can turn on or off AWS spot in our autoscaling groups


    and feature flags allow us to say,


    Hey under certain circumstances, let’s stand up a special fleet


It’s pretty dope when you can use Terraform variables to control whether or not infra is up. We can automatically provision ephemeral fleets to catch up if we fall behind in our most important workloads.
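Honeycomb drives this through Terraform variables; purely as an illustration of the same idea, here is a hedged Go sketch that resizes a hypothetical catch-up autoscaling group via the AWS SDK:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// scaleCatchUpFleet stands up (or tears down) an ephemeral worker fleet by
// resizing its autoscaling group. In the real setup this decision would be
// driven by a Terraform variable or feature flag plus a lag metric.
func scaleCatchUpFleet(asgName string, desired int64) error {
	svc := autoscaling.New(session.Must(session.NewSession()))
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(asgName),
		DesiredCapacity:      aws.Int64(desired),
		HonorCooldown:        aws.Bool(false),
	})
	return err
}

func main() {
	// Hypothetical: spin up 6 extra catch-up workers while we're behind.
	if err := scaleCatchUpFleet("retriever-catchup", 6); err != nil {
		log.Fatal(err)
	}
}
```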


  47. V6-21
    Quarantine bad traffic.
    It is possible to both do some crazy ass shit in
    production and protect your users from any
    noticeable effects. You just need the right tools.


    What, like you were ever going to find those bugs in
    staging?
    If we have a misbehaving user we can quarantine them to a subset of our infrastructure


    We can set up a set of paths that get quarantined so we can keep it from crashing the main fleets, or do more rigorous testing.


    It is possible to both do crazy shit in production and protect your users from the noticeable effects


    So we can observe how their behavior affects our systems with like CPU profiling or memory profiling


    and prevent them from affecting other users
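As an illustrative sketch (not Honeycomb's actual mechanism), quarantining could look like routing flagged teams to a dedicated fleet at a proxy layer; the hostnames and team ID here are made up:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// quarantined lists team IDs whose traffic we want isolated onto a
// dedicated fleet, so we can profile it without risking everyone else.
var quarantined = map[string]bool{"team-12345": true}

func newProxy(target string) *httputil.ReverseProxy {
	u, _ := url.Parse(target)
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	normal := newProxy("http://shepherd-main.internal")
	isolated := newProxy("http://shepherd-quarantine.internal")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if quarantined[r.Header.Get("X-Honeycomb-Team")] {
			isolated.ServeHTTP(w, r)
			return
		}
		normal.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```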


    —- (to Shelby)


  48. V6-21
    Validating our expectations
    25:00



  49. V6-21
    Experiment using error budgets.
    You may be familiar with the four key DORA metrics and the research published in the Accelerate book.


    These metrics aren’t independent data points.


    you can actually create positive virtuous cycles when you improve even one of those metrics.


    And that’s how we did it at Honeycomb.


    If you have extra budget, stick some chaos in there


  50. V6-21
    Always ensure safety.
    50
    Chaos engineering is _engineering_, not just pure chaos.


    And if you don’t have observability, you probably just have chaos.


  51. V6-21
    51
    We can use feature flags for an experiment on a subset of users, or internal users


  52. V6-21
    Data persistence is tricky.
    That works really well for stateless stuff but not when each request is not independent or you have data sitting on disk.


  53. V6-21
    Stateless request processing

    Stateful data storage
    How do we handle a data-driven service that allows us to become confident in that service?


All frontend services are stateless, of course. But we also have a lot of Kafka, retriever, and MySQL.

We deploy our infra changes incrementally to reduce the blast radius.

We’re able to do that because we can deploy multiple times in a day.


    There’s not a lot of manual overhead.


    So we can test the effects of changes to our infrastructure with much lower risk


  54. V6-21
    Let’s zoom in on the stateful part of that infra diagram


    We deploy our infra changes incrementally to reduce the blast radius


We’re able to do that because we can deploy multiple times in a day.


    There’s not a lot of manual overhead.


    So we can test the effects of changes to our infrastructure with much lower risk


  55. V6-21
[Diagram: event batches split into Kafka partition queues, consumed by indexing workers that build per-field index files and upload them to S3]
Data flows in to shepherd, and that constitutes a batch of events, on the left. What do we do with those? We split them apart and send them to the appropriate partition. If one partition is not writable…

The middle tier here is Kafka. Within a given Kafka partition, events are preserved in order. This is important, because if you don’t have a deterministic ordering, it’s very hard to ensure data integrity: you won’t have an idempotent or reliable sense of where this data is coming from and what you should expect to see.

The indexing workers decompose them into one file per column per set of attributes. So “service name” comes in from multiple events, and on each indexing worker, service name from multiple events becomes its own file that gets appended to in order, based on the order it’s read from Kafka. And then we finally tail it off to AWS S3.
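A toy sketch of that per-column, append-in-order layout (purely illustrative; the real retriever format is far more involved):

```go
package index

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// ColumnWriter appends values for one field of one dataset to its own file,
// in the order events are consumed from the Kafka partition, so a replay
// from a known offset reproduces exactly the same files.
type ColumnWriter struct {
	dir   string
	files map[string]*os.File
}

func NewColumnWriter(dir string) *ColumnWriter {
	return &ColumnWriter{dir: dir, files: map[string]*os.File{}}
}

// Append writes each field of the event to that field's column file.
func (w *ColumnWriter) Append(event map[string]interface{}) error {
	for field, value := range event {
		f, ok := w.files[field]
		if !ok {
			var err error
			f, err = os.OpenFile(filepath.Join(w.dir, field+".col"),
				os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
			if err != nil {
				return err
			}
			w.files[field] = f
		}
		line, err := json.Marshal(value)
		if err != nil {
			return err
		}
		if _, err := f.Write(append(line, '\n')); err != nil {
			return err
		}
	}
	return nil
}
```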


  56. V6-21
    Infrequent changes.
    What are the risks?


Well, Kafka doesn’t change much. We update it maybe every couple of months. The brokers are also on very stable machines that don’t change very often. Unlike shepherd, which we run on spot instances and which is constantly churning, we make sure Kafka is on stable machines. They rarely get restarted without us.

  57. V6-21
    Data integrity and consistency.
We have to make sure we can survive the disappearance of any individual node, while also not having our nodes churn too often.


  58. V6-21
    Delicate failover dances
    There’s a very delicate failover dance that has to happen whenever we lose a stateful node, whether that is kafka, zookeeper, or retriever.


  59. V6-21
[Diagram: event batch, Kafka partition queues, indexing workers, S3 (repeated to illustrate losing a Kafka broker)]
    So what happens if we lose a kafka broker? What’s supposed to happen, is all brokers have replicas on other brokers.


  60. V6-21
[Diagram: same pipeline, illustrating a replacement Kafka node taking over the lost broker’s partitions]
    When there’s a new kafka node available, it receives all of the old partitions that the old kafka node is responsible for, and may or may not get promoted to leader in its own due time



  61. V6-21
[Diagram: same pipeline, illustrating the loss of an indexing worker]
    If we lose an indexing worker, well we don’t run a single indexer per partition, we run two. The other thing that’s supposed to happen is


  62. V6-21
[Diagram: same pipeline, with one indexing worker replaced by an indexing replay restoring from a snapshot]
We are supposed to be able to replay and restore that indexing worker, either from a peer — the original design, about 5 years ago — or from filesystem snapshots. Either way you have a stale set of data that you’re replaying from a backup, and now the ordering of that partition queue becomes really important, right? Because you KNOW where you snapshotted, and you can replay that partition forward. And if your snapshot is no more than an hour old, then you only have to replay the most recent hour. Great! So, how can we test this?


  63. V6-21
    Experimenting in prod
    This is how we continuously test our kafkas and retrievers to make sure they’re doing what we expect.


  64. V6-21
    Restart one server & service at a time.
    64
    The goal is to test, not
    to destroy.
First of all, we’re testing, not destroying. One server from one service at a time. These are calculated risks, so actually calculate them.


  65. V6-21
    At 3pm, not at 3am.
    65
    Don’t be an asshole


You want to help people practice when things go wrong, and you want to practice under peak conditions!


  66. V6-21
    "Bugs are shallow with more eyes."
    66


  67. V6-21
    Monitor for changes using SLIs.
    67
    Monitoring isn’t a bad word, it just
    isn’t observability.


    SLOs are a modern form of
    monitoring.
    Monitoring isn’t a bad word, it just isn’t observability. Let’s monitor our SLIs. did we impact our monitoring?


    SLOs are a modern form of monitoring.


  68. V6-21
    Debug with observability.
    68
    When something does go wrong, it’s probably something you didn’t anticipate (duh) which means you rely on instrumentation and rich events to explore and ask new questions.


  69. V6-21
    Test the telemetry too!
    69
    It’s not enough to just test the node. What if you replace a kafka node, but the node continues reporting that it’s healthy? Even if it got successfully replaced, this can inhibit your ability to
    debug


    We think it’s important to use chaos engineering not to just test our systems but also our ability to observe our systems.


  70. V6-21
    Verify fixes by repeating.
    70
    If something broke and you fixed it, don’t assume it’s fixed til you try


  71. V6-21
[Diagram: same ingest pipeline, revisiting the lost-Kafka-node hypothesis]
    Let’s talk about this hypothesis again. What if you lose a kafka node, and the new one doesn’t come back up?



  72. V6-21
[Diagram: same ingest pipeline]
    FORESHADOWING
    Turns out that testing your automatic kafka balancing is SUPER important. We caught all kinds of interesting things that have happened inside the kafka rebalancer simply by killing nodes
    and seeing whether they come back successfully and start serving traffic again.


We need to know this, because if there’s a major outage and we aren’t able to reshuffle the data on demand, this can be a serious emergency. It can manifest as disks filling up rapidly, if you have five nodes consuming the data normally handled by six. And if you’re doing this during daytime peak, and if you’re also trying to catch up a brand new Kafka broker at the same time, that can overload the system.


  73. V6-21
    Alerting worker
    Alerting worker
    Zookeeper cluster
    Yes, it is 2022 and people
    are still running zookeeper.
    People like us.
    Let’s talk about another category of failure we’ve found through testing!!


So Honeycomb lets our customers send themselves alerts on any defined query if something is wrong, according to the criteria they gave us.

We want you to get exactly one alert, not duplicates, not zero. So how do we do this? We elect a leader to run the alerts for a given minute using Zookeeper. Zookeeper is redundant, right?! Let’s kill one node and find out!


  74. V6-21
    Alerting worker
    Alerting worker
    Zookeeper cluster
    Annnnd the alerts didn’t run.


    Why? Well, both alerting workers were configured to try to talk to index zero only.


We killed a node twice, no problem. The third time, we killed index zero.


  75. V6-21
    Alerting worker
    Alerting worker
    Zookeeper cluster
I replaced index node zero, and the alerting workers didn’t run. So we discovered at 3pm, not 3am, a bug that would eventually have bitten us in the ass and made customers unhappy. The mitigation, of course, was just to make sure that our Zookeeper client is talking to all of the Zookeeper nodes.
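For illustration, the shape of that fix with a Go ZooKeeper client; a sketch assuming the go-zookeeper library, not Honeycomb's actual code:

```go
package main

import (
	"log"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	// The bug: connecting only to zk-0 means losing that one node breaks
	// leader election even though the ensemble as a whole is healthy.
	// bad, _, _ := zk.Connect([]string{"zk-0.internal:2181"}, 10*time.Second)

	// The fix: hand the client the whole ensemble so it can fail over.
	conn, _, err := zk.Connect([]string{
		"zk-0.internal:2181",
		"zk-1.internal:2181",
		"zk-2.internal:2181",
	}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```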


  76. V6-21
    76
    De-risk with design & automation.


  77. V6-21
[Diagram: Kafka partition queues feeding six indexing workers (per-field index files) and S3]
Our previous design for retrievers was that if one went down, the other would rsync off its buddy to recover. But what if you lose two indexing workers at the same time from the same partition?

Eh, that’ll never happen.

So as we’re cycling our retriever fleet, or in the middle of moving them to a new class of instances, wouldn’t it be nice if it didn’t feel like stepping very, very carefully through a crowded minefield to make sure you never hit two from the same partition?

What if, instead of having to worry about your peers all the time, you could just replay off the AWS snapshot? That makes your bootstrap a lot more reliable.

The more workers we have over time, the scarier that was going to become. So yeah, we’re able to restore workers on demand now. And we continuously…


  78. V6-21
    78
    Continuously verify to stop regression.
Every Monday at 3pm, we kill the oldest retriever node.

Every Tuesday at 3pm, we kill the oldest Kafka node.

That way we can verify continuously that our node replacement systems are working properly and that we are adequately provisioned to handle losing a node during peak load. How often do you think we get paged about this at night, now?
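A rough sketch of what such a scheduled kill job could look like (the tag names and service label are assumptions; the real job would hook into their scheduler and alerting):

```go
package main

import (
	"log"
	"sort"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// terminateOldest finds the longest-running instance carrying the given
// service tag and terminates it, exercising the node-replacement path on a
// schedule (e.g. Mondays at 3pm for retriever, Tuesdays for Kafka).
func terminateOldest(service string) error {
	svc := ec2.New(session.Must(session.NewSession()))

	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("tag:service"), Values: []*string{aws.String(service)}},
			{Name: aws.String("instance-state-name"), Values: []*string{aws.String("running")}},
		},
	})
	if err != nil {
		return err
	}

	var instances []*ec2.Instance
	for _, r := range out.Reservations {
		instances = append(instances, r.Instances...)
	}
	if len(instances) == 0 {
		return nil
	}
	// Oldest first, by launch time.
	sort.Slice(instances, func(i, j int) bool {
		return instances[i].LaunchTime.Before(*instances[j].LaunchTime)
	})

	oldest := instances[0]
	log.Printf("terminating oldest %s node: %s", service, *oldest.InstanceId)
	_, err = svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: []*string{oldest.InstanceId},
	})
	return err
}

func main() {
	if err := terminateOldest("retriever"); err != nil {
		log.Fatal(err)
	}
}
```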


  79. V6-21
    Save money with flexibility.
    79
    Finally, we want user queries to return fast, but we’re not as strict about this.


    So we want 99% of user queries to return results in less than 10s


    —-- (next: back to graph)


  80. V6-21
    ARM64 hosts

    Spot instances
What happens when you have lots of confidence in your system’s ability to continuously repair and flex?

You get to deploy lots of fun things to help make your life easier and make your service performant and cost less.

Out of this entire diagram, our entire forward tier has been converted into spot instances: preemptible AWS instances. Because they recover from being restarted very easily, we can take advantage of that 60% cost savings.

Secondly, three of these services are running on Graviton2-class instances, knowing that if there were a problem, we could easily revert.


  81. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Non-trivial savings.
    Production Shepherd EC2 cost, grouped by instance type
    But having rolled it out, it saved us 40% off our bill.


    Having the ability to take that leftover error budget and turn it into innovation or turn it into cost savings is how you justify being able to set that error budget and experiment with the error budget. Use it for the good of your service!


  82. V6-21
    Not every experiment succeeds.
    But you can mitigate the risks.
    45:00



  83. V6-21
    ● Ingest service crash

    ● Kafka instability

    ● Query performance degradation
    and what we learned from each.
    Three case studies of failure
    Three things that went catastrophically wrong, where we were at risk of violating one of our SLOs.


  84. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    1) Shepherd: ingest API service
    Shepherd is the gateway to all ingest


    ● highest-traffic service


    ● stateless service


    ● cares about throughput first, latency
    close second


    ● used compressed JSON


    ● gRPC was needed.
    for graviton2, we chose to try things out on shepherd


    because it’s the highest-traffic but it’s also relatively straightforward


    it’s stateless and it only scales on CPU


    as a service, it’s optimized for throughput first and then latency


We care about getting that data and sucking it out of customers and onto our disks very fast. We were previously using a compressed JSON payload, transmitted over HTTPS. However, there is a new standard called OpenTelemetry, a vendor-neutral mechanism for collecting and transmitting data out of a service, including tracing data and metrics. It supports a gRPC-based protocol over HTTP/2. Our customers were asking for this, and we knew it would be better and more effective for them in the long run. So we decided to make sure we can ingest not just our old HTTP JSON protocol, but also the newer gRPC protocol. So we said okay, let’s go ahead and turn on a gRPC listener. Okay, it works fine!


  85. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 85
    85
    Honeycomb Ingest Outage
    ● In November, we were working on OTLP and gRPC ingest support

    ● Let a commit deploy that attempted to bind to a privileged port

    ● Stopped the deploy in time, but scale-ups were trying to use the new build

    ● Latency shot up, took more than 10 minutes to remediate, blew our SLO
    Except it was binding on a privileged port, and crashing on startup.


    We managed to stop the deploy in time, thanks to a previous outage we had where we pushed out a bunch of binaries that didn’t build, so we had some health checks in place that would stop it from rolling out any further. That’s the good news. The bad news is, the new workers that were starting up were getting the new binary, and those
    workers were failing to serve traffic.


    And not only that, but because they weren’t serving traffic the cpu usage was zero. So aws autoscaler was like hey, let’s take that service and turn it down, you aren’t using it. So latency facing our end users went really high, and took us more than 10 minutes to remediate, which blew our SLO error budget


  86. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 86
    86
    Now what?
    ● We could freeze deploys (oh no, don’t do this!)

    ● Delay the launch? We considered this...

    ● Get creative!
    The SRE book says freeze deploys. Dear god no don’t do this. More and more product changes will just pile up. And the risk increases.


    Code ages like fine milk.


So we recommend changing the nature of your work from product features to reliability work, but using your normal deploy pipeline. So it’s changing the nature of the work you’re doing, but it’s not stopping all work, and it’s not setting traps for you later by just blissfully pounding out features and hoping someday they work in production.


    Should we delay the launch? That sucks, it was a really important launch for a partner of ours, and we knew our users wanted it.


  87. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Reduce Risk
    87
    87
So we decided instead to apply infrastructure feature flagging. We decided to send the experimental path to a different set of workers: HTTP/2 gRPC traffic goes to a dedicated fleet. That way we can keep the 99.9% of users using JSON serving perfectly, because we are tee-ing the traffic for them at the load balancer level. This is how we ensured we could serve reliably as well as experiment.


    There were some risks, right? We had to ensure we were continuously integrating both branches together, we had to make sure that we had a mechanism for turning it down over time, but those are manageable compared to the cost of either not doing the launch or freezing releases.


  88. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    2) Kafka: data bus
    Kafka provides durability


    ● Decoupling components provides safety.


    ● But introduces new dependencies.


    ● And things that can go wrong.
    So that was one example of making decisions based on our error budget. Let’s talk about a second outage


    Kafka is our persistence layer. Once shepherd has handed it off to kafka, the shepherd can go away, and it won’t drop data. It’s sitting in a queue waiting for a retriever indexer to pick it up. Decoupling the components provides safety. It means we can restart either producers or consumers more or less on demand, and they’ll be able to catch
    up and replay and get back to where you were.


    Kafka has a cost, though: complexity.


  89. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Our month of Kafka pain
    Read more: go.hny.co/kafka-lessons
    Longtime Confluent Kafka users


    First to use Kafka on Graviton2 at scale


    Changed multiple variables at once


    ● move to tiered storage


    ● i3en → c6gn


    ● AWS Nitro
    between shepherd and retriever sits kafka


    it allows us to decouple those two services and replay event streams


We were having scalability issues with our Kafka and needed to improve its reliability by consolidating. We realized that we only replay the most recent hour or two of data off local SSD (unless something goes catastrophically wrong), so we didn’t need 30 Kafka nodes with very, very large SSDs. Not only that, but with 30 individual Kafka brokers, if any one of them went bad you would be in the middle of reshuffling nodes, and then if you lost another one it would just be sitting idle, because you can’t do a Kafka rebalance while another rebalance is in process.

So we tried tiered storage, which would let us shrink from 30 to 6 Kafka nodes. The disks on those Kafka brokers might be a little larger, but not 5x larger; instead we’re sending that extra data off to AWS S3.

Then Liz, loving ARM64 so much, was like: why are we even using these monolithic nodes and local disks, isn’t EBS good enough? Can’t we use the highest-compute nodes and the highest-performance disks? So we are now doing three changes at the same time.


    we were actually testing Kafka on Graviton2 before even Confluent did


    probably the first to use it for production workloads


    changed too many variables at once


    wanted to move to tiered storage to reduce the number of instances


    but also tried the arch switch from i3en to c6gn+EBS at the same time


    we also introduced AWS Nitro (hypervisor)


    that was a mistake


    we published a blog post on this experience as well as a full incident report


    I highly recommend that you go read it to better understand the decisions we made and what we learned



  90. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Unexpected constraints
    Read more: go.hny.co/kafka-lessons
    We thrashed multiple dimensions.


    We tickled hypervisor bugs.


    We tickled EBS bugs.


    Burning our people out wasn't worth it.
And it exploded on us. We thought we were going to be right-sizing CPU and disk; instead we blew out the network dimensions. We blew out the IOPS dimensions. Technically, we did not blow our SLO through any of this. Except there is another hidden SLO. That SLO is that you do not page a Honeycomb team more often than twice a week.

Every engineer should have to work an out-of-hours incident no more than once every six months. They’re on call every month or two, so no more than one or two of those shifts should involve off-hours incident response.

We had to call a halt to the experiment. We were changing too many dimensions at once, chasing extra performance, and it was not worth it.


  91. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Existing incident response practices


    ● Escalate when you need a break / hand-
    off


    ● Remind (or enforce) time off work to
    make up for off-hours incident response


    Official Honeycomb policy


    ● Incident responders are encouraged to
    expense meals for themselves and family
    during an incident
    Take care of your people
    We have pretty good incident response practices, we have blameless retrospectives, we had people handing off work to each other saying “you know what, i’m too tired, i can’t work on this incident any more.” we had people taking breaks afterwards.


    Being an adult means taking care of each other and taking care of yourself. Please expense your meals during an incident.


    incidents happen


    we had existing practices that helped a lot


    the meal policy was one of those things that just made perfect sense once somebody articulated it


    it’s good to document and make official policy out of things that are often unspoken agreements or assumptions


    —-


    one of our values is that we hire adults, and adults have responsibilities outside of work


    you won’t build a sustainable, healthy sociotechnical system if you don’t account for that


    in general it’s good to document and make official policy out of things that are often unspoken agreements or assumptions


  92. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Ensure people don’t feel rushed.


    Complexity multiplies


    ● if a software program change takes t hours,


    ● software system change takes 3t hours


    ● software product change also takes 3t
    hours


    ● software system product change = 9t hours


    Maintain tight feedback loops, but not everything
    has an immediate impact.
    Optimize for safety
    Source: Code Complete, 2nd Ed.
We rushed a little in doing this. We didn’t blow our technical SLO, but we did blow our people SLO.


    hours are an imperfect measurement of complexity, but it’s a useful heuristic to keep in mind


    basically: complexity multiplies


    tight feedback loops help us isolate variables


    but some things just require observation over time


    isolating variables also makes it easier for people to update their mental models as changes go out



  93. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Retriever is performance-critical


    ● It calls to Lambda for parallel compute


    ● Lambda use exploded.


    ● Could we address performance & cost?


    ● Maybe.
    3) Retriever: query service
Its job is to ingest data, index it, and make it available for serving. (See Jess and Ian’s talk at Strange Loop.)

Retriever fans out to potentially tens of thousands or millions of individual column store files that are stored in AWS S3. So we adopted AWS Lambda and serverless to use massively parallel compute on demand to read through those files on S3.


  94. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 94
    94
Because we had seen really great results with Graviton2 for EC2 instances, we thought, maybe we should try that for Lambda too! So we deployed to 1%, then 50%. Then we noticed things were twice as slow at the 99th percentile. Which means we were not getting cost savings, because AWS Lambda bills by the millisecond, and we were delivering inferior results.


  95. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 95
    95
This is another example of how we were able to use our error budget to perform this experiment, and the controls to roll it back. And you can see that when we turned it off, it just turned off. Liz updated the flag at 6:48pm, and at 6:48pm you can see that orange line go to zero.


  96. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved. 96
    96
    Making progress carefully
    After that we decided we were not going to do experiments 50% at a time. We had already burned through that error budget so we started doing more rigorous performance profiling to identify the bottleneck. So we turn it on for a little bit and then we turn it back off. That way we get safety, stability, and the data we need to safely experiment.


  97. V6-21
    Fast and reliable: pick both!
    Go faster, safely.
    55:00


Chaos engineering is something to do once you’ve taken care of the basics.


  98. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Takeaways
    98
    98
    ● Design for reliability through full lifecycle.


    ● Feature flags can keep us within SLO, most of the time.

    ● But even when they can't, find other ways to mitigate risk.

    ● Discovering & spreading out risk improves customer experiences.

    ● Black swans happen; SLOs are a guideline, not a rule.
If you’re running your continuous delivery pipelines throughout the day, then stopping them becomes the anomaly, not starting them. So by designing our delivery pipeline for reliability through the full life cycle, we’ve ensured that we’re mostly able to meet our SLOs. Feature flags can keep us within SLOs most of the time, by managing the blast radius. Even when software flags can’t, there are other infrastructure-level things you can do, such as running special workers to segregate traffic that is especially risky.

By discovering risk at 3pm, not 3am, you ensure the customer experience is much more resilient, because you’ve actually tested the things that could go bump in the middle of the night.

But if you do have a black swan event, it’s a guideline, not a rule. You don’t HAVE to say we’re switching everyone over to entirely reliability work. If you have something like the massive Facebook DNS outage or a BGP outage… it’s okay to hit reset on your error budget and say, you know what, that’s probably not going to happen again. SLOs are for managing predictable-ish risks.


  99. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Examples of hidden risks


    ● Operational complexity


    ● Existing tech debt


    ● Vendor code and architecture


    ● Unexpected dependencies


    ● SSL certificates


    ● DNS


    Discover early and often through testing.
    Acknowledge hidden risks


  100. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Make experimentation routine!
    100
    100
If you do this continuously, all the time, a conversation like this becomes no longer preposterous. This actually would be chaos engineering, not just chaos. We have the ability to measure and watch our SLOs, we have the ability to limit the blast radius, and the observability to carefully inspect the result. That’s what makes it reasonable to say, “hey, let’s try denial-of-servicing our own workers, let’s try restarting the workers every 20 seconds and see what happens. Worst case, hit control-C on the script and it stops.”


  101. V6-21
    © 2021 Hound Technology, Inc. All Rights Reserved.
    Takeaways
    101
    101
    ● We are part of sociotechnical systems: customers, engineers, stakeholders


    ● Outages and failed experiments are unscheduled learning opportunities

    ● Nothing happens without discussions between different people and teams

    ● Testing in production is fun AND good for customers

    ● Where should you start? DELIVERY TIME DELIVERY TIME DELIVERY TIME
SLOs are an opportunity to have these conversations: to find opportunities to move faster, to talk about the tradeoffs between stability and speed, and to ask whether there are creative things we can do to say yes to both.


  102. V6-21
    Understand & control production.
    Go faster on stable infra.


    Manage risk and iterate.
    102


  103. V6-21
    Read our blog! hny.co/blog

    We're hiring! hny.co/careers
    Find out more


  104. V6-21
    www.honeycomb.io
