
What got you here won't get you there (CodeFreeze 2020)

How your team can become a high-performing team by embracing observability

Charity Majors

January 16, 2020

Transcript

  1. What got you here won't get you there
    How your team can become a ✨high-performing✨ team by
    embracing observability.
    Observability and the Glorious Future

  2. @mipsytipsy
    engineer/cofounder/CTO
    https://charity.wtf
    “the only good diff is a red diff”

  3. What does it mean to be a high-performing team?
    Why should you care? How can you convince others to care?
    How does a team develop a high-performing practice?
    Isn't this supposed to be a talk about observability?
    (Yes!)

  4. Why are computers hard?
    Because we don't understand them
    And we keep shipping things anyway
    You never learned to debug with science
    Vendors have happily misled you for $$$$

  5. We have ✨science✨ now!
    What does it mean to be a high-performing team?

  6. You only need to track ✨four things✨ to see where you stand.
    • How frequently do you deploy?
    • How long does it take for each deploy to go live?
    • How many of your deploys fail?
    • How long does it typically take to recover?
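    A minimal sketch of computing these four numbers (the DORA "four key metrics"),
    assuming you keep one record per deploy with when the change was ready, when it
    went live, whether it failed, and when you recovered; the record fields and
    function names here are illustrative, not any particular tool's schema:

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import List, Optional

    @dataclass
    class Deploy:
        merged_at: datetime                       # change ready to ship
        deployed_at: datetime                     # change live in production
        failed: bool = False                      # did this deploy cause a failure?
        recovered_at: Optional[datetime] = None   # service restored, if it failed

    def four_key_metrics(deploys: List[Deploy], window_days: int = 30) -> dict:
        lead_times = sorted(d.deployed_at - d.merged_at for d in deploys)
        failures = [d for d in deploys if d.failed]
        recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
        return {
            "deploy_frequency_per_day": len(deploys) / window_days,
            "median_lead_time": lead_times[len(lead_times) // 2] if lead_times else None,
            "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
            "mean_time_to_recover": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
        }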

  7. It really, really, really,
    really, really
    pays off
    to be
    a high performer.
    Really.

  8. Elite teams are made up of all ex-Facebook, ex-Google, MIT grads...

  9. Elite teams are made up of normal engineers who:
    take pride in their craft,
    care about their users,
    have time to fix and iterate.
    (Instead of "elite", let's say "excellent"?)
    Excellent teams are made up of engineers who care about their work,
    communicate with each other, invest in incremental improvements, and
    are empowered to do their jobs.

  10. What you need is production excellence.
    this work begins with observability.
    Happier customers, happier teams.

  11. Every engineering org has two constituencies:
    1. make your users happy
    2. make your team happy

  12. The world is changing fast.

  13. Complexity is soaring
    • Ephemeral and dynamic
    • Far-flung and loosely coupled
    • Partitioned, sharded
    • Distributed and replicated
    • Containers, schedulers
    • Service registries
    • Polyglot persistence strategies
    • Autoscaled, multiple failover
    • Emergent behaviors
    • ... etc

  14. Architectural complexity: 2003 vs. 2013 (diagram)

  15. We are bad at understanding our systems.

  16. Tools for understanding them:
    known-unknowns → monitoring
    unknown-unknowns → observability

  17. ("understand" lol)

  18. Observability is NOT the same as monitoring.

  19. What's the difference between monitoring and observability?

  20. @grepory, Monitorama 2016:
    “Monitoring is dead.”
    “Monitoring systems have not changed significantly in 20 years and have fallen behind the way we
    build software. Our software is now large distributed systems made up of many non-uniform
    interacting components, while the core functionality of monitoring systems has stagnated.”

  21. Observability
    “In control theory, observability is a measure of how well internal
    states of a system can be inferred from knowledge of its external
    outputs. The observability and controllability of a system are
    mathematical duals.” — wikipedia
    … translate??!?

  22. Observability for software engineers:
    Can you understand what’s happening inside your
    systems, just by asking questions from the outside? Can
    you debug your code and its behavior using its output?
    Can you answer new questions without shipping new code?

  23. You have an observable system
    when your team can quickly and reliably track
    down any new problem with no prior knowledge.
    For software engineers, this means being able to
    reason about your code, identify and fix bugs, and
    understand user experiences and behaviors ...
    via your instrumentation.

  24. Monitoring
    Represents the world from the perspective of a third party, and describes
    the health of the system and/or its components in aggregate.
    Observability
    Describes the world from the perspective of the software, as it performs
    each request. Software explaining itself back to you from the inside.

  25. Complexity is exploding everywhere, but our tools are designed for
    a predictable world.
    We don’t *know* what the questions are; all we have are unreliable
    symptoms or reports. As soon as we know the question, we usually
    know the answer too.

  26. Many catastrophic states exist at any given time.
    Your system is never entirely ‘up’

  27. Observability is...
    • High cardinality
    • High dimensionality
    • Exploratory, open-ended
    • Based on arbitrarily-wide structured events with span support
    • No indexes, schemas, or predefined structure
    • About understanding unknown-unknowns with no prior knowledge
    • About systems, not code. Where in the system is the code you need to fix?
    • Young. Early. There is much still to be discovered.
    • Aligned with the user's experience.
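    A concrete illustration of "arbitrarily-wide structured events": one event per
    request, with as many fields as you can attach, high-cardinality values welcome.
    This is a minimal sketch; the field names, the stand-in handler, and printing
    JSON to stdout are assumptions, not any particular vendor's API:

    import json, time, uuid

    def handle_request(path: str, user_id: str) -> dict:
        # start one wide event for this request; keep appending fields as you learn them
        event = {
            "trace_id": str(uuid.uuid4()),    # lets the event be stitched into a trace/span later
            "name": "http_request",
            "endpoint": path,
            "user_id": user_id,               # high-cardinality fields are the whole point
            "build_id": "2020.01.16-abc123",  # illustrative value
        }
        start = time.monotonic()
        try:
            result = {"status": 200, "body": f"photos for {user_id}"}  # stand-in for real work
            event["status_code"] = result["status"]
            return result
        except Exception as exc:
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
            print(json.dumps(event))          # emit exactly one wide event per request

    handle_request("/photos", "user-42")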

  28. Observability is not...
    • Able to be built on top of a metrics store
    • Comprised of pillars (this is shitty vendorspeak)
    • Achievable with preaggregation.
    • Achievable without sampling (or infinite money) (at scale) -- sampling sketch after this list
    • About the health of the backend or services.
    • Achievable without instrumentation
    • Doable without tracing.
    • Or exclusively about tracing.
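    On the sampling point above: at scale you keep a fraction of events and record
    the sample rate on each event you do keep, so totals can be reweighted at query
    time. A minimal sketch of head-based sampling; the rate and the field name are
    illustrative assumptions:

    import random

    SAMPLE_RATE = 20   # keep roughly 1 in 20 events

    def maybe_emit(event: dict) -> None:
        if random.randint(1, SAMPLE_RATE) == 1:
            event["sample_rate"] = SAMPLE_RATE   # each kept event stands for ~20 real ones
            print(event)                         # send to your backend in real life

    # at query time, weight each kept event by its sample_rate:
    # estimated_count = sum(e["sample_rate"] for e in kept_events)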

  29. LAMP stack
    “Photos are loading slowly for some people. Why?”
    • Errors or latency are high. We will look at several dashboards that reflect
    common root causes, and one of them will show us why.
    • The app tier capacity is exceeded. Maybe we rolled out a build with a
    perf regression, or maybe some app instances are down.
    • DB queries are slower than normal. Maybe we deployed a bad new
    query, or there is lock contention.
    These are known-unknowns. Monitor for them.

  30. Distributed systems
    “Any microservice running on c2.4xlarge instances and PIOPS storage in
    us-east-1b has a 1/20 chance of running on degraded hardware, and will
    take 20x longer to complete requests that hit the disk with a blocking
    call. This disproportionately impacts people looking at older archives
    due to our fanout model.”
    “Canadian users who are using the French language pack on an iPad
    running iOS 9 are hitting a firmware condition which makes it fail saving
    to local cache … which is why it FEELS like photos are loading slowly.”
    “Our newest SDK makes db queries sequentially if the developer has
    enabled an optional feature flag. Working as intended; the reporters all
    had debug mode enabled. But the flag should be renamed for clarity’s sake.”
    Monitor for .... ???

  31. Distributed systems
    "I have twenty microservices and a
    sharded db and three other data
    stores across three regions, and
    everything seems to be getting a little
    bit slower over the past two weeks but
    nothing has changed that we know of,
    and oddly, latency is usually back to
    the historical norm on Tuesdays."
    “All twenty microservices have 10% of
    available nodes enter a crash loop
    about five times a day, at
    unpredictable intervals. They have
    nothing in common and it doesn’t
    seem to impact the stateful services. It
    clears up before we can debug it,
    every time. We have tried replacing
    the instances."
    “Our users can compose their own
    queries that we execute server-side,
    and we don’t surface it to them when
    they are accidentally doing full table
    scans or even multiple full table scans,
    so they blame us.”

  32. Distributed systems
    “Users in Romania are complaining
    that all push notifications have been
    down for days. This seems
    impossible, since we share a queue
    with them."
    “Disney is complaining that once in a
    while, but not always, they don’t see
    the profile photo they expected to see
    — they see someone else’s photo!
    When they refresh, it’s fixed.”
    “Sometimes a bot takes off, or an app
    is featured on the iTunes store, and it
    takes us a long time to track down
    which app or user is generating
    disproportionate pressure on shared
    system components.”
    “We run a platform, and it’s hard to
    programmatically distinguish
    between errors that users are
    inflicting on themselves and problems
    in our code, since they all manifest as
    errors or timeouts."

  33. Distributed systems
    These are all unknown-unknowns, which may never have happened before
    and may never happen again.
    (welcome to distributed systems)

  34. LAMP stack
    technical aspects, cultural associations
    • THE database
    • THE application
    • Known-unknowns and mostly
    predictable failures
    • Many monitoring checks
    • Many paging alerts
    • "Flip a switch" to deploy
    • Failures to be prevented
    • Production is to be feared
    • Debug by intuition and scar tissue of
    past outages
    • Canned dashboards
    • Deploys are scary
    • Masochistic on-call culture

  35. LAMP stack
    • Dev/Ops
    • Fragile, forbidding edifice
    • "Glass Castle"
    We have built our systems
    like glass castles,
    fragile and forbidding,
    hostile to exploration and
    experimentation.

  36. Distributed systems
    technical aspects, cultural associations
    • Many storage systems
    • Diversity of service types
    • Unknown-unknowns; every alert is
    novel
    • Rich, flexible instrumentation
    • Few paging alerts
    • Deployment is like baking
    • Failures are your friend
    • Production is where your users live
    • Debug methodically by examining the
    evidence
    • Events and full context, not metrics
    • Deploys are opportunities
    • Humane on-call culture

  37. Distributed systems
    best practices
    • Software ownership -- you build it, you run it
    • Robust, resilient, built for experimentation and delight
    • Human scale, safety measures baked in

  38. Here's the dirty little secret.
    The next generation of systems won't be built and run by burned-out, exhausted
    people, or by command-and-control teams just following orders.
    It can't be done.
    they've become too complicated. too hard.

  39. We have tools that help us ask and answer questions, especially if we
    define them in advance. As soon as we know the question, we usually
    know the answer too.
    But we don’t know what the questions actually are; all we have are
    unreliable reports. Our tools were designed for a predictable world.

  40. We can no longer fit these systems in our heads
    and reason about them -- if we try, we'll be
    outcompeted by teams who use proper tools.
    Our systems are emergent and unpredictable. We need more than
    just your logical brain; we need your full creative self.

  41. How observability leads to
    high-performing teams.
    Resiliency
    High-quality code
    Predictable releases
    Manageable complexity and tech debt
    User behavior
    https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/

  42. Resiliency
    Doing well:
    • System uptime meets your goals
    • Alerts are not ignored
    • Oncall is not excessively stressful
    • Staff turnover is low; no burnout
    Doing poorly:
    • Outages are frequent
    • Spurious alerts
    • Alert fatigue
    • Troubleshooting is unpredictable/hard
    • Repair is unpredictable/time-consuming
    • Some critical members get fried
    O11y gives you context and helps you resolve incidents swiftly

  43. High-quality code
    Doing well:
    • Code is stable
    • Customer happiness, not support
    • Debugging is intuitive
    • No cascading failures
    Doing poorly:
    • Customer support costs are high
    • High % of engineering time spent on bugs
    • Fear around the deploy process
    • Long time to find and repro bugs
    • Unpredictable time to solve problems
    • Low confidence in code when shipped
    O11y lets you watch deploys and find bugs early

  44. Predictable releases
    Doing well:
    • Release cadence matches goals
    • Code goes to prod immediately
    • Code paths can be turned on/off easily (sketch below)
    • Deploy/rollback are fast
    Doing poorly:
    • Releases are infrequent
    • Releases need lots of human intervention
    • Many changes ship at once
    • Releases are order-dependent
    • Sales has to gate releases on the promise train
    • People avoid doing deploys at certain times
    O11y helps you manage your complex build pipeline as well as deploys,
    so you can ship swiftly and with confidence
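    A minimal sketch of the "code paths turned on/off easily" idea: gate the new
    path behind a flag so shipping code and releasing behavior are decoupled. The
    flag store (an environment variable) and all names here are illustrative
    assumptions:

    import os

    def flag_enabled(name: str) -> bool:
        # e.g. FLAG_NEW_PHOTO_LOADER=1 flips the new path on without a deploy or rollback
        return os.environ.get(f"FLAG_{name.upper()}", "0") == "1"

    def load_photos_v1(user_id: str) -> list:
        return [f"legacy-photo-for-{user_id}"]        # stand-in for the existing path

    def load_photos_v2(user_id: str) -> list:
        return [f"new-pipeline-photo-for-{user_id}"]  # stand-in for the new path, shipped dark

    def load_photos(user_id: str) -> list:
        if flag_enabled("new_photo_loader"):
            return load_photos_v2(user_id)   # turned on per environment once you trust it
        return load_photos_v1(user_id)       # old path stays available for instant "rollback"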

  45. Manageable complexity and tech debt
    Doing well:
    • Spend your time on actual goals
    • Bugs and reliability are tractable
    • Easy to find the code to fix
    • Answer any question w/o shipping new code
    Doing poorly:
    • Waste time rebuilding and refactoring
    • Teams are distracted by fixing the wrong thing, or fixing it the wrong way
    • Uncontrollable ripple effects from a local change
    • "Haunted graveyard" where people are afraid to make changes
    O11y helps you do the right work at the right time

  46. Understand user behavior
    Doing well:
    • Instrumentation is easy to add
    • Devs have easy access to KPIs
    • Feature flagging
    • PMs have a useful view of customers
    • Teams share a view of reality
    Doing poorly:
    • Product doesn't have its finger on the pulse
    • Devs feel their work doesn't have impact
    • Features get scope creep
    • Product-market fit isn't achieved
    O11y grounds you in reality.

  47. "But I don't have time to invest in observability..."
    You can't afford not to.

  48. You can't afford not to.

  49. Eng quality of life is linked to high-performing teams and resilient
    systems.

  50. p.s. o11y is also a prerequisite for other modern best practices, like ...
    chaos engineering
    sane deploys
    testing in production

  51. where are we going?

  52. on call will be shared by everyone who writes code.
    on call must not be miserable.
    (on call will be less like a heart attack,
    more like dentist visits or gym appointments)

  53. serverless was a harbinger
    deploy-less is coming

  54. invest in your deploys, democratize access to data
    don't be scared by regulations

  55. build your devs a playground
    ... but build guard rails
    encourage curiosity, emphasize ownership. don't punish. get up to your
    elbows in prod EVERY DAY
    practice many small failures
    practice, practice, practice
    senior engineers: amplify hidden costs

  56. The world is changing fast.

  57. Every engineering org has two constituencies:
    1. make your users happy
    2. make your team happy

  58. Corollary:
    1. nines don't matter if users aren't happy
    2. great teams build high-quality systems

  59. we have an opportunity here to make things better
    let's do it <3

  60. Charity Majors
    @mipsytipsy
