Observability and Complex Systems (devopsdays AMS)

Distributed systems, microservices, containers and schedulers, polyglot persistence … modern infrastructure patterns are fluid and dynamic, chaotic and transient. So why are we still using LAMP-stack-era tools to debug and monitor them? We’ll cover some of the many shortcomings of traditional metrics and logs (and APM tools backed by metrics or logs), and show how complexity is their kryptonite.

So how do we handle the coming complexity Armageddon? It starts with more modern approaches to observability, such as client-side structured data, event-driven debugging, distributed tracing, and more; no matter how many tags you add to a metric store, it still can’t tell a story the way events can. It also means shifting perspective away from “monitoring” and toward “instrumentation”.
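
To make “structured events, not metrics” concrete: a metric can only increment a counter, while one wide, structured event per request carries the whole story. A minimal sketch in Python; emit_event() and every field name here are illustrative, not any particular vendor’s API:

```python
import json
import time

BUILD_ID = "example-build"  # illustrative

def emit_event(fields):
    # Stand-in event sink; a real system would ship this to an event store.
    print(json.dumps(fields))

def handle_photo_request(user_id, photo_id, region, trace_id):
    start = time.time()
    # ... the actual request handling would go here ...
    emit_event({
        "event": "photo_fetch",
        "trace_id": trace_id,   # ties the event into a distributed trace
        "user_id": user_id,     # high-cardinality fields, on purpose
        "photo_id": photo_id,
        "region": region,
        "build_id": BUILD_ID,
        "duration_ms": round((time.time() - start) * 1000, 3),
    })

handle_photo_request(user_id=42, photo_id="p-9137",
                     region="us-east-1b", trace_id="3f1a9c")
```

A counter named photo_fetch_total can say that photos were fetched; the event can say which user, which photo, which region, and which build. That is the story.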

Most problems are transient or self-healing, and you cannot possibly alert on (or even predict) the long tail of possible partial failures. So you need to turn finding arbitrarily complex causes into a support problem, not an engineering problem. How? Well … that’s the fun part.

Charity Majors

June 27, 2019

Transcript

  1. Charity Majors
    @mipsytipsy
    Observability & Complex Systems
    What got you here won't get you there, and other terrifying true
    tales from the computing frontier

  2. @mipsytipsy
    engineer/cofounder/CEO
    https://charity.wtf
    “the only good diff is a red diff”

  3. A short partial list of things I would like to touch on...
    "chaos engineering"
    you must be this tall to ride this ride.
    (are you? how do you evaluate this?)
    observability
    business intelligence, aka why nothing we are doing is remotely new
    why tools create silos
    the implications of democratizing access to data
    particularly for levels and career progressions
    how deploys must change
the misallocation of internal tooling energy away from deploy software
    why you need to test in prod
    why you need a canary (probably)
    when to know you need a canary

  4. "chaos engineering"
    you must be this tall to ride this ride.
    (are you? how do you evaluate this?)
    business intelligence, aka why nothing we are doing is remotely new
    why tools create silos
    the implications of democratizing access to data
    particularly for levels and career progressions
    how deploys must change
the misallocation of internal tooling energy away from deploy software
    why you need to test in prod
    why you need a canary (probably)
    when to know you need a canary
    why you definitely need feature flags, no matter what
    test doesn't mean what you think it means
    continued ...

  5. the future of development is observability-driven development.
    "O-D-D yeah YOU KNOW ME"
    why we have to stop leaning on intuition and tribal knowledge before it is
    too late
    why AIOps is stupid and doomed
why the team is your best source of wisdom
    why wisdom is not truth
    why ops needs to learn about design principles, stat
why vendors are rushing to co-opt the observability message before you notice
    they don't actually fulfill the demands, and why this makes me Very Stabby
    cont'd ... just a brief outline

  6. "How did we get here?"

  7. The trifecta:
    Monitoring (time series databases, dashboards, 'metric' tools)
    Logs (messy-ass strings, really)
    More recently, APM and tracing.

  8. "What do we need to get where we're
    going?"

  9. Our idea of what the software development
    lifecycle even looks like is overdue for an upgrade
    in the era of distributed systems.

  10. Deploying code is not a binary switch.
    Deploying code is a process of increasing your
    confidence in your code.
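
One concrete way to treat a deploy as a confidence ramp is a percentage rollout behind a feature flag, which the outline above calls non-negotiable. A minimal sketch, assuming a deterministic hash-based bucketing scheme; flag_enabled() and the flag name are hypothetical:

```python
import hashlib

def flag_enabled(flag, user_id, rollout_percent):
    """Bucket a user into 0-99 deterministically: the same user always
    lands in the same bucket, so a ramp of 1% -> 10% -> 100% is stable."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def serve_photo(user_id):
    # Ship the new path dark, then widen the ramp as observed
    # confidence in production grows.
    if flag_enabled("new_photo_pipeline", user_id, rollout_percent=5):
        return "new pipeline"  # stand-in for the new code path
    return "old pipeline"      # stand-in for the old code path

print(serve_photo(user_id=42))
```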

  11. Development Production
    deploy

  12. Observability
    Development Production

  13. Observability
    Development Production

  14. why now?

  15. “Complexity is increasing” - Science

  16. Architectural complexity
    [architecture diagrams: LAMP stack, 2005 vs. Parse, 2015]

  17. monitoring => observability
    known unknowns => unknown unknowns
    LAMP stack => distributed systems

  18. We are all distributed systems
    engineers now
    the unknowns outstrip the knowns
    why does this matter more and more?

  19. Distributed systems are particularly hostile to being
    cloned or imitated (or monitored).
    (clients, concurrency, chaotic traffic patterns, edge cases …)

  20. Distributed systems have an infinitely long list of
    almost-impossible failure scenarios that make staging
    environments particularly worthless.
    this is a black hole for engineering time

  21. Operational literacy
    Is not a nice-to-have

  22. Without observability, you don't have "chaos
    engineering". You just have chaos.
    So what is observability?

  23. Observability is NOT the same as monitoring.

  24. @grepory, Monitorama 2016
    “Monitoring is dead.”
“Monitoring systems have not changed significantly in 20 years and have fallen behind the way we
    build software. Our software is now large distributed systems made up of many non-uniform
    interacting components while the core functionality of monitoring systems has stagnated.”

  25. Observability
    “In control theory, observability is a measure of how well internal
    states of a system can be inferred from knowledge of its external
    outputs. The observability and controllability of a system are
    mathematical duals." — wikipedia
    … translate??!?
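
For the record, the standard textbook statement behind that quote, for a linear system with internal state x and observed output y (this is the classical Kalman rank condition, not anything specific to software):

```latex
\dot{x} = A x, \qquad y = C x
% x can be reconstructed from y iff the observability matrix
% has full rank n, where n is the number of state variables:
\mathcal{O} = \begin{pmatrix} C \\ CA \\ \vdots \\ CA^{\,n-1} \end{pmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n
```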

  26. Observability... for software engineers:
    Can you understand what’s happening inside your
    systems, just by asking questions from the outside? Can
    you debug your code and its behavior using its output?
    Can you answer new questions without shipping new code?
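
If every request leaves behind a wide event, answering a new question is a query over data you already have, not a new deploy. A toy in-memory illustration; the events and field names are made up:

```python
# Events captured before anyone thought to ask this question:
events = [
    {"user_id": 42, "region": "us-east-1b", "instance": "c2.4xlarge",
     "duration_ms": 912.0, "hit_disk": True},
    {"user_id": 7, "region": "eu-west-1", "instance": "m5.large",
     "duration_ms": 14.2, "hit_disk": False},
]

# The new question, asked after the fact, with no new instrumentation:
slow_disk_reads = [e for e in events
                   if e["hit_disk"] and e["duration_ms"] > 500]
print(slow_disk_reads)
```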

  27. Monitoring
    Represents the world from the perspective of a third party, and
    describes the health of the system and/or its components in aggregate.
    Observability
    Describes the world from the first-person perspective of the software,
    executing each request. Software explaining itself from the inside out.

  28. Complexity is exploding everywhere,
    but our tools are designed for
    a predictable world.
    We don’t *know* what the questions are;
    all we have are unreliable symptoms or reports.
    As soon as we know the question, we usually
    know the answer too.

  29. Welcome to distributed systems.
    it’s probably fine.
    (it might be fine?)

  30. Your system is never entirely ‘up’
    Many catastrophic states exist at any given time.

  31. Distributed systems have an infinitely long list of
    almost-impossible failure scenarios that make staging
    environments particularly worthless.
    this is a black hole for engineering time

  32. You do it.
    You have to do it.
    Do it well.

  33. Let’s try some examples!
    Can you quickly and reliably track down problems like these?

  34. “Photos are loading slowly for some people. Why?”
    Monitoring (old-school LAMP stack): monitor these things.
    The app tier capacity is exceeded. Maybe we
    rolled out a build with a perf regression, or
    maybe some app instances are down.
    DB queries are slower than normal. Maybe
    we deployed a bad new query, or there is lock
    contention.
    Errors or latency are high. We will look at
    several dashboards that reflect common root
    causes, and one of them will show us why.
  35. “Photos are loading slowly for some people. Why?”
    (microservices)
    Any microservices running on c2.4xlarge
    instances and PIOPS storage in us-east-1b has a
    1/20 chance of running on degraded hardware,
    and will take 20x longer to complete for requests
    that hit the disk with a blocking call. This
    disproportionately impacts people looking at
    older archives due to our fanout model.
    Canadian users who are using the French
    language pack on the iPad running iOS 9, are
    hitting a firmware condition which makes it fail
    saving to local cache … which is why it FEELS
    like photos are loading slowly
    Our newest SDK makes db queries
    sequentially if the developer has enabled an
    optional feature flag. Working as intended;
    the reporters all had debug mode enabled.
But the flag should be renamed for clarity’s sake.
    wtf do i ‘monitor’ for?!
    Monitoring?!?

  36. Problems → Symptoms
    "I have twenty microservices and a sharded
    db and three other data stores across three
    regions, and everything seems to be getting a
    little bit slower over the past two weeks but
    nothing has changed that we know of, and
    oddly, latency is usually back to the historical
    norm on Tuesdays.
“All twenty app microservices have 10% of
    available nodes enter a simultaneous crash
    loop cycle, about five times a day, at
    unpredictable intervals. They have nothing in
    common afaik and it doesn’t seem to impact
    the stateful services. It clears up before we
    can debug it, every time.”
    “Our users can compose their own queries that
    we execute server-side, and we don’t surface it
    to them when they are accidentally doing full
    table scans or even multiple full table scans, so
    they blame us.”
    Observability
    (microservices)

  37. Still More Symptoms
    “Several users in Romania and Eastern
    Europe are complaining that all push
    notifications have been down for them … for
    days.”
    “Disney is complaining that once in a while,
    but not always, they don’t see the photo they
    expected to see — they see someone else’s
    photo! When they refresh, it’s fixed. Actually,
    we’ve had a few other people report this too,
    we just didn’t believe them.”
    “Sometimes a bot takes off, or an app is
    featured on the iTunes store, and it takes us a
    long long time to track down which app or user
    is generating disproportionate pressure on
    shared components of our system (esp
    databases). It’s different every time.”
    Observability
    “We run a platform, and it’s hard to
    programmatically distinguish between
    problems that users are inflicting themselves
    and problems in our own code, since they all
    manifest as the same errors or timeouts."
    (microservices)

  38. These are all unknown-unknowns
    that may never have happened before, and may never happen again
    (They are also the overwhelming majority of what you have
    to care about for the rest of your life.)

  39. Three principles of software ownership:
    They who write the code
    Can and should deploy their code
And watch it run in production.
    (**and be on call for it)

  40. When healthy teams with good cultural values and
    leadership alignment try to adopt software
    ownership and fail, the cause is usually an
    observability gap.

  41. Software engineers spend too much time looking at code in elaborately
    falsified environments, and not enough time observing it in the real world.
    Tighten feedback loops. Give developers the
    observability tooling they need to become fluent in
    production and to debug their own systems.
    We aren’t “writing code”.
    We are “building systems”.

  42. Observability for SWEs and the Future™
    well-instrumented
    high cardinality
    high dimensionality
    event-driven
    structured
    well-owned
    sampled
    tested in prod.
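
As one example of the “sampled” item in that list: a head-based sampling sketch, under the assumption that the sample rate is stamped on every kept event so counts can be re-weighted at query time (names illustrative):

```python
import json
import random

SAMPLE_RATE = 10  # keep roughly 1 in 10 events; tune for traffic volume

def maybe_emit(event):
    """Drop most events to keep costs sane at scale, but record the
    rate on survivors: each kept event stands in for SAMPLE_RATE
    requests when you aggregate later."""
    if random.randrange(SAMPLE_RATE) == 0:
        event["sample_rate"] = SAMPLE_RATE
        print(json.dumps(event))  # stand-in for a real event pipeline

maybe_emit({"event": "photo_fetch", "user_id": 42, "duration_ms": 31.5})
```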

  43. Watch it run in production.
    Accept no substitute.
    Get used to observing your systems when they AREN’T on fire

  44. Real data
    Real users
    Real traffic
    Real scale
    Real concurrency
    Real network
    Real deploys
    Real unpredictabilities.

  45. You care about each and every tree, not the forest.
    "The health of the system no longer really matters" -- me

  46. Zero users care what the “system” health is
    All users care about THEIR experience.
    Nines don’t matter if users aren’t happy.
    Nines don’t matter if users aren’t happy.
    Nines don’t matter if users aren’t happy.
    Nines don’t matter if users aren’t happy.
    Nines don’t matter if users aren’t happy.

  47. Observability for SWEs and the Future™
    well-instrumented
    high cardinality
    high dimensionality
    event-driven
    structured
    well-owned
    sampled
    tested in prod.

  48. You win …
    Drastically fewer paging alerts!

  49. Charity Majors
    @mipsytipsy

  50. Charity Majors
    @mipsytipsy

  51. You must be able to break down by 1/millions and
    THEN by anything/everything else
    High cardinality is not a nice-to-have
    ‘Platform problems’ are now everybody’s problems
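
What “break down by 1/millions and THEN by anything else” might look like as a query: first find the one tenant generating disproportionate database pressure, then slice the offender by another field. A toy sketch over the same kind of wide events (data made up):

```python
from collections import Counter

events = [
    {"user_id": 42, "endpoint": "/feed", "db_ms": 910.0},
    {"user_id": 42, "endpoint": "/feed", "db_ms": 870.5},
    {"user_id": 7, "endpoint": "/photos", "db_ms": 12.3},
]

# Break down by user_id first (millions of possible values)...
pressure = Counter()
for e in events:
    pressure[e["user_id"]] += e["db_ms"]
top_user, _ = pressure.most_common(1)[0]
print(top_user)  # the one tenant hammering the db

# ...THEN by anything/everything else, e.g. endpoint, for that user:
by_endpoint = Counter(e["endpoint"] for e in events
                      if e["user_id"] == top_user)
print(by_endpoint)
```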
