Keep calm and carry on: scaling your org with microservices

From DDD Exchange in London, April 2018

Charity Majors

April 27, 2018
Transcript

  1. Keep Calm and Carry On:
    Scaling Your Org With Microservices
    Charity Majors, @mipsytipsy

  3. @mipsytipsy
    engineer/cofounder/CEO
    “the only good diff is a red diff”

  4. @mipsytipsy
    engineer, cofounder, CEO

  5. What even is a
    microservice?
    (No one knows.)

  6. What are microservices?
    • Monorepo — sometimes

    • Independently deployable, small modular services

    • Decentralized governance

    • Small teams, up to maybe a dozen people

    • Operating independently, interacting with other teams via APIs

  7. Naming a thing has power
    convention
    design patterns
    community
    best practices
    vetted in multiple environments
    “hey, haven’t we been doing this forever?”

  8. microservices (n.)
    a real-world application of distributed systems engineering principles to
    software architecture

  9. Welcome to distributed systems.
    it’s probably fine.
    (it might be fine?)

  10. “Dear Twitter …”

  11. Architectural complexity
    Parse, 2015
    LAMP stack, 2005

  12. monitoring => observability
    known unknowns => unknown unknowns

  13. Many catastrophic states exist at any given time.
    Your system is never entirely ‘up’.

  14. Welcome to distributed systems.
    Everything fails.
    All the time.

  15. “Complexity is increasing” - Science

  16. Monolith
    Microservices

  17. You need:
    a new mindset
    new habits
    new tools
    a sense of humor
    forgiveness :)
    remember … you are still an early adopter!

  19. What changes?
    devs & ops => software owners
    monitoring => observability
    staging => test in prod
    availability => resiliency
    aggregation => sampling
    … all your communication
    … your entire org structure

  20. Software needs owners.
    Not operators, not developers
    Owners have impact on the full lifecycle
    of their software: build, fix, listen, patch,
    commit, deploy, revert, rollback,
    instrument, understand, anticipate,
    verify, validate.
    devs & ops => software owners

  21. (image-only slide)

  22. and from a DBA at a different company …

  23. (image-only slide)

  24. The most powerful weapon in your arsenal
    is always cause and effect.
    Engineers should be on call
    for their own services.

  25. Corollary: on-call must not be hell.
    • Guard your people’s time and sleep

    • No hero complexes. No martyrs.

    • Don’t over-page. Align engineering pain with customer pain

    • Roll up non-urgent alerts for daytime hours

    • Your most valuable paging alerts are end-to-end checks on
    critical code paths.

  26. Probe every software engineering candidate
    for their ops experience & attitude.
    … yep, even FE/mobile devs!

  27. you are signaling …
    “Operations is valued here.”

  28. Operations engineering is about making systems
    maintainable, reliable, and comprehensible.
    Senior software engineers should be reasonably good at these things.
    So if they are not, don’t promote them.

  29. staging => test in prod

  30. (image-only slide)

  31. Distributed systems are particularly hostile to being
    cloned or imitated.
    (clients, concurrency, chaotic traffic patterns, edge cases …)
    These systems have an infinitely long list of almost-impossible failure scenarios
    that make staging copies particularly worthless.
    this is a black hole for engineering time

  32. test before prod:
    … the basics. the simple stuff.
    unit tests
    integration tests
    functional tests
    basic failover

  33. test in prod:
    … where shit gets real.
    behavioral tests
    experiments
    load tests (!!)
    edge cases
    canaries
    rolling deploys
    multi-region

  34. That energy is better used elsewhere:
    Production.
    You can catch 80% of the bugs with 20% of the effort. And you should.
    @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q

  35. also build or use:
    feature flags (LaunchDarkly)
    high-cardinality tooling (Honeycomb)
    gate your releases ()
    canaries
    shadow systems (goturbine, linkerd)
    capture/replay for databases (apiary, percona)
    jk, don’t build your own
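    As a sketch of the release-gating idea on this slide: a minimal percentage-based feature flag. The `flag_enabled` helper and flag names are hypothetical, invented for illustration; hosted services like LaunchDarkly do this with far more machinery (targeting rules, kill switches, audit logs).

    ```python
    import hashlib

    def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
        """Deterministically bucket a user into a 0-99 slot per flag.

        The same user always lands in the same bucket, so a flag at 10%
        keeps the same 10% of users enabled as you ramp it up.
        """
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_pct

    # fully on / fully off behave as expected
    assert flag_enabled("new-photos-ui", "user-42", 100)
    assert not flag_enabled("new-photos-ui", "user-42", 0)
    # a user enabled at some percentage stays enabled as the rollout widens
    if flag_enabled("new-photos-ui", "user-42", 10):
        assert flag_enabled("new-photos-ui", "user-42", 50)
    ```

    Hashing (rather than a random coin flip per request) is what makes the gate deterministic: a user does not flicker in and out of the feature between requests.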

  36. Failure is not rare
    Practice shipping and fixing lots of small problems
    And practice on your users!!

  37. Does everyone …
    know what normal looks like?
    know how to deploy?
    know how to roll back?
    know how to canary?
    know how to debug in production?
    Practice!!~

  38. (image-only slide)

  39. Failure: it’s “when”, not “if”
    (lots and lots and lots of “when’s”)

  40. 1. Canarying. Automated canarying. Promotion of canaries.
    2. Making deploys more automated and robust
    3. Making the fastest path the correctest/safest path
    4. Limiting the critical path. Limiting the blast radius.
    5. Shipping features behind feature flags
    6. Making rollbacks just another boring deploy
    7. Instrumentation. Good defaults. Test on employees.
    Your allies:
    These are *always* a good use of your time.
    (Staging is *sometimes* a good use of your time)
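    A toy sketch of ally #1, "promotion of canaries" (the `should_promote` helper and its thresholds are invented for illustration, not any real tool): promote only if the canary is no worse than the baseline within explicit tolerances; otherwise a rollback is just another boring deploy.

    ```python
    def should_promote(canary: dict, baseline: dict,
                       err_tolerance: float = 0.005,
                       latency_tolerance_ms: float = 10.0) -> bool:
        """Promote the canary only if it is no worse than the baseline,
        within explicit, written-down tolerances."""
        if canary["error_rate"] > baseline["error_rate"] + err_tolerance:
            return False
        if canary["p99_ms"] > baseline["p99_ms"] + latency_tolerance_ms:
            return False
        return True

    baseline = {"error_rate": 0.001, "p99_ms": 250.0}
    # slightly noisy but healthy canary: promote
    assert should_promote({"error_rate": 0.002, "p99_ms": 255.0}, baseline)
    # error spike: do not promote, roll it back
    assert not should_promote({"error_rate": 0.030, "p99_ms": 250.0}, baseline)
    ```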

  41. Why do people sink so much time into staging,
    when they can’t even tell if their own
    production environment is healthy or not?

  42. You have an observable system
    when your team can quickly and reliably track
    down any new problem in real time.
    monitoring => observability

  43. Observability
    “In control theory, observability is a measure of how well internal
    states of a system can be inferred from knowledge of its external
    outputs. The observability and controllability of a system are
    mathematical duals.” — Wikipedia
    … translate??!?

  44. Observability
    Can you understand what’s happening inside your code and systems,
    simply by asking questions using your tools? Can you answer any
    new question you think of, or only the ones you prepared for?
    Having to ship new code every time you want to ask a new question …
    SUCKS.

  45. Monitoring
    (LAMP stack)
    “Photos are loading slowly for some people. Why?”
    The app tier capacity is exceeded. Maybe we
    rolled out a build with a perf regression, or
    maybe some app instances are down.
    DB queries are slower than normal. Maybe
    we deployed a bad new query, or there is lock
    contention.
    Errors or latency are high. We will look at
    several dashboards that reflect common root
    causes, and one of them will show us why.
    monitor these things

  46. Monitoring
    (LAMP stack)
    Characteristics
    • Known-unknowns predominate
    • Intuition-friendly
    • Dashboards are valuable.
    • Monolithic app, single data source.
    • The health of the system more or less accurately
    represents the experience of the individual users.

  47. Best Practices
    Monitoring
    • Lots of actionable active checks and alerts
    • Proactively notify engineers of failures and warnings
    • Maintain a runbook for stable production systems
    • Rely on clusters and clumps of tightly coupled
    systems all breaking at once

  48. Monitoring?!?
    (microservices)
    “Photos are loading slowly for some people. Why?”
    wtf do i ‘monitor’ for?!
    Any microservice running on c2.4xlarge
    instances and PIOPS storage in us-east-1b has a
    1/20 chance of running on degraded hardware,
    and will take 20x longer to complete for requests
    that hit the disk with a blocking call. This
    disproportionately impacts people looking at
    older archives due to our fanout model.
    Canadian users who are using the French
    language pack on the iPad running iOS 9 are
    hitting a firmware condition which makes it fail
    saving to local cache … which is why it FEELS
    like photos are loading slowly.
    Our newest SDK makes db queries
    sequentially if the developer has enabled an
    optional feature flag. Working as intended;
    the reporters all had debug mode enabled.
    But the flag should be renamed for clarity’s sake.

  49. These are all unknown-unknowns
    that may never have happened before, and may never happen again
    (They are also the overwhelming majority of what you have
    to care about for the rest of your life.)

  50. Observability
    (microservices/complex systems)
    Characteristics
    • Unknown-unknowns are most of the problems
    • “Many” components and storage systems
    • You cannot model the entire system in your head.
    Dashboards may be actively misleading.
    • The hardest problem is often identifying which
    component(s) to debug or trace.
    • The health of the system is irrelevant. The health of
    each individual request is of supreme consequence.

  51. Observability
    (microservices/complex systems)
    Best Practices
    • Rich instrumentation.
    • Events, not metrics.
    • Sampling, not write-time aggregation.
    • Few (if any) dashboards.
    • Test in production … a lot.
    • Very few paging alerts.

  52. Why:
    Instrumentation?
    Events, not metrics?
    No dashboards?
    Sampling, not time series aggregation?
    Test in production?
    Fewer alerts?

  53. 8 commandments for a Glorious Future™
    well-instrumented
    high cardinality
    high dimensionality
    event-driven
    structured
    well-owned
    sampled
    tested in prod.

  54. Instrumentation?
    Start at the edge and work down
    Internal state from software you didn’t write, too
    Wrap every network call, every data call
    Structured data only
    `gem install` magic will only get you so far
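    "Wrap every network call, every data call" with structured output can be sketched as below. The `instrumented` wrapper and its field names are hypothetical, not any particular library's API: time the call and emit one structured event carrying whatever context the caller has.

    ```python
    import json
    import time

    def instrumented(name, fn, emit=print, **context):
        """Run fn(), time it, and emit ONE wide structured event.

        `emit` is stdout here; in production it would hand the event
        to your event pipeline instead.
        """
        event = {"name": name, **context}
        start = time.monotonic()
        try:
            result = fn()
            event["outcome"] = "ok"
            return result
        except Exception as exc:
            event["outcome"] = "error"
            event["error"] = repr(exc)
            raise
        finally:
            # the event is emitted whether the call succeeded or blew up
            event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
            emit(json.dumps(event))

    # wrap a "data call" and keep its return value
    events = []
    value = instrumented("db_query", lambda: 42, emit=events.append,
                         user_id="u1", build_id="abc123")
    assert value == 42
    assert json.loads(events[0])["outcome"] == "ok"
    ```

    Structured (key/value) output is the point: a JSON event can be queried by any field later, where a free-text log line cannot.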

  55. Events, not metrics?
    (trick question … you’ll need both,
    but you’ll rely on events more and more)
    Cardinality
    Context
    Structured data

  56. Metrics
    (cardinality)
    UUIDs
    db raw queries
    normalized queries
    comments
    firstname, lastname
    PID/PPID
    app ID
    device ID
    HTTP header type
    build ID
    IP:port
    shopping cart ID
    userid
    ... etc
    Some of these …
    might be …
    useful …
    YA THINK??!
    High cardinality will save your ass.

  57. You must be able to break down by 1/millions and
    THEN by anything/everything else
    High cardinality is not a nice-to-have
    ‘Platform problems’ are now everybody’s problems
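    A toy illustration of "break down by 1/millions, THEN by anything else" (the events and field values are invented): with raw, wide events you can group by a high-cardinality field like user id first, then slice again by any other dimension.

    ```python
    from collections import Counter

    # one wide, structured event per request (values invented)
    events = [
        {"user_id": "u1", "az": "us-east-1b", "instance": "c2.4xlarge", "ms": 20},
        {"user_id": "u2", "az": "us-east-1b", "instance": "c2.4xlarge", "ms": 400},
        {"user_id": "u2", "az": "us-west-2a", "instance": "m5.large",   "ms": 21},
    ]

    # break down by a high-cardinality field (one user among millions) ...
    slow_by_user = Counter(e["user_id"] for e in events if e["ms"] > 100)
    # ... and THEN by anything else
    slow_by_user_az = Counter((e["user_id"], e["az"]) for e in events if e["ms"] > 100)

    assert slow_by_user == Counter({"u2": 1})
    assert slow_by_user_az == Counter({("u2", "us-east-1b"): 1})
    ```

    Pre-aggregated metrics cannot do this, because a `user_id` tag with millions of values would explode the time-series count; raw events make it just another group-by.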

  58. Events tell stories.
    Arbitrarily wide events mean you can amass more and more context
    over time. Use sampling to control costs and bandwidth.
    Structure your data at the source to reap
    massive efficiencies over strings.
    Events
    (“Logs” are just a transport mechanism for events)
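    A minimal sketch of "use sampling to control costs": keep roughly 1 in N events, stamp each kept event with its sample rate, and weight counts back up at read time. The uniform per-event coin flip here is an assumption for brevity; real systems often sample dynamically per key (errors at 1:1, healthy requests at 1:1000).

    ```python
    import random

    def sample(events, rate):
        """Keep ~1 in `rate` events, stamping each kept event with its
        sample_rate so read-time math can weight it back up."""
        return [{**e, "sample_rate": rate}
                for e in events if random.randrange(rate) == 0]

    def estimated_count(kept):
        # each kept event stands in for `sample_rate` real events
        return sum(e["sample_rate"] for e in kept)

    random.seed(1)
    raw = [{"i": i} for i in range(10_000)]
    kept = sample(raw, 10)
    # we stored ~10% of the data but can still estimate the true count
    assert abs(estimated_count(kept) - 10_000) < 2_000
    ```

    Unlike write-time aggregation, every kept event is still a full raw event, so new questions can still be asked of it later.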

  59. Dashboards

  60. Raw
    Fast
    Iterative
    Interactive
    Exploratory

  61. Dashboard
    overuse
    must die
    Unknown-unknowns
    demand explorability
    and an open mind.

  62. sampling, not aggregation
    Raw requests:

  63. Aggregation is a one-way trip
    Destroying raw events eliminates your ability to ask new questions.
    Forever.
    Aggregates are the devil
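    A tiny worked example of the one-way trip (the latency numbers are invented): once you have stored only the mean, "which requests were slow?" is unanswerable; with the raw events it is trivial.

    ```python
    latencies = [20, 21, 19, 22, 20, 21, 2000, 20, 19, 21]  # raw per-request ms

    # write-time aggregation keeps only a summary ...
    mean_ms = sum(latencies) / len(latencies)
    assert abs(mean_ms - 218.3) < 0.01  # one number; the 2000 ms outlier is invisible

    # ... while the raw events can still answer a question
    # you only thought of later:
    slow = [ms for ms in latencies if ms > 1000]
    assert slow == [2000]
    ```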

  64. Aggregates destroy your precious details.
    You need MORE detail and MORE context.
    Aggregates

  65. availability => resiliency
    Shrink the critical path
    Automate remediation
    Invest in canaries
    Build exploratory, open-ended introspection
    Observability > *

  66. aggregation => sampling

  67. … all your communication
    @mranney, Uber
    “With microservices, you cleverly swap out your
    technical problems for political problems.”

  68. Deploys

    On-Call

    Pull requests, arch reviews

    Observability

    Code is communication.

  69. Deploys

  70. Deploys must be:
    • Fast. Rolling. Roll-back-able.

    • Reliable. Breaks rarely.

    • Visible. Draws a tagged vertical line in graphs.

    • *Anyone* should be able to invoke a deploy

    • For bonus points: canarying or automated

  71. Revisit these tools regularly.
    Make them part of every post mortem.

  72. (image-only slide)

  73. (what the actual fuck? do it anyway.)

  74. Most outages are triggered by “events”
    from humans. Draw a line.

  75. (image-only slide)

  76. … your entire org structure
    @mranney, Uber
    “With microservices, you cleverly swap out your
    technical problems for political problems.”

  77. (image-only slide)

  78. embrace the chaos
    seek resiliency

  79. Conway’s
    “Law”

  80. Conway’s Law, post-Jobs

  81. “Conway’s Law” is not a law

  82. Hard things are hard.
    don’t do them if you don’t have to!

  83. Microservices are about changes.

  84. seek feedback
    move forward <3
    change is the only constant

  85. Choose the problems you are not
    going to solve, or they will choose you.

  86. Yes, but …
    Yes, microservices help you drift a little bit and innovate independently …
    BUT not as much as you might think.
    You all still share a fabric, after all.
    State is still gonna ruin your party (and so are IPC, service discovery,
    caching, CD pipelines, databases, etc.)

  87. References:
    Conway’s Law
    Swap tech problems for political
    Multiple repos
    http://blog.christianposta.com/microservices/youre-not-going-to-do-microservices/
    Terrific talks by @aspyker, @adrianco, @samnewman, @martinfowler, @mattranney, etc:
    https://medium.facilelogin.com/ten-talks-on-microservices-you-cannot-miss-at-any-cost-7bbe5ab7f43f#.qqzeqpw2l
    https://www.infoq.com/presentations/7-sins-microservices

  88. Charity Majors
    @mipsytipsy
