Keep calm and carry on: scaling your org with microservices

From DDD Exchange in London, April 2018


Charity Majors

April 27, 2018


  1. Keep Calm and Carry On: Scaling Your Org With Microservices

    Charity Majors, @mipsytipsy
  3. @mipsytipsy engineer/cofounder/CEO “the only good diff is a red diff”

  4. @mipsytipsy engineer, cofounder, CEO

  5. What even is a microservice (No one knows)

  6. What are microservices?

    • Monorepo (sometimes)
    • Independently deployable, small modular services
    • Decentralized governance
    • Small teams, up to maybe a dozen people
    • Operating independently, interacting with other teams via APIs
  7. Naming a thing has power: convention, design patterns, community, best practices vetted in multiple environments.

    “Hey, haven’t we been doing this forever?”
  8. microservices (n): a real-world application of distributed systems engineering principles to software architecture
  9. Welcome to distributed systems. it’s probably fine. (it might be

  10. “Dear Twitter …”

  11. Architectural complexity: Parse, 2015 vs. LAMP stack, 2005

  12. monitoring => observability

    known unknowns => unknown unknowns

  13. Many catastrophic states exist at any given time. Your system is never entirely ‘up’.
  14. Welcome to distributed systems. Everything fails. All the time.

  15. “Complexity is increasing” - Science

  16. Monolith Microservices

  17. You need:

    • a new mindset
    • new habits
    • new tools
    • a sense of humor
    • forgiveness :)

    Remember … you are still an early adopter!
  19. What changes?

    • devs & ops => software owners
    • monitoring => observability
    • staging => test in prod
    • availability => resiliency
    • aggregation => sampling
    • … all your communication
    • … your entire org structure
  20. devs & ops => software owners

    Software needs owners. Not operators, not developers. Owners have impact on the full lifecycle of their software: build, fix, listen, patch, commit, deploy, revert, rollback, instrument, understand, anticipate, verify, validate.
  21. None
  22. and from a DBA at a different company … …

  23. None
  24. The most powerful weapon in your arsenal is always cause and effect. Engineers should be on call for their own services.
  25. Corollary: on-call must not be hell.

    • Guard your people’s time and sleep
    • No hero complexes. No martyrs.
    • Don’t over-page. Align engineering pain with customer pain
    • Roll up non-urgent alerts for daytime hours
    • Your most valuable paging alerts are end-to-end checks on critical code paths.
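The paging rules on this slide can be read as a small routing policy. A minimal sketch, assuming hypothetical names (`Alert`, `route`) and an arbitrary 09:00 to 18:00 daytime window, none of which come from the talk:

```python
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical daytime window; pick whatever matches your team's working hours.
DAY_START = time(9, 0)
DAY_END = time(18, 0)


@dataclass(frozen=True)
class Alert:
    name: str
    customer_facing: bool  # True when the alert reflects real customer pain


def route(alert: Alert, now: datetime) -> str:
    """Return "page", "notify", or "hold" for an alert firing at `now`."""
    if alert.customer_facing:
        return "page"  # wake someone up: customers are hurting
    if DAY_START <= now.time() < DAY_END:
        return "notify"  # non-urgent, but people are awake to look at it
    return "hold"  # roll it up and deliver it when the day starts


checkout = Alert("end-to-end checkout check failing", customer_facing=True)
disk = Alert("disk 70% full on one replica", customer_facing=False)
print(route(checkout, datetime(2018, 4, 27, 3, 0)))
print(route(disk, datetime(2018, 4, 27, 3, 0)))
```

The only thing that can page at 3am here is an alert tied to customer pain; everything else waits for daylight.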
  26. Probe every software engineering candidate for their ops experience & attitude. … yep, even FE/mobile devs!
  27. You are signaling … “Operations is valued here.”

  28. Operations engineering is about making systems maintainable, reliable, and comprehensible.

    Senior software engineers should be reasonably good at these things. So if they are not, don’t promote them.
  29. staging => test in prod

  30. None
  31. Distributed systems are particularly hostile to being cloned or imitated. (clients, concurrency, chaotic traffic patterns, edge cases …)

    These systems have an infinitely long list of almost-impossible failure scenarios that make staging copies particularly worthless. This is a black hole for engineering time.
  32. test before prod:

    • unit tests
    • integration tests
    • functional tests
    • basic failover

    … the basics. the simple stuff.
  33. test in prod:

    • behavioral tests
    • experiments
    • load tests (!!)
    • edge cases
    • canaries
    • rolling deploys
    • multi-region

    … where shit gets real.
  34. That energy is better used elsewhere: Production. You can catch 80% of the bugs with 20% of the effort. And you should.

    @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
  35. also build or use:

    • feature flags (LaunchDarkly)
    • high cardinality tooling (Honeycomb)
    • gate your releases ()
    • canaries, shadow systems (goturbine, linkerd)
    • capture/replay for databases (apiary, percona)

    jk, don’t build your own
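To make "feature flags" and "gate your releases" concrete, here is a minimal sketch of a percentage rollout. It is not the API of LaunchDarkly or any other product; `bucket`, `is_enabled`, and the flag name are hypothetical, and in practice you would use an existing flag service, as the slide says.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               allowlist=frozenset()) -> bool:
    """Gate a code path: allowlisted users (e.g. employees) first, then a percentage."""
    if user_id in allowlist:
        return True
    return bucket(flag, user_id) < rollout_percent


# Ship the code dark, then widen the gate without another deploy.
for percent in (0, 10, 100):
    print(percent, is_enabled("new-checkout", "user-1234", percent))
```

Hashing the flag name together with the user ID keeps each user's answer stable as the percentage grows, so a user who has the feature at 10% still has it at 50%.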
  36. Failure is not rare. Practice shipping and fixing lots of small problems. And practice on your users!!
  37. Does everyone …

    • know what normal looks like?
    • know how to deploy?
    • know how to roll back?
    • know how to canary?
    • know how to debug in production?

    Practice!!
  38. None
  39. Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)
  40. Your allies:

    1. Canarying. Automated canarying. Promotion of canaries.
    2. Making deploys more automated and robust
    3. Making the fastest path the correctest/safest path
    4. Limiting the critical path. Limiting the blast radius.
    5. Shipping features behind feature flags
    6. Making rollbacks just another boring deploy
    7. Instrumentation. Good defaults. Test on employees.

    These are *always* a good use of your time. (Staging is *sometimes* a good use of your time)
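"Automated canarying" and "promotion of canaries" boil down to comparing the canary against the rest of the fleet and acting on the result. A rough sketch under stated assumptions: the names (`Stats`, `canary_verdict`) and the thresholds are invented for illustration, and a real check would look at latency and more than one signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Stats, canary: Stats, min_requests: int = 500,
                   max_ratio: float = 1.5, floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary build."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to say anything
    # Tolerate up to max_ratio times the baseline error rate, with a small
    # floor so a near-zero baseline does not fail every canary.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


fleet = Stats(requests=10_000, errors=10)
print(canary_verdict(fleet, Stats(requests=100, errors=0)))
print(canary_verdict(fleet, Stats(requests=1_000, errors=1)))
print(canary_verdict(fleet, Stats(requests=1_000, errors=20)))
```

Because the verdict is just a string, the "rollback" branch can trigger the same deploy tooling as any other release, which keeps rollbacks boring.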
  41. Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
  42. monitoring => observability

    You have an observable system when your team can quickly and reliably track down any new problem in real time.
  43. Observability

    “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” (Wikipedia)

    … translate??!?
  44. Observability

    Can you understand what’s happening inside your code and systems, simply by asking questions using your tools? Can you answer any new question you think of, or only the ones you prepared for?

    Having to ship new code every time you want to ask a new question … SUCKS.
  45. Monitoring (LAMP stack)

    “Photos are loading slowly for some people. Why?”

    • The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.
    • DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention.
    • Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why.

    monitor these things
  46. Monitoring (LAMP stack): Characteristics

    • Known-unknowns predominate
    • Intuition-friendly
    • Dashboards are valuable.
    • Monolithic app, single data source.
    • The health of the system more or less accurately represents the experience of the individual users.
  47. Monitoring: Best Practices

    • Lots of actionable active checks and alerts
    • Proactively notify engineers of failures and warnings
    • Maintain a runbook for stable production systems
    • Rely on clusters and clumps of tightly coupled systems all breaking at once
  48. Monitoring?!? (microservices)

    “Photos are loading slowly for some people. Why?”

    • Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model.
    • Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly.
    • Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity’s sake.

    wtf do i ‘monitor’ for?!
  49. These are all unknown-unknowns that may have never happened before, or ever happen again. (They are also the overwhelming majority of what you have to care about for the rest of your life.)
  50. Observability (microservices/complex systems): Characteristics

    • Unknown-unknowns are most of the problems
    • “Many” components and storage systems
    • You cannot model the entire system in your head. Dashboards may be actively misleading.
    • The hardest problem is often identifying which component(s) to debug or trace.
    • The health of the system is irrelevant. The health of each individual request is of supreme consequence.
  51. Observability (microservices/complex systems): Best Practices

    • Rich instrumentation.
    • Events, not metrics.
    • Sampling, not write-time aggregation.
    • Few (if any) dashboards.
    • Test in production … a lot.
    • Very few paging alerts.
  52. Why:

    • Instrumentation?
    • Events, not metrics?
    • No dashboards?
    • Sampling, not time series aggregation?
    • Test in production?
    • Fewer alerts?
  53. 8 commandments for a Glorious Future™

    • well-instrumented
    • high cardinality
    • high dimensionality
    • event-driven
    • structured
    • well-owned
    • sampled
    • tested in prod.
  54. Instrumentation?

    • Start at the edge and work down
    • Internal state from software you didn’t write, too
    • Wrap every network call, every data call
    • Structured data only
    • `gem install` magic will only get you so far
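"Wrap every network call, every data call" with "structured data only" could look something like the sketch below. The `instrumented` helper and its field names are hypothetical, not from any particular library; the point is one structured event per call, carrying whatever context you have.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def instrumented(name, context, sink=print):
    """Time the wrapped call and emit exactly one structured (JSON) event."""
    fields = {**context, "name": name}
    start = time.monotonic()
    try:
        yield fields  # the caller can attach more context while the call runs
        fields["ok"] = True
    except Exception as exc:
        fields["ok"] = False
        fields["error"] = type(exc).__name__
        raise
    finally:
        fields["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        sink(json.dumps(fields, sort_keys=True))


# Usage: wrap a (pretend) database call and add fields as you learn them.
with instrumented("db.query", {"user_id": "u42", "build_id": "b17"}) as event:
    rows = [1, 2, 3]  # stand-in for the real query
    event["rows"] = len(rows)
```

The event is emitted whether the call succeeds or raises, so failures carry the same context as successes.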
  55. Events, not metrics?

    (trick question … you’ll need both, but you’ll rely on events more and more)

    • Cardinality
    • Context
    • Structured data
  56. Metrics (cardinality)

    UUIDs, db raw queries, normalized queries, comments, firstname, lastname, PID/PPID, app ID, device ID, HTTP header type, build ID, IP:port, shopping cart ID, userid ... etc

    Some of these … might be … useful … YA THINK??! High cardinality will save your ass.
  57. You must be able to break down by 1/millions and THEN by anything/everything else.

    High cardinality is not a nice-to-have.

    ‘Platform problems’ are now everybody’s problems.
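Breaking down "by 1/millions and THEN by anything else" only works if the raw events still carry the high-cardinality fields. A toy sketch, assuming events are plain dicts and using an invented `break_down` helper:

```python
from collections import Counter


def break_down(events, where=None, by=()):
    """Filter events on exact field matches, then count by the `by` fields."""
    where = where or {}
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in where.items()):
            counts[tuple(event.get(key) for key in by)] += 1
    return counts


events = [
    {"user_id": "u42", "build_id": "b1", "status": 500},
    {"user_id": "u42", "build_id": "b1", "status": 500},
    {"user_id": "u42", "build_id": "b2", "status": 200},
    {"user_id": "u7", "build_id": "b1", "status": 200},
]

# One user out of millions first, then any other dimension you like.
result = break_down(events, where={"user_id": "u42"}, by=("build_id", "status"))
print(result)
```

A pre-aggregated metric keyed only on `status` could not answer this question, because `user_id` and `build_id` would already be gone.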
  58. Events

    Events tell stories. Arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth. Structure your data at the source to reap massive efficiencies over strings.

    (“Logs” are just a transport mechanism for events)
  59. Dashboards

  60. Raw Fast Iterative Interactive Exploratory

  61. Dashboard overuse must die. Unknown-unknowns demand explorability and an open

  62. sampling, not aggregation Raw requests:

  63. Aggregates are the devil

    Aggregation is a one-way trip. Destroying raw events eliminates your ability to ask new questions. Forever.
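One way to see why sampling is the gentler trade-off: a sampled event keeps every field, plus a note of how many events it stands in for. A minimal sketch, assuming a made-up policy (keep every error, keep roughly 1 in 20 successes) and hypothetical helper names:

```python
import random


def rate_for(event):
    """Hypothetical policy: keep all errors, about 1 in 20 of everything else."""
    return 1 if event["status"] >= 500 else 20


def sample(events, rate_for, rng):
    """Keep each event with probability 1/rate and record the rate on it."""
    kept = []
    for event in events:
        rate = rate_for(event)
        if rng.random() < 1.0 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_count(kept, predicate=lambda event: True):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(event["sample_rate"] for event in kept if predicate(event))


rng = random.Random(0)  # seeded so the example is repeatable
events = [{"status": 200, "user_id": i} for i in range(10_000)]
events += [{"status": 500, "user_id": i} for i in range(50)]
kept = sample(events, rate_for, rng)

print(len(kept), "events stored out of", len(events))
print("estimated total:", estimated_count(kept))
print("errors:", estimated_count(kept, lambda e: e["status"] == 500))
```

The stored events are still raw, so a question nobody anticipated (say, a breakdown by `user_id`) can still be asked later; a per-minute count of status codes could not answer it.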
  64. Aggregates

    Aggregates destroy your precious details. You need MORE detail and MORE context.
  65. availability => resiliency

    • Shrink the critical path
    • Automate remediation
    • Invest in canaries
    • Build exploratory, open-ended introspection

    Observability > *
  66. aggregation => sampling

    Software needs owners. Not operators, not developers. Owners have impact on the full lifecycle of their software: build, fix, listen, patch, commit, deploy, revert, rollback, instrument, understand, anticipate, verify, validate.
  67. … all your communication

    “With microservices, you cleverly swap out your technical problems for political problems.” (@mranney, Uber)
  68. Deploys, on-call, pull requests, arch reviews, observability. Code is communication.

  69. Deploys

  70. Deploys must be:

    • Fast. Rolling. Roll-back-able.
    • Reliable. Breaks rarely.
    • Draws a tagged vertical line in graphs.
    • *Anyone* should be able to invoke deploy
    • For bonus points: canarying or automated
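The "tagged vertical line in graphs" is just a timestamped, labeled deploy event that tooling can overlay on charts and query later. A small sketch, assuming in-memory storage and invented names (`DeployMarker`, `DeployLog`); a real setup would send the marker to whatever graphing or observability tool you use.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeployMarker:
    ts: float      # Unix timestamp of the deploy
    build_id: str  # the tag drawn next to the vertical line
    who: str       # anyone can deploy, so record who did


class DeployLog:
    """Keeps deploy markers sorted by time."""

    def __init__(self) -> None:
        self._ts: List[float] = []
        self._markers: List[DeployMarker] = []

    def record(self, marker: DeployMarker) -> None:
        i = bisect.bisect_right(self._ts, marker.ts)
        self._ts.insert(i, marker.ts)
        self._markers.insert(i, marker)

    def last_before(self, ts: float) -> Optional[DeployMarker]:
        """Most recent deploy at or before `ts`, or None if there is none."""
        i = bisect.bisect_right(self._ts, ts)
        return self._markers[i - 1] if i else None


log = DeployLog()
log.record(DeployMarker(1000.0, "build-41", "alice"))
log.record(DeployMarker(2000.0, "build-42", "bob"))

# Errors started climbing at t=2030: which deploy is the first suspect?
suspect = log.last_before(2030.0)
print(suspect)
```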
  71. Revisit these tools regularly, as part of every post mortem.

  72. None
  73. (what the actual fuck? do it anyway.)

  74. most outages are triggered by “events”, from humans. draw a

  75. None
  76. … your entire org structure

    “With microservices, you cleverly swap out your technical problems for political problems.” (@mranney, Uber)
  77. None
  78. embrace the chaos, seek resiliency

  79. Conway’s “Law”

  80. Conway’s Law, post-Jobs

  81. “Conway’s Law” is not a law

  82. Hard things are hard. Don’t do them if you don’t have to!
  83. Microservices are about changes.

  84. seek feedback, move forward <3 change is the only constant

  85. Choose the problems you are not going to solve, or they will choose you.
  86. Yes but …

    Yes, microservices help you drift a little bit and innovate independently … BUT, not as much as you might think. You all still share a fabric, after all. Stateful is still gonna ruin your party. (and IPC, sec, discovery, caching, CD pipelines, databases, etc.)
  87. References:

    • Conway’s Law
    • Swap tech problems for political
    • Multiple repos
    • http://blog.christianposta.com/microservices/youre-not-going-to-do-microservices/
    • Terrific talks by @aspyker, @adrianco, @samnewman, @martinfowler, @mattranney, etc:
      https://medium.facilelogin.com/ten-talks-on-microservices-you-cannot-miss-at-any-cost-7bbe5ab7f43f#.qqzeqpw2l
      https://www.infoq.com/presentations/7-sins-microservices
  88. Charity Majors @mipsytipsy