RebelCon 2019: Mature microservices and how to operate them

At the Financial Times, we built our first microservices in 2013. We like a microservices-based approach, because by breaking up the system into lots of independently deployable services - making releases small, quick and reversible - we can deliver more value, more quickly, to our customers and we can run hundreds of experiments a year.

This approach has had a big - and positive - impact on our culture. However, it is much more challenging to operate.

So how do we go about building stable, resilient systems from microservices? And how do we make sure we can fix any problems as quickly as possible?

I'll talk about building necessary operational capabilities in from the start: how monitoring can help you work out when something has gone wrong and how observability tools like log aggregation, tracing and metrics can help you fix it as quickly as possible.
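
As a flavour of what that looks like in practice, here is a minimal sketch - illustrative only, not FT code - of the structured, correlated logging that makes log aggregation and tracing useful: every request carries a transaction ID that is included in each log line, so one user action can be followed across many microservices. The header name, field names and port are assumptions for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// logJSON writes one structured log line; aggregated logs can then be
// searched by transaction_id across every service that handled the request.
func logJSON(fields map[string]any) {
	_ = json.NewEncoder(os.Stdout).Encode(fields)
}

// withTransactionID is illustrative middleware: it reuses an incoming
// X-Request-Id header (an assumed header name) or generates a new ID,
// then logs the request with that ID and its duration.
func withTransactionID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tid := r.Header.Get("X-Request-Id")
		if tid == "" {
			b := make([]byte, 8)
			rand.Read(b)
			tid = hex.EncodeToString(b)
		}
		start := time.Now()
		next.ServeHTTP(w, r)
		logJSON(map[string]any{
			"event":          "request_handled",
			"transaction_id": tid,
			"path":           r.URL.Path,
			"duration_ms":    time.Since(start).Milliseconds(),
		})
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", withTransactionID(mux)))
}
```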

We've also now been building microservice architectures for long enough to start to hit a whole new set of problems. Projects finish and teams move on to another part of the system, or maybe an entirely new system. So how do we reduce the risk of big issues happening once the team gets smaller and there are services that no one on the team has ever touched?

The next legacy systems are going to be microservices, not monoliths, and you need to be working now to prevent that causing a lot of pain in the future.

Sarah Wells

June 20, 2019

Transcript

  1. Mature microservices and how to operate them Sarah Wells, Technical Director for Operations & Reliability, The Financial Times @sarahjwells
  2. None
  3. @sarahjwells https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

  4. @sarahjwells https://www.ft.com/companies

  5. @sarahjwells Problem: we’d set up a redirect to a page which didn’t exist
  6. None
  7. None
  8. @sarahjwells Using the right tool for the job is great - until you need to work out how *this* database is backed up
  9. None
  10. None
  11. None
  12. @sarahjwells Microservices are more complicated to operate and maintain

  13. @sarahjwells Why bother?

  14. None
  15. None
  16. @sarahjwells “Experiment” for most organizations really means “try” - Linda Rising, Experiments: the Good, the Bad and the Beautiful
  17. Overlap tests by componentising the barrier

  18. @sarahjwells Releasing changes frequently doesn’t just ‘happen’

  19. @sarahjwells Done right, microservices enable this

  20. @sarahjwells What happens when teams move on to new projects?

  21. @sarahjwells Your next legacy system will be microservices, not a monolith
  22. @sarahjwells Optimising for speed Operating microservices When people move on

  23. @sarahjwells Optimising for speed

  24. None
  25. Measure / High performers: Delivery lead time (data from Accelerate: Forsgren, Humble, Kim)
  26. Measure / High performers: Delivery lead time - Less than one hour (data from Accelerate: Forsgren, Humble, Kim)
  27. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency (data from Accelerate: Forsgren, Humble, Kim)
  28. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency - On demand (data from Accelerate: Forsgren, Humble, Kim)
  29. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency - On demand; Time to restore service (data from Accelerate: Forsgren, Humble, Kim)
  30. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency - On demand; Time to restore service - Less than one hour (data from Accelerate: Forsgren, Humble, Kim)
  31. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency - On demand; Time to restore service - Less than one hour; Change fail rate (data from Accelerate: Forsgren, Humble, Kim)
  32. Measure / High performers: Delivery lead time - Less than one hour; Deployment frequency - On demand; Time to restore service - Less than one hour; Change fail rate - 0-15% (data from Accelerate: Forsgren, Humble, Kim)
  33. @sarahjwells High performing organisations release changes frequently

  34. @sarahjwells Continuous delivery is the foundation

  35. “If it hurts, do it more frequently, and bring the pain forward.”
  36. @sarahjwells Our old build and deployment process was very manual…

  37. None
  38. @sarahjwells You can’t experiment when you do 12 releases a year
  39. @sarahjwells What does continuous delivery involve?

  40. @sarahjwells 1. An automated build and release pipeline

  41. @sarahjwells 2. Automated testing, integrated into the pipeline

  42. @sarahjwells 3. Continuous integration

  43. @sarahjwells If you aren’t releasing multiple times a day, consider what is stopping you
  44. @sarahjwells You’ll probably have to change the way you architect things
  45. @sarahjwells Zero downtime deployments: - sequential deployments - schemaless databases
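
A minimal sketch of what the sequential deployments on this slide might look like, assuming hypothetical instance addresses and a health endpoint: deploy to one instance at a time and only move on once that instance reports healthy, so the service never drops to zero capacity.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// instances and the /__health path are illustrative assumptions, not the
// FT's actual deployment tooling.
var instances = []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}

func healthy(host string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s/__health", host))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// deployTo stands in for whatever actually ships the new version to a host.
func deployTo(host, version string) error {
	fmt.Printf("deploying %s to %s\n", version, host)
	return nil
}

func main() {
	version := "1.2.3"
	for _, host := range instances {
		if err := deployTo(host, version); err != nil {
			fmt.Printf("stopping rollout: %v\n", err)
			return
		}
		// Wait (up to a minute) for this instance to report healthy before
		// touching the next one, so there is always capacity serving traffic.
		ok := false
		for i := 0; i < 30; i++ {
			if healthy(host) {
				ok = true
				break
			}
			time.Sleep(2 * time.Second)
		}
		if !ok {
			fmt.Printf("%s never became healthy, stopping rollout\n", host)
			return
		}
	}
	fmt.Println("rollout complete")
}
```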

  46. @sarahjwells You need to be able to test and deploy your changes independently
  47. @sarahjwells You need systems - and teams - to be loosely coupled
  48. @sarahjwells Done right, microservices are loosely coupled

  49. @sarahjwells Processes also have to change

  50. @sarahjwells Often there is ‘process theatre’ around things and this can safely be removed
  51. @sarahjwells Change approval boards don’t reduce the chance of failure

  52. @sarahjwells Filling out a form for each change takes too long
  53. @sarahjwells How often do we release code at the FT?

  54. Content platform releases, 2017

  55. Content platform releases, 2014

  56. @sarahjwells Releasing 250 times as often

  57. @sarahjwells Changes are small, easy to understand, independent and reversible

  58. <1% failure rate ~16% failure rate

  59. @sarahjwells Optimising for speed Operating microservices

  60. None
  61. @sarahjwells There are patterns and approaches that help

  62. @sarahjwells Devops is essential for success

  63. @sarahjwells The team that builds the system *has* to operate it too
  64. @sarahjwells You can’t hand things off to another team when they change multiple times a day
  65. @sarahjwells High performing teams get to make their own decisions about tools and technology
  66. @sarahjwells Delegating tool choice to teams makes it hard for central teams to support everything
  67. @sarahjwells Make it someone else’s problem

  68. https://medium.com/wardleymaps

  69. @sarahjwells Buy rather than build, unless it’s critical to your business
  70. @sarahjwells Work out what level of risk you’re comfortable with

  71. @sarahjwells “We’re not a hospital or a power station”

  72. @sarahjwells We value releasing often so we can experiment frequently

  73. @sarahjwells Accept that you will generally be in a state of ‘grey failure’
  74. None
  75. @sarahjwells Retry on failure: - backoff before retrying - give up if it’s taking too long
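
A minimal sketch of the retry pattern this slide describes, with exponential backoff between attempts and an overall deadline after which the caller gives up; the dependency call is a placeholder, not code from the talk.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callDependency stands in for any call to another service; it is a
// placeholder that always fails, just to exercise the retry loop.
func callDependency(ctx context.Context) error {
	return errors.New("temporarily unavailable")
}

// retryWithBackoff retries the call, doubling the wait between attempts,
// and gives up entirely once the context deadline has passed.
func retryWithBackoff(ctx context.Context) error {
	backoff := 100 * time.Millisecond
	for {
		if err := callDependency(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Taking too long: stop retrying and surface the failure.
			return fmt.Errorf("giving up: %w", ctx.Err())
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fmt.Println(retryWithBackoff(ctx))
}
```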
  76. @sarahjwells Mitigate now, fix tomorrow

  77. @sarahjwells How do you know something’s wrong?

  78. @sarahjwells Concentrate on the business capabilities

  79. @sarahjwells Synthetic monitoring
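
A minimal sketch of a synthetic check, assuming a placeholder URL and thresholds: it exercises the system on a schedule and alerts on failure, so problems show up even when no real user is doing anything.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// checkURL and the thresholds are illustrative assumptions; a real synthetic
// check would exercise a whole business capability (for example, publishing
// a test piece of content and verifying it appears), not just fetch one page.
const checkURL = "https://example.com/"

func runCheck() {
	client := http.Client{Timeout: 5 * time.Second}
	start := time.Now()
	resp, err := client.Get(checkURL)
	if err != nil {
		log.Printf("ALERT synthetic check failed: %v", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Printf("ALERT synthetic check got status %d", resp.StatusCode)
		return
	}
	log.Printf("synthetic check ok in %v", time.Since(start))
}

func main() {
	// Run the same check continuously, even when no user is active.
	for range time.Tick(time.Minute) {
		runCheck()
	}
}
```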

  80. None
  81. None
  82. None
  83. None
  84. @sarahjwells Also helps us know things are broken even if no user is currently doing anything
  85. @sarahjwells Make sure you know whether *real* things are working in production
  86. @sarahjwells Our editorial team is inventive

  87. @sarahjwells What does it mean for a publish to be ‘successful’?
  88. None
  89. None
  90. None
  91. None
  92. @sarahjwells Build observability into your system

  93. @sarahjwells Observability: can you infer what’s going on in the system by looking at its external outputs?
  94. @sarahjwells Log aggregation

  95. None
  96. @sarahjwells Metrics

  97. @sarahjwells Keep it simple: - request rate - error rate - duration
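
A minimal sketch of capturing those three signals - request rate, error rate and duration - in HTTP middleware using only the Go standard library; a real setup would export them to whatever metrics system you run, and the metric names here are assumptions.

```go
package main

import (
	"expvar"
	"log"
	"net/http"
	"time"
)

// Three simple counters: request rate and error rate fall out of the counts
// over time, and total duration lets you derive a rough average latency.
var (
	requests   = expvar.NewInt("requests_total")
	errors     = expvar.NewInt("errors_total")
	durationMs = expvar.NewInt("duration_ms_total")
)

// statusRecorder lets the middleware see the status code the handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		requests.Add(1)
		durationMs.Add(time.Since(start).Milliseconds())
		if rec.status >= 500 {
			errors.Add(1)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })
	// expvar exposes the counters as JSON on /debug/vars of the default mux.
	http.Handle("/", instrument(mux))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```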
  98. @sarahjwells You’ll always be migrating *something*

  99. @sarahjwells Doing anything 150 times is painful

  100. @sarahjwells Deployment pipelines need to be templated

  101. @sarahjwells Use a service mesh

  102. @sarahjwells You’ll have services that haven’t been released for years

  103. @sarahjwells Build everything overnight?

  104. @sarahjwells Optimising for speed Operating microservices When people move on

  105. @sarahjwells Every system must be owned

  106. @sarahjwells If you won’t invest enough to keep it running properly, shut it down
  107. @sarahjwells Keeping documentation up to date is a challenge

  108. @sarahjwells We started with a searchable runbook library

  109. @sarahjwells System codes are very helpful

  110. @sarahjwells We needed to represent this stuff as a graph

  111. None
  112. None
  113. None
  114. @sarahjwells Helps if you can give people something in return

  115. None
  116. None
  117. @sarahjwells runbooks.md

  118. None
  119. None
  120. @sarahjwells Practice

  121. “If it hurts, do it more frequently, and bring the pain forward.”
  122. @sarahjwells Failovers, database restores

  123. @sarahjwells Chaos engineering https://principlesofchaos.org/

  124. @sarahjwells Understand your steady state; look at what you can change - minimise the blast radius; work out what you expect to see happen; run the experiment and see if you were right
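
A minimal sketch of the shape of such an experiment, with placeholder functions standing in for real measurement and fault injection; it is an illustration of the steps on this slide, not tooling from the talk.

```go
package main

import "fmt"

// measureSuccessRate, injectFault and stopFault are placeholders for real
// measurements and a real, small-blast-radius fault (e.g. killing one instance).
func measureSuccessRate() float64 { return 0.999 }
func injectFault()                {}
func stopFault()                  {}

func main() {
	// 1. Understand your steady state.
	baseline := measureSuccessRate()

	// 2. Work out what you expect to see happen: the hypothesis to test.
	const tolerated = 0.01 // we expect no more than a 1% drop

	// 3. Run the experiment with the smallest blast radius you can.
	injectFault()
	during := measureSuccessRate()
	stopFault()

	// 4. See if you were right.
	if baseline-during > tolerated {
		fmt.Printf("hypothesis failed: success rate fell from %.3f to %.3f\n", baseline, during)
	} else {
		fmt.Printf("system absorbed the fault: %.3f vs %.3f\n", baseline, during)
	}
}
```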
  125. @sarahjwells Wrapping up…

  126. @sarahjwells Building and operating microservices is hard work

  127. @sarahjwells You have to maintain knowledge of services that are live
  128. @sarahjwells Plan now for the future of legacy microservices

  129. @sarahjwells Remember: it’s all about the business value of moving fast
  130. @sarahjwells Thank you!