QCon London 2019: Mature microservices and how to operate them

At the Financial Times, we built our first microservices in 2013. We like a microservices-based approach because breaking the system up into lots of independently deployable services - making releases small, quick and reversible - lets us deliver more value, more quickly, to our customers and run hundreds of experiments a year.

This approach has had a big - and positive - impact on our culture. However, it is much more challenging to operate.

So how do we go about building stable, resilient systems from microservices? And how do we make sure we can fix any problems as quickly as possible?

I'll talk about building the necessary operational capabilities in from the start: how monitoring can tell you when something has gone wrong, and how observability tools like log aggregation, tracing and metrics can help you fix it as quickly as possible.
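As a concrete illustration of that point (this is not code from the talk): a minimal Go sketch of structured, correlated request logging, the kind of per-request output that makes log aggregation and cross-service tracing workable. The "X-Request-Id" header, the JSON field names and the /__health path are assumptions made for the example, not the FT's actual conventions.

```go
// Minimal sketch: emit one JSON log line per request, carrying a transaction
// ID so events for the same user action can be found across services.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

func withRequestLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reuse an upstream ID if one arrived, otherwise mint one.
		txid := r.Header.Get("X-Request-Id")
		if txid == "" {
			b := make([]byte, 8)
			rand.Read(b)
			txid = hex.EncodeToString(b)
		}
		start := time.Now()
		next.ServeHTTP(w, r)

		// One line per request; a log aggregator can index these fields.
		json.NewEncoder(os.Stdout).Encode(map[string]interface{}{
			"time":           start.UTC().Format(time.RFC3339Nano),
			"transaction_id": txid,
			"method":         r.Method,
			"path":           r.URL.Path,
			"duration_ms":    time.Since(start).Milliseconds(),
		})
	})
}

func main() {
	mux := http.NewServeMux()
	// A simple health endpoint gives basic monitoring something to poll.
	mux.HandleFunc("/__health", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("OK"))
	})
	log.Fatal(http.ListenAndServe(":8080", withRequestLogging(mux)))
}
```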

We've also now been building microservice architectures for long enough to start hitting a whole new set of problems. Projects finish and teams move on to another part of the system, or maybe an entirely new system. So how do we reduce the risk of big issues once the team gets smaller and there start to be services that no one on the team has ever touched?

The next legacy systems are going to be microservices, not monoliths, and you need to be working now to prevent that causing a lot of pain in the future.

Sarah Wells

March 04, 2019

Transcript

  1. Mature microservices and how to operate them Sarah Wells Technical

    Director for Operations & Reliability, The Financial Times @sarahjwells
  2. None
  3. @sarahjwells https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

  4. @sarahjwells https://www.ft.com/companies

  5. @sarahjwells Problem: we’d set up a redirect to a page

    which didn’t exist
  6. @sarahjwells We weren’t sure how to fix the data via

    the url management tool
  7. None
  8. None
  9. @sarahjwells We got it fixed

  10. @sarahjwells Polyglot architectures are great - until you need to

    work out how *this* database is backed up
  11. None
  12. None
  13. None
  14. @sarahjwells Microservices are more complicated to operate and maintain

  15. @sarahjwells Why bother?

  16. None
  17. None
  18. @sarahjwells “Experiment” for most organizations really means “try” (Linda Rising, Experiments: the Good, the Bad and the Beautiful)
  19. Overlap tests by componentising the barrier

  20. @sarahjwells Releasing changes frequently doesn’t just ‘happen’

  21. @sarahjwells Done right, microservices enable this

  22. @sarahjwells The team that builds the system *has* to operate

    it too
  23. @sarahjwells What happens when teams move on to new projects?

  24. @sarahjwells Your next legacy system will be microservices not a

    monolith
  25. @sarahjwells Optimising for speed | Operating microservices | When people move on

  26. @sarahjwells Optimising for speed

  27. None
  28. Measure | High performers: Delivery lead time
  29. Measure | High performers: Delivery lead time | Less than one hour

    “How long would it take you to release a single line of code to production?”
  30. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency
  31. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency | On demand
  32. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency | On demand; Time to restore service
  33. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency | On demand; Time to restore service | Less than one hour
  34. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency | On demand; Time to restore service | Less than one hour; Change fail rate
  35. Measure | High performers: Delivery lead time | Less than one hour; Deployment frequency | On demand; Time to restore service | Less than one hour; Change fail rate | 0 - 15%
  36. @sarahjwells High performing organisations release changes frequently

  37. @sarahjwells Continuous delivery is the foundation

  38. “If it hurts, do it more frequently, and bring the

    pain forward.”
  39. @sarahjwells Our old build and deployment process was very manual…

  40. None
  41. @sarahjwells You can’t experiment when you do 12 releases a

    year
  42. @sarahjwells 1. An automated build and release pipeline

  43. @sarahjwells 2. Automated testing, integrated into the pipeline

  44. @sarahjwells 3. Continuous integration

  45. @sarahjwells If you aren’t releasing multiple times a day, consider

    what is stopping you
  46. @sarahjwells You’ll probably have to change the way you architect

    things
  47. @sarahjwells Zero downtime deployments: - sequential deployments - schemaless databases
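As an aside (not from the slides themselves), one building block of zero-downtime sequential deployments is an instance that drains gracefully while its replacement comes up. A minimal Go sketch, assuming the platform sends SIGTERM during a rolling deploy and the router polls a simple good-to-go endpoint; the /__gtg path and the 30-second budget are invented for the example.

```go
// Minimal sketch: stop accepting new connections on SIGTERM but let in-flight
// requests finish, so a sequential deploy never drops traffic.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/__gtg", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("OK")) // "good to go" check the router can poll before sending traffic here
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
		<-stop

		// Give in-flight requests up to 30 seconds to complete.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("shutdown: %v", err)
		}
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```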

  48. @sarahjwells In-hours releases mean the people who can help

    are there
  49. @sarahjwells You need to be able to test and deploy

    your changes independently
  50. @sarahjwells You need systems - and teams - to be

    loosely coupled
  51. @sarahjwells Done right, microservices are loosely coupled

  52. @sarahjwells Processes also have to change

  53. @sarahjwells Often there is ‘process theatre’ around things and this

    can safely be removed
  54. @sarahjwells Change approval boards don’t reduce the chance of failure

  55. @sarahjwells Filling out a form for each change takes too

    long
  56. @sarahjwells How fast are we moving?

  57. None
  58. None
  59. @sarahjwells Releasing 250 times as often

  60. @sarahjwells Changes are small, easy to understand, independent and reversible

  61. <1% failure rate vs ~16% failure rate

  62. @sarahjwells Optimising for speed | Operating microservices

  63. None
  64. @sarahjwells There are patterns and approaches that help

  65. @sarahjwells DevOps is essential for success

  66. @sarahjwells You can’t hand things off to another team when

    they change multiple times a day
  67. @sarahjwells High performing teams get to make their own decisions

    about tools and technology
  68. @sarahjwells Delegating tool choice to teams makes it hard for

    central teams to support everything
  69. @sarahjwells Make it someone else’s problem

  70. https://medium.com/wardleymaps

  71. @sarahjwells Buy rather than build, unless it’s critical to your

    business
  72. @sarahjwells Work out what level of risk you’re comfortable with

  73. @sarahjwells “We’re not a hospital or a power station”

  74. @sarahjwells We value releasing often so we can experiment frequently

  75. @sarahjwells Accept that you will generally be in a state

    of ‘grey failure’
  76. None
  77. @sarahjwells Retry on failure: - backoff before retrying - give

    up if it’s taking too long
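A minimal Go sketch of the pattern on this slide: back off (with jitter) between attempts and give up once an overall deadline has passed. The function name, attempt limit and timings are assumptions for illustration, not values from the talk.

```go
// Minimal sketch: retry with exponential backoff and jitter, bounded by a
// context deadline so a struggling dependency can't hold us up forever.
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func retry(ctx context.Context, maxAttempts int, fn func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			// Give up: the overall time budget has been spent.
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		case <-time.After(backoff + time.Duration(rand.Int63n(int64(backoff)))):
			// Backed off before retrying; jitter avoids synchronised retries.
		}
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	err := retry(ctx, 5, func(ctx context.Context) error {
		return errors.New("downstream unavailable") // stand-in for a call to another service
	})
	fmt.Println(err)
}
```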
  78. @sarahjwells Mitigate now, fix tomorrow

  79. @sarahjwells How do you know something’s wrong?

  80. @sarahjwells Concentrate on the business capabilities

  81. @sarahjwells Synthetic monitoring

  82. None
  83. None
  84. None
  85. None
  86. @sarahjwells No data fixtures required

  87. @sarahjwells Also helps us know things are broken even if

    no user is currently doing anything
  88. @sarahjwells Make sure you know whether *real* things are working

    in production
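To make slides 81-88 concrete, here is a deliberately simplified sketch of a synthetic check in Go: exercise a journey on a timer and report pass/fail, so breakage shows up even when no user is active. The URL, interval and expected text are placeholders; as the surrounding slides suggest, a production check would exercise a real business capability (such as publishing) end to end rather than a single read.

```go
// Minimal sketch: a synthetic check that runs every minute and logs a failure
// that an alerting system could pick up.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
	"time"
)

func checkHomepage(client *http.Client) error {
	resp, err := client.Get("https://example.com/") // placeholder journey
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return err
	}
	if !strings.Contains(string(body), "Example Domain") {
		return fmt.Errorf("expected content missing")
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for range time.Tick(time.Minute) {
		if err := checkHomepage(client); err != nil {
			log.Printf("SYNTHETIC CHECK FAILED: %v", err) // feed this into alerting
		} else {
			log.Printf("synthetic check passed")
		}
	}
}
```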
  89. @sarahjwells Our editorial team is inventive

  90. @sarahjwells What does it mean for a publish to be

    ‘successful’?
  91. None
  92. None
  93. None
  94. None
  95. @sarahjwells Build observability into your system

  96. @sarahjwells Observability: can you infer what’s going on in the

    system by looking at its external outputs?
  97. @sarahjwells Log aggregation

  98. None
  99. @sarahjwells Metrics

  100. @sarahjwells Keep it simple: - request rate - latency -

    error rate
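A minimal Go sketch of those three numbers, using only the standard library's expvar package as a stand-in for whatever metrics system you actually run. The metric names and the "status 500 and above counts as an error" rule are assumptions for the example.

```go
// Minimal sketch: count requests and errors and accumulate latency per
// request, exposing the totals on /debug/vars.
package main

import (
	"expvar"
	"log"
	"net/http"
	"time"
)

var (
	requests  = expvar.NewInt("http_requests_total")
	errCount  = expvar.NewInt("http_errors_total")
	latencyMs = expvar.NewInt("http_latency_ms_total") // divide by requests for a rough mean
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)

		requests.Add(1)
		latencyMs.Add(time.Since(start).Milliseconds())
		if rec.status >= 500 {
			errCount.Add(1)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})
	mux.Handle("/debug/vars", expvar.Handler()) // simple metrics endpoint

	log.Fatal(http.ListenAndServe(":8080", instrument(mux)))
}
```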
  101. @sarahjwells You’ll always be migrating *something*

  102. @sarahjwells Doing anything 150 times is painful

  103. @sarahjwells Deployment pipelines need to be templated

  104. @sarahjwells Use a service mesh

  105. @sarahjwells You’ll have services that haven’t been released for years

  106. @sarahjwells But you don’t want to find out your service

    can’t be released when you most need to do it
  107. @sarahjwells Build everything overnight?

  108. @sarahjwells Optimising for speed | Operating microservices | When people move on

  109. @sarahjwells Every system must be owned

  110. @sarahjwells If you won’t invest enough to keep it running

    properly, shut it down
  111. @sarahjwells Keeping documentation up to date is a challenge

  112. @sarahjwells We started with a searchable runbook library

  113. None
  114. @sarahjwells System codes are very helpful

  115. @sarahjwells We needed to represent this stuff as a graph

  116. None
  117. None
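Purely to illustrate what "representing this stuff as a graph" can look like (the real system behind these slides is not shown in the transcript): a toy Go sketch where systems, teams and people are nodes keyed by codes, and ownership relations are edges. All codes, names and relation labels here are invented.

```go
// Toy sketch: answering "who owns this system code?" by walking a tiny graph
// instead of a stale wiki page.
package main

import "fmt"

type Node struct {
	Kind string // "System", "Team" or "Person"
	Code string // e.g. a system code like "content-api"
	Name string
}

type Edge struct {
	From, To *Node
	Relation string // e.g. "OWNED_BY", "TECH_LEAD_IS"
}

func main() {
	api := &Node{Kind: "System", Code: "content-api", Name: "Content API"}
	team := &Node{Kind: "Team", Code: "content-platform", Name: "Content Platform"}
	lead := &Node{Kind: "Person", Code: "jdoe", Name: "J. Doe"}

	edges := []Edge{
		{From: api, To: team, Relation: "OWNED_BY"},
		{From: team, To: lead, Relation: "TECH_LEAD_IS"},
	}

	for _, e := range edges {
		fmt.Printf("%s (%s) -[%s]-> %s (%s)\n",
			e.From.Name, e.From.Code, e.Relation, e.To.Name, e.To.Code)
	}
}
```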
  118. @sarahjwells Helps if you can give people something in return

  119. None
  120. None
  121. @sarahjwells Practice

  122. “If it hurts, do it more frequently, and bring the

    pain forward.”
  123. @sarahjwells Failovers, database restores

  124. @sarahjwells Chaos engineering https://principlesofchaos.org/

  125. @sarahjwells Understand your steady state. Look at what you can change - minimise the blast radius. Work out what you expect to see happen. Run the experiment and see if you were right.
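A hedged skeleton of that experiment loop, assuming Go and a health endpoint to sample; the URL, sample sizes and the 5% threshold are invented, and the actual failure injection (for example, stopping one instance) happens outside this script.

```go
// Skeleton of a chaos experiment: capture a steady state, inject a failure
// with a small blast radius, state a hypothesis, then check it.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"time"
)

// successRate samples an endpoint n times, a second apart, and returns the
// fraction of 200 responses.
func successRate(url string, n int) float64 {
	client := &http.Client{Timeout: 5 * time.Second}
	ok := 0
	for i := 0; i < n; i++ {
		if resp, err := client.Get(url); err == nil {
			if resp.StatusCode == http.StatusOK {
				ok++
			}
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
	return float64(ok) / float64(n)
}

func main() {
	const url = "https://example.com/__health" // placeholder health endpoint

	// 1. Understand your steady state.
	steady := successRate(url, 30)
	fmt.Printf("steady state success rate: %.2f\n", steady)

	// 2. Minimise the blast radius: inject the failure against a single
	//    instance, out of band, then continue.
	fmt.Println("inject the failure now, then press enter")
	bufio.NewReader(os.Stdin).ReadString('\n')

	// 3. Hypothesis: the success rate stays within 5% of steady state
	//    because traffic fails over.
	during := successRate(url, 30)
	fmt.Printf("success rate during experiment: %.2f\n", during)

	// 4. See if you were right.
	if steady-during <= 0.05 {
		fmt.Println("hypothesis held: the system tolerated the failure")
	} else {
		fmt.Println("hypothesis failed: investigate before widening the blast radius")
	}
}
```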
  126. @sarahjwells Wrapping up…

  127. @sarahjwells Building and operating microservices is hard work

  128. @sarahjwells You have to maintain knowledge of services that are

    live
  129. @sarahjwells Plan now for the future of legacy microservices

  130. @sarahjwells Remember: it’s all about the business value of moving

    fast
  131. @sarahjwells Thank you!