
RebelCon 2019: Mature microservices and how to operate them

At the Financial Times, we built our first microservices in 2013. We like a microservices-based approach because breaking up the system into lots of independently deployable services - making releases small, quick and reversible - lets us deliver more value, more quickly, to our customers and run hundreds of experiments a year.

This approach has had a big - and positive - impact on our culture. However, it is much more challenging to operate.

So how do we go about building stable, resilient systems from microservices? And how do we make sure we can fix any problems as quickly as possible?

I'll talk about building the necessary operational capabilities in from the start: how monitoring can help you work out when something has gone wrong, and how observability tools like log aggregation, tracing and metrics can help you fix it as quickly as possible.
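
To make that concrete, here is a minimal sketch of the kind of health-check endpoint a service might expose so monitoring can spot when something has gone wrong. The /__health and /__gtg paths, the ports and the downstream dependency are all assumptions made for the example, not a description of the FT's actual setup:

```go
// A minimal health-check endpoint (sketch). Paths, ports and the
// downstream dependency below are hypothetical.
package main

import (
	"encoding/json"
	"net/http"
	"time"
)

type check struct {
	Name string `json:"name"`
	OK   bool   `json:"ok"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Probe a downstream dependency with a short timeout so the health
	// check itself can never hang.
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://localhost:8081/__gtg") // hypothetical dependency
	ok := err == nil && resp.StatusCode == http.StatusOK
	if resp != nil {
		resp.Body.Close()
	}

	w.Header().Set("Content-Type", "application/json")
	if !ok {
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	json.NewEncoder(w).Encode([]check{{Name: "downstream-dependency", OK: ok}})
}

func main() {
	http.HandleFunc("/__health", healthHandler)
	http.ListenAndServe(":8080", nil)
}
```

A monitoring system can then poll /__health across every service and alert on anything returning a non-200 response.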

We've also now been building microservice architectures for long enough to start to hit a whole new set of problems. Projects finish and teams move on to another part of the system, or maybe an entirely new system. So how do we reduce the risk of big issues happening once the team gets smaller and there start to be services that no one on the team has ever touched?

The next legacy systems are going to be microservices, not monoliths, and you need to be working now to prevent that from causing a lot of pain in the future.

Sarah Wells

June 20, 2019


Transcript

  1. Mature microservices and how to operate them
     Sarah Wells, Technical Director for Operations & Reliability, The Financial Times
     @sarahjwells
  2. @sarahjwells Using the right tool for the job is great - until you need to work out how *this* database is backed up
  3.-9. Measure (the table below is built up one row per slide)
     Measure                  | High performers
     Delivery lead time       | Less than one hour
     Deployment frequency     | On demand
     Time to restore service  | Less than one hour
     Change fail rate         | 0 - 15%
     data from Accelerate: Forsgren, Humble, Kim
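
As an illustration of how such measures might be derived in practice, here is a sketch that computes mean delivery lead time and change fail rate from a service's deployment records; the Deployment shape and the sample data are hypothetical:

```go
// Sketch: deriving two of the four Accelerate measures from deployment
// records. The Deployment struct and sample data are hypothetical.
package main

import (
	"fmt"
	"time"
)

type Deployment struct {
	CommitTime time.Time // when the change was committed
	DeployTime time.Time // when it reached production
	Failed     bool      // did it cause a failure in production?
}

func main() {
	now := time.Now()
	deploys := []Deployment{
		{now.Add(-3 * time.Hour), now.Add(-2*time.Hour - 20*time.Minute), false},
		{now.Add(-2 * time.Hour), now.Add(-90 * time.Minute), true},
		{now.Add(-1 * time.Hour), now.Add(-30 * time.Minute), false},
	}

	var totalLead time.Duration
	failed := 0
	for _, d := range deploys {
		totalLead += d.DeployTime.Sub(d.CommitTime)
		if d.Failed {
			failed++
		}
	}
	fmt.Printf("mean delivery lead time: %v\n", totalLead/time.Duration(len(deploys)))
	fmt.Printf("change fail rate: %.0f%%\n", 100*float64(failed)/float64(len(deploys)))
}
```
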
  10. @sarahjwells Also helps us know things are broken even if no user is currently doing anything
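
One way to know things are broken without any real traffic is a synthetic check that exercises a user-facing endpoint on a schedule. A minimal sketch, assuming a hypothetical URL and a 30-second interval:

```go
// Synthetic check (sketch): poll a user-facing endpoint on a schedule so
// failures surface even with no real users. URL and interval are illustrative.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	client := http.Client{Timeout: 5 * time.Second}
	for range time.Tick(30 * time.Second) {
		resp, err := client.Get("https://example.com/homepage") // hypothetical endpoint
		if err != nil {
			log.Printf("ALERT: synthetic check failed: %v", err) // wire this into alerting
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			log.Printf("ALERT: synthetic check returned %d", resp.StatusCode)
			continue
		}
		log.Println("synthetic check OK")
	}
}
```
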
  11. @sarahjwells Observability: can you infer what’s going on in the system by looking at its external outputs?
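
One building block for that kind of observability, mentioned in the abstract as log aggregation, is structured logs carrying a transaction ID so a single request can be followed across every service it touches. A sketch, assuming the common X-Request-Id header convention rather than any FT-specific one:

```go
// Structured JSON logs with a propagated transaction ID (sketch).
// The X-Request-Id header is a common convention, assumed here.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"os"
)

// newID makes a random ID; in a real system the edge usually assigns one.
func newID() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	// JSON logs are easy for a log aggregator to index and search.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	http.HandleFunc("/content", func(w http.ResponseWriter, r *http.Request) {
		// Reuse the caller's transaction ID if present, so one request
		// can be correlated across services in the aggregated logs.
		txid := r.Header.Get("X-Request-Id")
		if txid == "" {
			txid = newID()
		}
		logger.Info("handling request", "transaction_id", txid, "path", r.URL.Path)
		// ...pass txid along on any outbound calls made from here.
		w.Header().Set("X-Request-Id", txid)
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", nil)
}
```
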
  12. @sarahjwells Understand your steady state. Look at what you can change - minimise the blast radius. Work out what you expect to see happen. Run the experiment and see if you were right.
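
Those four steps (the classic chaos-engineering experiment loop) translate directly into code. A sketch with hypothetical stand-ins for the metric query and the fault injection:

```go
// Chaos experiment loop (sketch). errorRate and killOneInstance are
// hypothetical stand-ins for a metrics query and a fault injection.
package main

import "fmt"

// errorRate stands in for a real query against your metrics store.
func errorRate() float64 { return 0.2 } // hypothetical: 0.2% of requests failing

// killOneInstance stands in for a real fault injection, deliberately scoped
// to a single instance to minimise the blast radius.
func killOneInstance(service string) { fmt.Println("killing one instance of", service) }

func main() {
	steady := errorRate()          // 1. understand your steady state
	killOneInstance("content-api") // 2. change one small thing: minimise the blast radius
	const tolerance = 1.0          // 3. expectation: error rate stays within 1 percentage point
	after := errorRate()

	// 4. run the experiment and see if you were right.
	if after-steady <= tolerance {
		fmt.Println("hypothesis held: the system absorbed the failure")
	} else {
		fmt.Printf("hypothesis failed: error rate rose from %.1f%% to %.1f%%\n", steady, after)
	}
}
```
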