QCon London 2019: Mature microservices and how to operate them

At the Financial Times, we built our first microservices in 2013. We like a microservices-based approach, because by breaking up the system into lots of independently deployable services - making releases small, quick and reversible - we can deliver more value, more quickly, to our customers and we can run hundreds of experiments a year.

This approach has had a big - and positive - impact on our culture. However, it is much more challenging to operate.

So how do we go about building stable, resilient systems from microservices? And how do we make sure we can fix any problems as quickly as possible?

I'll talk about building necessary operational capabilities in from the start: how monitoring can help you work out when something has gone wrong and how observability tools like log aggregation, tracing and metrics can help you fix it as quickly as possible.

We've also now been building microservice architectures for long enough to start hitting a whole new set of problems. Projects finish and teams move on to another part of the system, or maybe an entirely new system. So how do we reduce the risk of big issues happening once the team gets smaller and there start to be services that no-one in the team has ever touched?

The next legacy systems are going to be microservices, not monoliths, and you need to be working now to prevent that causing a lot of pain in the future.

Sarah Wells

March 04, 2019

Transcript

  1. Mature microservices and how to operate them. Sarah Wells, Technical Director for Operations & Reliability, The Financial Times. @sarahjwells
  2. @sarahjwells Polyglot architectures are great - until you need to work out how *this* database is backed up
  3. Measure | High performers
    Delivery lead time | Less than one hour
    “How long would it take you to release a single line of code to production?”
  4. Measure | High performers
    Delivery lead time | Less than one hour
    Deployment frequency | On demand
    Time to restore service |
  5. Measure | High performers
    Delivery lead time | Less than one hour
    Deployment frequency | On demand
    Time to restore service | Less than one hour
  6. Measure | High performers
    Delivery lead time | Less than one hour
    Deployment frequency | On demand
    Time to restore service | Less than one hour
    Change fail rate |
  7. Measure | High performers
    Delivery lead time | Less than one hour
    Deployment frequency | On demand
    Time to restore service | Less than one hour
    Change fail rate | 0 - 15%
  8. @sarahjwells Also helps us know things are broken even if no user is currently doing anything
  9. @sarahjwells Observability: can you infer what’s going on in the system by looking at its external outputs?
  10. @sarahjwells But you don’t want to find out your service can’t be released when you most need to do it
  11. @sarahjwells
    Understand your steady state
    Look at what you can change - minimise the blast radius
    Work out what you expect to see happen
    Run the experiment and see if you were right
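
The four steps on the last slide (steady state, a small change, an expectation, then the experiment) can be sketched in code. This is a minimal illustration only, not the tooling used at the Financial Times: the names `run_experiment`, `ExperimentResult`, the `tolerance` parameter and the simulated three-instance service are all assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class ExperimentResult:
    """Outcome of one experiment (hypothetical structure for this sketch)."""
    steady_state: float   # baseline value of the metric
    observed: float       # value of the metric while the failure was active
    tolerance: float      # how far from steady state we expected to drift

    @property
    def hypothesis_held(self) -> bool:
        # The expectation here is "the metric stays within tolerance".
        return abs(self.observed - self.steady_state) <= self.tolerance


def run_experiment(
    baseline_samples: Sequence[float],
    inject_failure: Callable[[], None],
    measure: Callable[[], Sequence[float]],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    # 1. Understand your steady state.
    steady_state = mean(baseline_samples)
    # 2. Make one small change, keeping the blast radius minimal.
    inject_failure()
    try:
        # 3 and 4. Run the experiment and record what actually happened.
        observed = mean(measure())
    finally:
        # Always undo the change, even if measuring fails.
        rollback()
    return ExperimentResult(steady_state, observed, tolerance)


# Simulated service: requests succeed while at least one instance is up.
instances = {"a": True, "b": True, "c": True}


def success_rate_samples() -> list:
    return [1.0 if any(instances.values()) else 0.0 for _ in range(5)]


result = run_experiment(
    baseline_samples=success_rate_samples(),
    inject_failure=lambda: instances.update(a=False),  # take one instance down
    measure=success_rate_samples,
    rollback=lambda: instances.update(a=True),
    tolerance=0.01,
)
print("hypothesis held:", result.hypothesis_held)
```

Taking down a single instance out of three is the "minimise the blast radius" step; the `finally` block restores it whatever the measurement shows.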