
European Test Conference 2019: Quality for 'cloud natives': what changes when your systems are complex and distributed?

Sarah Wells
February 14, 2019

The complexity in complex distributed systems isn’t in the code, it’s between the services or functions. And a lot of failures are hard to predict and maybe even hard to detect.

When your system is made up of multiple microservices or a bunch of lambdas and some queues, how do you test it? How do you even know whether it’s working the way you think it should?

Quality in these systems isn’t so much about testing up front: if you’re releasing 20 times a day, you can’t pay the cost of running full regression tests every time. You need to have a risk-based approach and focus your testing effort on the things where it really matters. And more importantly, you need to be able to quickly find out when things are going wrong, and quickly fix them.

Your production system is the only place the full complexity comes into play, so you should be doing a lot of your quality work there. Make sure you can find out about problems as early as possible and do as much ‘testing’ here as you can.

I talk about the importance of observability of your system - building in log aggregation and tracing so you can tell what’s going on. I also talk about business-focussed monitoring, including synthetic monitoring.
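As an illustration - a minimal sketch, not from the talk, of what a structured, aggregatable log line with a transaction id might look like in Python (the service and field names are made up):

import json
import logging
import uuid

logger = logging.getLogger("content-api")  # hypothetical service name
logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_publish(article_id, transaction_id=None):
    # Reuse the caller's transaction id if one was passed in (for example from
    # an X-Request-Id header), otherwise start a new one. Every service that
    # handles this event logs the same id, so the aggregated logs can be
    # stitched together into a trace of the whole publish.
    transaction_id = transaction_id or str(uuid.uuid4())
    logger.info(json.dumps({
        "event": "PublishStart",
        "transaction_id": transaction_id,
        "article_id": article_id,
        "service": "content-api",
    }))
    # ... do the actual work, then log PublishEnd (or PublishFailed) with the same id
    return transaction_id

Searching the aggregated logs for a single transaction_id then shows the path of one event through every service it touched.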

I hope to show you why it’s worth dealing with the additional complexity of microservices compared with the monolithic approach that came before, and give you some ideas about how to make your complex distributed systems easier to build and run with high quality and stability.


Transcript

1. Quality for 'cloud natives': what changes when your systems are complex and distributed? Sarah Wells, Technical Director for Operations & Reliability, The Financial Times @sarahjwells
2. @sarahjwells The kind of testing you do when you release once a month doesn’t work when you release 10 times a day
3. @sarahjwells A 30 minute code change took 2 weeks to get the acceptance tests working
4. @sarahjwells Also helps us know things are broken even if no user is currently doing anything
5. @sarahjwells Understand your steady state. Look at what you can change - minimise the blast radius. Work out what you expect to see happen. Run the experiment and see if you were right. (A code sketch of these steps follows the transcript.)
6. @sarahjwells Chaos engineering uses the same skills as exploratory testing - “hmm, I wonder what will happen if I do this?”
7. @sarahjwells Focus on delivering maximum value to your users while minimising the times when things are broken or unavailable
8. @sarahjwells Use synthetic monitoring. Use clever monitoring. Make sure logs are aggregated, with tracing of events. Practice things. Chaos engineering IS exploratory testing!
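The four steps on slide 5 translate fairly directly into code. A minimal sketch, assuming a hypothetical health endpoint and a kill_one_instance() helper that you would implement for your own platform:

import urllib.request

HEALTH_URL = "https://www.example.com/__health"  # placeholder health endpoint

def healthy():
    """Measure the steady state: here, simply whether the health check passes."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

def kill_one_instance(service):
    """Minimise the blast radius: take out a single instance of one service.
    How you do this depends on your platform (Kubernetes, ECS, plain VMs...)."""
    raise NotImplementedError

def run_experiment(service):
    assert healthy()                  # 1. understand your steady state
    kill_one_instance(service)        # 2. a small, contained change
    # 3. hypothesis: users should see no impact because other instances take over
    # 4. run the experiment and see if you were right
    print("hypothesis held" if healthy() else "hypothesis failed - investigate")

In practice you would measure something richer than a single health check - error rates, publish latency - but the shape of the experiment stays the same.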
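And slide 8’s synthetic monitoring can start as small as a scheduled probe that exercises a known user journey and records the result. Another hedged sketch - the URL and threshold are placeholders, not anything from the FT:

import time
import urllib.request

CHECK_URL = "https://www.example.com/content/a-known-article"  # placeholder
SLOW_THRESHOLD_SECONDS = 2.0

def synthetic_check():
    """Behave like a user: fetch a known piece of content and time it.
    Failures show up even when no real user happens to be on the site."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=10) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    elapsed = time.monotonic() - started
    if not ok or elapsed > SLOW_THRESHOLD_SECONDS:
        print(f"ALERT: synthetic check failed (ok={ok}, {elapsed:.2f}s)")
    else:
        print(f"OK: synthetic check passed in {elapsed:.2f}s")

# Run this on a schedule - cron, a lambda, or your monitoring platform - and alert on failures.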