In a complex distributed system, the complexity isn’t in the code; it’s in the interactions between the services or functions. And a lot of the failures are hard to predict, and sometimes even hard to detect.
When your system is made up of multiple microservices or a bunch of lambdas and some queues, how do you test it? How do you even know whether it’s working the way you think it should?
Quality in these systems isn’t so much about testing up front: if you’re releasing 20 times a day, you can’t pay the cost of running a full regression suite every time. You need a risk-based approach that focuses your testing effort where it really matters. And more importantly, you need to be able to find out quickly when things are going wrong, and fix them quickly.
Your production system is the only place where the full complexity comes into play, so you should be doing a lot of your quality work there. Make sure you can find out about problems as early as possible, and do as much of your ‘testing’ there as you can.
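To give a flavour of what ‘testing’ in production can look like, here’s a minimal sketch of a synthetic check that exercises one business-critical journey against the live system and fails loudly when it misbehaves. The endpoint, payload, and latency threshold are hypothetical placeholders, not anything specific from the talk.

```python
# A minimal synthetic check: exercise one business-critical journey in
# production and report a failure the moment it stops behaving as expected.
# The URL, payload, and latency threshold below are hypothetical examples.
import sys
import requests

CHECKOUT_URL = "https://api.example.com/checkout/dry-run"  # hypothetical endpoint
MAX_LATENCY_SECONDS = 2.0

def run_check() -> bool:
    try:
        response = requests.post(
            CHECKOUT_URL,
            json={"sku": "TEST-SKU", "quantity": 1},  # synthetic test order
            timeout=5,
        )
    except requests.RequestException as exc:
        print(f"CHECK FAILED: request error: {exc}")
        return False

    if response.status_code != 200:
        print(f"CHECK FAILED: unexpected status {response.status_code}")
        return False

    if response.elapsed.total_seconds() > MAX_LATENCY_SECONDS:
        print(f"CHECK FAILED: slow response ({response.elapsed.total_seconds():.2f}s)")
        return False

    print("CHECK OK")
    return True

if __name__ == "__main__":
    # Run this from a scheduler (cron, a Lambda, a CI job); the non-zero
    # exit code is what your alerting hooks into.
    sys.exit(0 if run_check() else 1)
```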
I talk about the importance of observability: building in log aggregation and distributed tracing so you can tell what your system is actually doing. I also talk about business-focussed monitoring, including synthetic monitoring.
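To make the tracing side concrete, here’s a small sketch using the OpenTelemetry Python SDK, which is one common way to do this. The service name, span names, and attributes are illustrative only; a real deployment would export spans to a tracing backend rather than the console.

```python
# A minimal tracing sketch with the OpenTelemetry Python SDK: each unit of
# work becomes a span, so a single request can be followed across services.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up the SDK with a console exporter; swap in an OTLP exporter to ship
# spans to your tracing backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("order-service")  # illustrative service name

def handle_order(order_id: str) -> None:
    # The outer span covers the whole request; the nested spans show where
    # the time went and which downstream call failed, if any.
    with tracer.start_as_current_span("handle-order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("reserve-stock"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("take-payment"):
            pass  # call the payment service here

if __name__ == "__main__":
    handle_order("order-123")
```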
I hope to show you why it’s worth taking on the additional complexity of microservices compared with the monoliths that came before, and to give you some ideas about how to make your complex distributed systems easier to build and run, with high quality and stability.