European Test Conference 2019: Quality for 'cloud natives': what changes when your systems are complex and distributed?

Sarah Wells
February 14, 2019

The complexity in complex distributed systems isn’t in the code; it’s between the services or functions. Many failures are hard to predict, and some are hard even to detect.

When your system is made up of multiple microservices or a bunch of lambdas and some queues, how do you test it? How do you even know whether it’s working the way you think it should?

Quality in these systems isn’t so much about testing up front: if you’re releasing 20 times a day, you can’t pay the cost of running full regression tests every time. You need to have a risk-based approach and focus your testing effort on the things where it really matters. And more importantly, you need to be able to quickly find out when things are going wrong, and quickly fix them.

Your production system is the only place the full complexity comes into play, so you should be doing a lot of your quality work there. Make sure you can find out about problems as early as possible, and do as much ‘testing’ there as you can.

I talk about the importance of observability of your system - building in log aggregation and tracing so you can tell what’s up. I also talk about business-focussed monitoring, including synthetic monitoring.
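
As a rough illustration of how log aggregation, tracing and synthetic monitoring fit together, here is a minimal Python sketch. It assumes every log line carries a `transaction_id` field and that synthetic requests are tagged with a `SYN_` prefix, as the transaction_id slide in the transcript suggests. The `TID_` prefix, the function names and the log format are hypothetical and are not taken from the talk.

```python
import re
import uuid

# Assumption: synthetic traffic is marked by a transaction ID prefix.
SYNTHETIC_PREFIX = "SYN_"

# Assumption: log lines contain transaction_id="..." somewhere in the text.
_TID_PATTERN = re.compile(r'transaction_id="?([A-Za-z0-9_]+)"?')


def new_transaction_id(synthetic=False):
    """Create an ID to attach to a request and to every log line it causes."""
    token = uuid.uuid4().hex[:8].upper()
    prefix = SYNTHETIC_PREFIX if synthetic else "TID_"  # "TID_" is hypothetical
    return prefix + token


def extract_transaction_id(line):
    """Return the transaction ID found in a log line, or None."""
    match = _TID_PATTERN.search(line)
    return match.group(1) if match else None


def trace(log_lines, transaction_id):
    """Return, in order, the aggregated log lines for one transaction."""
    return [line for line in log_lines
            if extract_transaction_id(line) == transaction_id]


def is_synthetic(line):
    """True if the log line was produced by a synthetic monitoring request."""
    tid = extract_transaction_id(line)
    return tid is not None and tid.startswith(SYNTHETIC_PREFIX)
```

With logs from several services collected in one place, `trace(logs, "SYN_ABC")` would follow a single synthetic publish across service boundaries, and `is_synthetic` lets dashboards separate monitoring traffic from real user activity.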

I hope to show you why it’s worth dealing with the additional complexity of microservices over the earlier monolithic approach, and give you some ideas about how to make your complex distributed systems easier to build and to run with high quality and stability.

Transcript

  1. Quality for 'cloud natives': what changes when your systems are complex and distributed? Sarah Wells, Technical Director for Operations & Reliability, The Financial Times, @sarahjwells
  2. None
  3. None
  4. @sarahjwells “Experiment” for most organizations really means “try” - Linda Rising, Experiments: the Good, the Bad and the Beautiful
  5. @sarahjwells How quickly can you spin up an MVP?

  6. None
  7. @sarahjwells We’re able to do this because we adopted a cloud-native architecture
  8. “microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems” @drsnooks
  9. @sarahjwells Distributed systems fail in new and interesting ways

  10. None
  11. None
  12. None
  13. None
  14. None
  15. None
  16. @sarahjwells We need to change how we approach quality

  17. @sarahjwells We need to test in production

  18. @sarahjwells Cloud native: an introduction; Testing in production; Optimising for fixing things fast
  19. @sarahjwells Cloud native: an introduction

  20. @sarahjwells What IS cloud native?

  21. @sarahjwells It’s definitely about “the cloud”

  22. @sarahjwells Cloud native means building things to benefit from the cloud not just run on it
  23. Infrastructure as a service

  24. Infrastructure as a service, Automation

  25. Infrastructure as a service, Continuous Delivery, Automation

  26. Infrastructure as a service, Microservices, Continuous Delivery, Automation

  27. Infrastructure as a service, Microservices, Containers & Orchestration, Continuous Delivery, Automation
  28. Infrastructure as a service, Microservices, Containers & Orchestration, Software as a Service, Continuous Delivery, Automation
  29. Download at: https://info.container-solutions.com/introduction-to-cloud-native

  30. @sarahjwells Sounds complicated?

  31. @sarahjwells Why adopt it?

  32. @sarahjwells “Cloud native technologies enable software developers to build great products faster” - the CNCF
  33. @sarahjwells Making small releases, quickly and frequently

  34. None
  35. @sarahjwells You can’t experiment when you do 12 releases a year
  36. @sarahjwells Small changes are much easier to understand

  37. The more often you release, the lower your failure rate for those releases
  38. @sarahjwells ~15% failure rate vs < 1% failure rate

  39. @sarahjwells You don’t have to choose between speed and stability

  40. @sarahjwells Why does the focus for testing change?

  41. @sarahjwells The kind of testing you do when you release once a month doesn’t work when you release 10 times a day
  42. None
  43. None
  44. None
  45. @sarahjwells “Not wrong long” Sally Goble https://www.theguardian.com/info/developer-blog/2016/dec/04/perfect-software-the-enemy-of-rapid-deployment

  46. @sarahjwells “We’re not a nuclear power station or a hospital”

  47. @sarahjwells Cloud native: an introduction; Testing in production

  48. @sarahjwells Pre-release testing

  49. @sarahjwells We should still be writing automated tests for the service
  50. Cindy Sridharan: https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16

  51. @sarahjwells Don’t try to regression test the whole system

  52. @sarahjwells Acceptance tests running locally pushes developers towards a ‘full stack on your laptop’
  53. @sarahjwells You end up with a distributed monolith

  54. @sarahjwells Test fixtures can be brittle

  55. None
  56. @sarahjwells A 30 minute code change took 2 weeks to get the acceptance tests working
  57. @sarahjwells Almost all the time, the code was fine, the tests were broken
  58. @sarahjwells Learn from the pain!

  59. @sarahjwells Shifting right?

  60. @sarahjwells Introduce synthetic monitoring

  61. @sarahjwells This replaced our acceptance tests

  62. None
  63. None
  64. None
  65. None
  66. @sarahjwells No data fixtures required

  67. @sarahjwells Also helps us know things are broken even if no user is currently doing anything
  68. @sarahjwells Make sure you know if things are working in production
  69. @sarahjwells Our editorial team is inventive

  70. @sarahjwells What does it mean for a publish to be ‘successful’?
  71. None
  72. None
  73. None
  74. None
  75. @sarahjwells Define a contract

  76. @sarahjwells Contract testing for key interfaces

  77. @sarahjwells Simple documentation is a start

  78. None
  79. created by Matt Hinchliffe (https://github.com/i-like-robots)

  80. None
  81. @sarahjwells Separate releasing code from releasing functionality

  82. @sarahjwells Feature flags

  83. None
  84. @sarahjwells Canary releases

  85. None
  86. Cindy Sridharan: https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1

  87. @sarahjwells Cloud native: an introduction; Testing in production; Optimising for fixing things fast
  88. @sarahjwells Mitigate first

  89. @sarahjwells Make sure you can work out what’s going on

  90. @sarahjwells Log aggregation

  91. None
  92. @sarahjwells transaction_id="SYN_ABC"

  93. @sarahjwells Practice

  94. “If it hurts, do it more frequently, and bring the pain forward.”
  95. @sarahjwells Failovers, database restores

  96. @sarahjwells Chaos engineering https://principlesofchaos.org/

  97. @sarahjwells Understand your steady state; Look at what you can change - minimise the blast radius; Work out what you expect to see happen; Run the experiment and see if you were right
  98. @sarahjwells Use the skills you already have

  99. @sarahjwells Good QAs understand the features of the system

  100. @sarahjwells Chaos engineering uses the same skills as exploratory testing - “hmm, I wonder what will happen if I do this?”
  101. @sarahjwells Work on operational stuff too

  102. @sarahjwells Cloud native: an introduction; Testing in production; Optimising for fixing things fast
  103. @sarahjwells What worked before doesn’t work so well for cloud native
  104. @sarahjwells Focus on delivering maximum value to your users while minimising the times when things are broken or unavailable
  105. @sarahjwells Understand where the QA mindset has the most impact

  106. @sarahjwells Use synthetic monitoring; Use clever monitoring; Make sure logs are aggregated, with tracing of events; Practice things; Chaos engineering IS exploratory testing!
  107. @sarahjwells Thank you!
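
The four chaos engineering steps on slide 97 (understand the steady state, make a small change, state what you expect, run the experiment) can be sketched as a small harness. This is an illustrative Python sketch only: `run_experiment`, `ExperimentResult` and the toy `instances` dictionary are hypothetical names, and a real experiment would act on real infrastructure rather than an in-memory dict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool    # was the system healthy before we touched it?
    hypothesis_held: bool  # did the steady state survive the change?
    rolled_back: bool      # was the change undone afterwards?


def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    # Step 1: understand the steady state; do not experiment on a system
    # that is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, False, False)
    # Step 2: change one small thing. Keeping the blast radius small is the
    # caller's job: `inject` should target e.g. a single instance.
    inject()
    try:
        # Steps 3 and 4: the expectation is that the steady state still
        # holds; run the check and see whether that was right.
        held = steady_state()
    finally:
        # Always undo the change, even if the check itself raises.
        rollback()
    return ExperimentResult(True, held, True)


# Toy usage: two instances of a service, steady state = at least one is up.
instances = {"a": True, "b": True}
result = run_experiment(
    steady_state=lambda: any(instances.values()),
    inject=lambda: instances.update(a=False),   # take one instance down
    rollback=lambda: instances.update(a=True),  # bring it back
)
```

In the toy usage, taking down one of two instances leaves the steady state intact, so the hypothesis holds; taking down both inside `inject` would make it fail.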