An Engineer's Guide to a Good Night's Sleep

An updated version of my talk about practices to avoid that dreaded 3am call.

As organisations look to empower engineers more and embrace DevOps practices, the support role has changed considerably too. Developers are moving from being purely third-line support to working more collaboratively with engineers and operational staff. As we move to cloud-native microservice solutions, the increased complexity and diversity of our production landscape also means operational staff may well rely more heavily on the engineers, particularly out of hours.

I have spent the last 18 years working across a plethora of industries, using a myriad of technologies and approaches. Having worked on everything from trading applications to content enrichment APIs, I have seen many approaches and processes that try to minimise operational support for developers.

In this talk, I will explore some of my top approaches and techniques to help reduce the risk of that dreaded 3am call! You will gain practical insight into how to handle failure in today's more complex distributed microservice systems. This includes approaches to resiliency, understanding your system, understanding the requirements for fault tolerance, and the developer mindset this requires. I will pepper the talk with real-world examples, and the occasional war story along the way too.

Nicky Wrightson

September 16, 2019

Transcript

  1. @nickywrightson An Engineer’s Guide to a Good Night’s Sleep By Nicky Wrightson @nickywrightson
  2. @nickywrightson

  3. @nickywrightson

  4. @nickywrightson

  5. @nickywrightson We are building REALLY complicated distributed systems

  6. @nickywrightson “You need a mature operations team to manage lots of services, which are being redeployed regularly” Martin Fowler https://martinfowler.com/articles/microservice-trade-offs.html
  7. @nickywrightson

  8. @nickywrightson Empowered teams means the team also control the support

  9. @nickywrightson
    2014 Consumers add a caching layer to protect against our outages
    2017 Our services were given an SLA of 15 mins recovery time
    2018 Migration to Kubernetes completed
    2019 Out of hours calls to 3rd line have all but disappeared
  10. @nickywrightson 5 approaches to reduce the risk of being called

  11. @nickywrightson Engineer’s mindset 1

  12. @nickywrightson 1

  13. @nickywrightson Enable teams to own their own support models 1

  14. @nickywrightson Operations Support Team A Support Team B 1

  15. @nickywrightson The team triages issues during the day 1

  16. @nickywrightson Engineers need to think about that out of hours call with every error condition 1
  17. @nickywrightson Design the severity levels within your service 1

  18. @nickywrightson Engineer’s mindset 1

  19. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  20. @nickywrightson Releases during the day should never wake you up at night 2
  21. @nickywrightson Can our deployment times help this? 2

  22. @nickywrightson Quick deployment 2

  23. @nickywrightson Get your deployment system to do automatic rollbacks 2

  24. @nickywrightson VERIFY VERIFY VERIFY 2

  25. @nickywrightson By Cindy Sridharan (@copyconstruct) 2

  26. @nickywrightson 3am batch jobs are a guarantee to get an overnight call at some point 2
  27. @nickywrightson 2

  28. @nickywrightson 2

  29. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  30. @nickywrightson Automate failure recovery where possible 3

  31. @nickywrightson Let your platform recover for you 3

  32. @nickywrightson Applications need to cope with change: Graceful Termination, Transactional, Clean restarts, Stateless, Queue Backed, Idempotent 3
  33. @nickywrightson Multi region automatic system failovers 3

  34. @nickywrightson 3

  35. @nickywrightson Multi region automatic system failovers 3

  36. @nickywrightson Healthchecks and liveness probes may not tell the whole story
  37. @nickywrightson

  38. @nickywrightson Automate failure recovery where possible 3

  39. @nickywrightson Understand what your customers really care about 4

  40. @nickywrightson You want to be the first to know about a critical failure 4
  41. @nickywrightson “Only have alerts that you need to action” Sarah Wells, Director of Operations and Reliability at FT 4
  42. @nickywrightson Synthetic Requests 4

  43. @nickywrightson Use tracing to monitor your critical flows (Ben Sigelman, Restoring Confidence in Microservices: Tracing That's More Than Traces) 4
  44. @nickywrightson 4

  45. @nickywrightson 4

  46. @nickywrightson 4

  47. @nickywrightson 4

  48. @nickywrightson We are now flagging important events close to the code 4
  49. @nickywrightson Understand what your customers really care about 4

  50. @nickywrightson Break things and practice everything 5

  51. @nickywrightson “a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.” 5
  52. @nickywrightson Monolith to microservice timeline 5

  53. @nickywrightson When can we release the chaos monkeys? 5

  54. @nickywrightson Manual simulation of outages works too 5

  55. @nickywrightson Spot the SPOF 5

  56. @nickywrightson Multi region automatic system failovers 5

  57. @nickywrightson Multi region automatic system failovers 5

  58. @nickywrightson Fixing things in hours helps team confidence to support out of hours 5
  59. @nickywrightson Manual intervention should be simple. FIX IT! 5

  60. @nickywrightson 5

  61. @nickywrightson Make sure your alerts have all the relevant information to action the event 5
  62. @nickywrightson Failed requests 5

  63. @nickywrightson At 3am just get the system to limp into hours 5
  64. @nickywrightson Break things and practice everything 5

  65. @nickywrightson Engineer’s mindset 1

  66. @nickywrightson Don’t get called for issues that could have been caught in office hours 2
  67. @nickywrightson Automate failure recovery where possible 3

  68. @nickywrightson Understand what your customers really care about 4

  69. @nickywrightson Break things and practice everything 5

  70. @nickywrightson The engineers are the ones called at 3am. We now own this!
  71. @nickywrightson Thanks! https://speakerdeck.com/nickywrightson https://grnh.se/579803f21

  72. @nickywrightson Resources
    Testing Microservices, the sane way by Cindy Sridharan https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
    Microservices trade offs by Martin Fowler https://martinfowler.com/articles/microservice-trade-offs.html
    Ben Sigelman @ QCon 2019 https://www.infoq.com/presentations/microservices-distributed-tracing?itm_source=infoq&itm_medium=QCon_EarlyAccessVideos&itm_campaign=QConLondon2019
    James Governor on progressive delivery https://redmonk.com/jgovernor/2018/08/06/towards-progressive-delivery/
    Charity Majors on Friday freezes https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/
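Slide 32 names "Queue Backed" and "Idempotent" among the properties that let a platform restart an application without waking anyone. The slides do not show an implementation, so the following is only a minimal sketch of one way to read that idea: a consumer that ignores redelivered messages. The names `Message`, `IdempotentConsumer`, and `message_id` are hypothetical, and the in-memory set stands in for whatever durable store a real service would need in order to survive a restart.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """A queue message; `message_id` is a hypothetical unique delivery key."""
    message_id: str
    amount: int


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, even if the queue redelivers it."""
    total: int = 0
    # Assumption: an in-memory set, used here only to keep the sketch short.
    processed_ids: set = field(default_factory=set)

    def handle(self, message: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message.message_id in self.processed_ids:
            return False  # already applied, so a redelivery changes nothing
        self.total += message.amount
        self.processed_ids.add(message.message_id)
        return True


consumer = IdempotentConsumer()
# The third delivery repeats message "a", as a queue might after a restart.
deliveries = [Message("a", 10), Message("b", 5), Message("a", 10)]
results = [consumer.handle(m) for m in deliveries]
```

With this shape, a platform that kills and restarts the consumer mid-batch can simply replay the queue: duplicates are skipped rather than applied twice.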