An Engineer's Guide to a Good Night's Sleep

As organisations look to empower engineers more and embrace DevOps practices, the support role has changed considerably too. Developers are moving from being purely third-line support to working more collaboratively with engineers and operational staff. And as we move to cloud-native microservice solutions, the increased complexity and diversity of our production landscape means operational staff may well rely more heavily on the engineers, particularly out of hours.

I have spent the last 18 years working across a range of industries, using a wide variety of technologies and approaches. Having worked on everything from trading applications to content enrichment APIs, I have seen many approaches and processes that try to minimise operational support for developers.

In this talk, I will explore some of my top approaches and techniques to help reduce the risk of that dreaded 3am call! You will gain practical insight into how to handle failure in today's more complex distributed microservice systems. This will include looking at approaches to resiliency, understanding your system, understanding the requirements for fault tolerance, and the developer mindset all of this requires. I will pepper the talk with real-world examples, and an occasional war story along the way too.

Nicky Wrightson

March 06, 2019

Transcript

  1. @nickywrightson Martin Fowler: “You need a mature operations team to manage lots of services, which are being redeployed regularly” https://martinfowler.com/articles/microservice-trade-offs.html
  2. 2014: Consumers add a caching layer to protect against our outages. 2017: Our services were given an SLA of 15 mins recovery time. 2018: Migration to Kubernetes completed. 2019: Out-of-hours calls to 3rd line have all but disappeared.
  3. “The quality of a system will appear to be declining unless it is rigorously maintained” Lehman's Laws of Software Evolution, “Declining Quality” (1996)
  4. As a system evolves, its complexity increases unless work is done to maintain or reduce it. Lehman's Laws of Software Evolution cont., “Increasing Complexity” (1974)
  5. “Only have alerts that you need to action” Sarah Wells - Director of Operations and Reliability at FT
  6. Not all services are equal: service that cleans old images from the repo != service that takes payments
  7. “a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.”
  8. Resources: Testing Microservices, the sane way by Cindy Sridharan https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16; Microservices trade offs by Martin Fowler https://martinfowler.com/articles/microservice-trade-offs.html; https://medium.com/netflix-techblog/vizceral-open-source-acc0c32113fe