An Engineer's Guide to a Good Night's Sleep

By Nicky Wrightson (@nickywrightson)

We are building REALLY complicated distributed systems

“You need a mature operations team to manage lots of services, which are being redeployed regularly”
Martin Fowler, https://martinfowler.com/articles/microservice-trade-offs.html

Empowered teams means the team also controls its own support

2014: Consumers add a caching layer to protect against our outages
2017: Our services were given an SLA of 15 mins recovery time
2018: Migration to Kubernetes completed
2019: Out-of-hours calls to 3rd line have all but disappeared

5 approaches to reduce the risk of being called

1. Engineer’s mindset

Enable teams to own their own support models

[Diagram labels: Operations, Support, Team A, Support, Team B]

The team triages issues during the day

Engineers need to think about that out-of-hours call with every error condition

Design the severity levels within your service
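
One way to read this slide is that each error condition gets a severity, and only the most severe ever reaches the out-of-hours pager. The sketch below is illustrative only: the three levels, the office hours, and the names `Severity` and `route_alert` are assumptions, not something shown in the deck.

```python
from datetime import datetime, time
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # customers are affected right now
    WARNING = "warning"    # degraded, but can wait for daytime triage
    INFO = "info"          # worth recording, no action needed


# Hypothetical office hours (Mon-Fri); adjust to the team's working pattern.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 30)


def route_alert(severity: Severity, now: datetime) -> str:
    """Return where an alert goes: 'page', 'chat', 'ticket', or 'log'."""
    in_hours = now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END
    if severity is Severity.CRITICAL:
        # Only critical issues are allowed to wake someone up.
        return "page"
    if severity is Severity.WARNING:
        # In hours, tell the team straight away; out of hours, queue it
        # for the next day's triage instead of calling anyone.
        return "chat" if in_hours else "ticket"
    return "log"
```

Under these assumptions, a warning raised at 3am becomes a ticket for the morning rather than a phone call.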

2. Don’t get called for issues that could have been caught in office hours

Releases during the day should never wake you up at night

Can our deployment times help this?

Quick deployment

Get your deployment system to do automatic rollbacks

VERIFY VERIFY VERIFY
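
The last two slides, automatic rollbacks and repeated verification, can be combined in a small sketch. This is a minimal illustration, not the deck's deployment system: `deploy_with_rollback`, its parameters, and the idea of passing `deploy` and `verify` in as callables are all assumptions made for the example.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],
    verify: Callable[[], bool],
    new_version: str,
    previous_version: str,
    checks: int = 3,
) -> str:
    """Deploy new_version, verify it several times, roll back on failure.

    Returns the version that is left running.
    """
    deploy(new_version)
    for _ in range(checks):
        if not verify():
            # Any failed verification restores the last known good version
            # without waiting for a human to notice.
            deploy(previous_version)
            return previous_version
    return new_version
```

In practice `verify` might run a smoke test against the freshly deployed service, so that a bad daytime release is reverted long before anyone is on call for it.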

By Cindy Sridharan (@copyconstruct)

3am batch jobs are guaranteed to get you an overnight call at some point

3. Automate failure recovery where possible

Let your platform recover for you

Applications need to cope with change: graceful termination, transactional, clean restarts, stateless, queue backed, idempotent
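
Three of the properties on this slide (queue backed, idempotent, graceful termination) can be sketched together in a few lines. This is a simplified illustration under stated assumptions: the `Worker` class, the message shape (`id` and `body` keys), and the in-memory set of seen ids are all hypothetical, and a real service would keep the seen ids in durable storage.

```python
import queue
import signal


class Worker:
    """Queue-backed worker that tolerates redelivery and stops cleanly."""

    def __init__(self, inbox: "queue.Queue[dict]") -> None:
        self.inbox = inbox
        self.seen_ids: set = set()   # assumption: in-memory only, for brevity
        self.processed: list = []
        self.stopping = False

    def request_stop(self, signum=None, frame=None) -> None:
        # Signature matches what signal.signal() expects from a handler.
        self.stopping = True

    def handle(self, message: dict) -> bool:
        """Process a message once; a redelivered id is silently skipped."""
        if message["id"] in self.seen_ids:
            return False
        self.processed.append(message["body"])
        self.seen_ids.add(message["id"])
        return True

    def run(self) -> None:
        # Finish the message in hand, then exit; anything still queued is
        # left for the replacement instance the platform starts.
        while not self.stopping:
            try:
                message = self.inbox.get(timeout=0.1)
            except queue.Empty:
                continue
            self.handle(message)


if __name__ == "__main__":
    worker = Worker(queue.Queue())
    # Platforms such as Kubernetes send SIGTERM before killing a container.
    signal.signal(signal.SIGTERM, worker.request_stop)
    worker.run()
```

Because `handle` is idempotent, the platform can restart or reschedule the worker at any point and redeliver messages without anyone needing to intervene.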

Multi-region automatic system failovers

Healthchecks and liveness probes may not tell the whole story

4. Understand what your customers really care about

You want to be the first to know about a critical failure

“Only have alerts that you need to action”
Sarah Wells, Director of Operations and Reliability at FT

Synthetic Requests
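
A synthetic request exercises a flow that customers care about, on a schedule, so the team hears about a failure before the customers do. The sketch below is a minimal version under stated assumptions: `synthetic_check`, `http_fetch`, the expected-text check, and the two-second budget are all illustrative choices rather than anything described in the deck.

```python
import time
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Fetch a URL and return (status code, body)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


def synthetic_check(
    url: str,
    expected_text: str,
    fetch: Callable[[str], Tuple[int, bytes]] = http_fetch,
    max_seconds: float = 2.0,
) -> dict:
    """Request a critical page the way a customer would and judge the result."""
    started = time.monotonic()
    try:
        status, body = fetch(url)
    except Exception as error:  # connection errors, timeouts, HTTP errors
        return {"ok": False, "reason": f"request failed: {error}"}
    elapsed = time.monotonic() - started
    if status != 200:
        return {"ok": False, "reason": f"unexpected status {status}"}
    if expected_text.encode() not in body:
        # A 200 with the wrong content is still a failure for the customer.
        return {"ok": False, "reason": "expected content missing"}
    if elapsed > max_seconds:
        return {"ok": False, "reason": f"too slow: {elapsed:.2f}s"}
    return {"ok": True, "reason": "ok"}
```

Passing `fetch` in as a parameter keeps the check testable without a network, and checking the body as well as the status covers the case where a healthcheck is green but the page is empty.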

Use tracing to monitor your critical flows
Ben Sigelman, “Restoring Confidence in Microservices: Tracing That's More Than Traces”

We are now flagging important events close to the code

Understand what your customers really care about

5. Break things and practice everything

“a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.”

Monolith to microservice timeline

When can we release the chaos monkeys?

Manual simulation of outages works too

Spot the SPOF

Fixing things in hours builds the team's confidence to support out of hours

Manual intervention should be simple. FIX IT!

Make sure your alerts have all the relevant information to action the event
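
One way to enforce this is to make the context part of the alert's type, so an alert cannot be raised without it. The deck does not say which fields matter; the field names below (`customer_impact`, `runbook_url`, `first_action` and so on) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Alert:
    # Hypothetical set of fields a half-awake engineer might need.
    service: str
    summary: str
    customer_impact: str
    runbook_url: str
    dashboard_url: str
    first_action: str


def missing_context(alert: Alert) -> List[str]:
    """Names of fields left blank; an empty list means the alert is actionable."""
    return [f.name for f in fields(alert) if not getattr(alert, f.name).strip()]


def render(alert: Alert) -> str:
    """Format the alert so everything needed fits in the first screen."""
    return "\n".join([
        f"[{alert.service}] {alert.summary}",
        f"Customer impact: {alert.customer_impact}",
        f"First action: {alert.first_action}",
        f"Runbook: {alert.runbook_url}",
        f"Dashboard: {alert.dashboard_url}",
    ])
```

A check such as `missing_context` could run in code review or at alert-definition time, so gaps are found in office hours rather than at 3am.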

Failed requests

At 3am, just get the system to limp into office hours

Understand what your customers care about? (approach 4)

The engineers are the ones called at 3am. We now own this!

Thanks!
https://speakerdeck.com/nickywrightson
https://grnh.se/579803f21

Resources
Testing Microservices, the sane way, by Cindy Sridharan: https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Microservices trade-offs, by Martin Fowler: https://martinfowler.com/articles/microservice-trade-offs.html
Ben Sigelman @ QCon 2019: https://www.infoq.com/presentations/microservices-distributed-tracing?itm_source=infoq&itm_medium=QCon_EarlyAccessVideos&itm_campaign=QConLondon2019
James Governor on progressive delivery: https://redmonk.com/jgovernor/2018/08/06/towards-progressive-delivery/
Charity Majors on Friday freezes: https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/
